[jira] [Assigned] (SPARK-15959) Add the support of hive.metastore.warehouse.dir back

2016-06-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-15959:


Assignee: Yin Huai

> Add the support of hive.metastore.warehouse.dir back
> 
>
> Key: SPARK-15959
> URL: https://issues.apache.org/jira/browse/SPARK-15959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>  Labels: release_notes, releasenotes
>
> Right now, we do not load the value of this conf at all 
> (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSharedState.scala#L35-L41).
> Let's maintain backward compatibility by loading it when Spark's warehouse 
> conf is not set.






[jira] [Closed] (SPARK-15961) Audit new SQL confs

2016-06-14 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-15961.
-
Resolution: Duplicate

> Audit new SQL confs 
> 
>
> Key: SPARK-15961
> URL: https://issues.apache.org/jira/browse/SPARK-15961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> Check the current SQL configuration names for inconsistencies.






[jira] [Updated] (SPARK-15959) Add the support of hive.metastore.warehouse.dir back

2016-06-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15959:
-
Labels: release_notes releasenotes  (was: )

> Add the support of hive.metastore.warehouse.dir back
> 
>
> Key: SPARK-15959
> URL: https://issues.apache.org/jira/browse/SPARK-15959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>  Labels: release_notes, releasenotes
>
> Right now, we do not load the value of this conf at all 
> (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSharedState.scala#L35-L41).
> Let's maintain backward compatibility by loading it when Spark's warehouse 
> conf is not set.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Jinxia Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331168#comment-15331168
 ] 

Jinxia Liu commented on SPARK-12177:


Thanks Cody!

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility: users will not need 
> to change their existing Spark applications when they upgrade to the new Spark 
> version.
> Please review my changes.






[jira] [Commented] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2016-06-14 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331132#comment-15331132
 ] 

Imran Rashid commented on SPARK-15815:
--

[~SuYan] is this the same as https://issues.apache.org/jira/browse/SPARK-15865 
?  The situation you are describing seems the same, though that doesn't only 
affect Dynamic Allocation.

Perhaps there is something better you can do with dynamic allocation as well, 
but maybe that is a different issue.  Take a look at the latest design doc I 
posted on SPARK-8426 to see if that addresses your concern.

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable executor blacklisting with a blacklist time larger than 120s, and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume only 1 task is left running, on Executor A, and all other executors 
> have timed out.
> 2. The task fails, so it will not be scheduled on Executor A again because of 
> the blacklist time.
> 3. ExecutorAllocationManager keeps requesting targetNumExecutors = 1. Because we 
> already have Executor A, oldTargetNumExecutors == targetNumExecutors = 1, so no 
> more executors are ever added, even after Executor A times out. It ends up 
> endlessly requesting delta = 0 executors.






[jira] [Updated] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2016-06-14 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-15815:
-
Component/s: Scheduler

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable executor blacklisting with a blacklist time larger than 120s, and enable 
> dynamic allocation with minExecutors = 0.
> 1. Assume only 1 task is left running, on Executor A, and all other executors 
> have timed out.
> 2. The task fails, so it will not be scheduled on Executor A again because of 
> the blacklist time.
> 3. ExecutorAllocationManager keeps requesting targetNumExecutors = 1. Because we 
> already have Executor A, oldTargetNumExecutors == targetNumExecutors = 1, so no 
> more executors are ever added, even after Executor A times out. It ends up 
> endlessly requesting delta = 0 executors.






[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-06-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331119#comment-15331119
 ] 

Reynold Xin commented on SPARK-13928:
-

It was never meant to be public (the comment had a note saying it's private). 
You can certainly copy the code out (just a few lines of code) and put it in 
your own project.
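If someone does want to copy it out, a minimal stand-in could look like the sketch below (assumes slf4j on the classpath; this is not Spark's actual implementation, just the general shape of such a trait):

{code:scala}
import org.slf4j.{Logger, LoggerFactory}

// A minimal logging mixin in the spirit of the trait that was made private.
trait Logging {
  // Lazily create a logger named after the concrete class.
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)

  protected def logWarning(msg: => String): Unit =
    if (log.isWarnEnabled) log.warn(msg)

  protected def logError(msg: => String, e: Throwable = null): Unit =
    if (e == null) log.error(msg) else log.error(msg, e)
}
{code}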


> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide a compatibility package that adds 
> logging.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331114#comment-15331114
 ] 

Cody Koeninger commented on SPARK-12177:


[~jinx...@ebay.com] looks like that test had some flaky timing. I cleaned it up 
a bit and it passed 5 times in a row locally.  Will see how it does on Jenkins.

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility: users will not need 
> to change their existing Spark applications when they upgrade to the new Spark 
> version.
> Please review my changes.






[jira] [Created] (SPARK-15961) Audit new SQL confs

2016-06-14 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-15961:
-

 Summary: Audit new SQL confs 
 Key: SPARK-15961
 URL: https://issues.apache.org/jira/browse/SPARK-15961
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Herman van Hovell
Assignee: Herman van Hovell


Check the current SQL configuration names for inconsistencies.






[jira] [Created] (SPARK-15960) Audit new SQL confs

2016-06-14 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-15960:
-

 Summary: Audit new SQL confs 
 Key: SPARK-15960
 URL: https://issues.apache.org/jira/browse/SPARK-15960
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Herman van Hovell
Assignee: Herman van Hovell


Check the current SQL configuration names for inconsistencies.






[jira] [Commented] (SPARK-15824) Run 'with ... insert ... select' failed when use spark thriftserver

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331027#comment-15331027
 ] 

Apache Spark commented on SPARK-15824:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/13678

> Run 'with ... insert ... select' failed when use spark thriftserver
> ---
>
> Key: SPARK-15824
> URL: https://issues.apache.org/jira/browse/SPARK-15824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weizhong
>Priority: Minor
>
> {code:sql}
> create table src(k int, v int);
> create table src_parquet(k int, v int);
> with v as (select 1, 2) insert into table src_parquet from src;
> {code}
> Will throw exception: spark.sql.execution.id is already set.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Jinxia Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331026#comment-15331026
 ] 

Jinxia Liu commented on SPARK-12177:


[~c...@koeninger.org] thanks for the quick reply. 

1. Glad to know you are checking it.

2. The Kafka 0.10 consumer is not difficult to use, I agree, but in most cases 
with the connector the consumer gets assigned partitions rather than subscribing 
to topics, so the connector needs to know all the partitions of a topic; if the 
upstream Kafka topics change, the consumer code has to be changed manually. Maybe 
there are two sides to this issue, but since you are against it, let's keep the 
code as it is now. 



> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility: users will not need 
> to change their existing Spark applications when they upgrade to the new Spark 
> version.
> Please review my changes.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331021#comment-15331021
 ] 

Cody Koeninger commented on SPARK-12177:


[~jinx...@ebay.com]

1. I'm already looking at that test failure, will update once I know what's 
going on.

2.  I'm really strongly against trying to hide the Kafka consumer from users 
for 0.10. I don't want to be in the business of anticipating all the ways 
people will use it, nor the ways it may change.  The 0.10 consumer isn't 
particularly difficult to use; the most basic construction of it is just

val consumer = new KafkaConsumer[String, String](kafkaParams)
consumer.subscribe(topics)

You don't need to know anything about partitionInfo, unless you want/need to.

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility: users will not need 
> to change their existing Spark applications when they upgrade to the new Spark 
> version.
> Please review my changes.






[jira] [Created] (SPARK-15959) Add the support of hive.metastore.warehouse.dir back

2016-06-14 Thread Yin Huai (JIRA)
Yin Huai created SPARK-15959:


 Summary: Add the support of hive.metastore.warehouse.dir back
 Key: SPARK-15959
 URL: https://issues.apache.org/jira/browse/SPARK-15959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Critical


Right now, we do not load the value of this conf at all 
(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSharedState.scala#L35-L41).
Let's maintain backward compatibility by loading it when Spark's warehouse 
conf is not set.
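A rough sketch of the fallback described above (conf key names taken from the description; the actual wiring inside HiveSharedState may differ):

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: prefer spark.sql.warehouse.dir, otherwise fall back to Hive's
// hive.metastore.warehouse.dir, otherwise use a placeholder default.
def resolveWarehousePath(conf: SparkConf, hiveWarehouseDir: Option[String]): String =
  conf.getOption("spark.sql.warehouse.dir")   // Spark's own warehouse conf wins if set
    .orElse(hiveWarehouseDir)                 // value of hive.metastore.warehouse.dir, if any
    .getOrElse("spark-warehouse")             // placeholder default
{code}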






[jira] [Comment Edited] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Jinxia Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331007#comment-15331007
 ] 

Jinxia Liu edited comment on SPARK-12177 at 6/15/16 2:29 AM:
-

[~c...@koeninger.org] thanks for contributing the connectors for Kafka 0.9 and 
Kafka 0.10.
I used your Kafka 0.10 connector and ran into some problems; would you mind 
looking at them?

1. When building with "mvn clean package", a test case in DirectKafkaStreamSuite 
fails:
   offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 196 times 
over 10.031047939 seconds. Last failure message: 55 did not equal 210. 
(DirectKafkaStreamSuite.scala:337)

2. Another problem (with the Kafka 0.9 connector as well): can we add a wrapper, 
something like CreateDirectKafkaStream as in the Kafka 0.8 connector, to wrap up 
the DirectKafkaStream constructor? 

The benefit is that the user does not need to know the Kafka consumer APIs in 
order to use the connector. 

E.g.: the Kafka consumer in the connector gets assigned a collection of 
TopicPartition, in most cases all the partitions for a given topic. Without a 
wrapper, the user needs to use the Kafka consumer API to first retrieve the 
partition info. With a wrapper, the user only needs to provide the topics, and 
such info can be passed to the consumer inside the wrapper without the user's 
knowledge. 



was (Author: jinx...@ebay.com):
[~c...@koeninger.org] thanks for contributing the connector for kafka0.9 and 
kafka0.10.
I used your kafka0.10 connector and ran into some problems, would you mind 
looking at them?

1. when build using "mvn clean package", there is error about not passing the 
test case in DirectKafkaStreamSuite:
   offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 196 times 
over 10.031047939 seconds. Last failure message: 55 did not equal 210. 
(DirectKafkaStreamSuite.scala:337)

2. another problem is(with kafka0.9 connector as well), can we add a wrapper, 
something like CreateDirectKafkaStream in kafka0.8 connector, to wrap up the 
DirectKafkaStream constructor? 

The benefit is that user does not need to know the kafka consumer APIs, in 
order to use the connector. 

E.g.: the kafka consumer in the connector gets assigned a collection of 
TopicPartition, in most cases, all the partitions for given topic, if no 
wrapper, user needs to exploit the kafka consumer API to first retrieve the 
partitionInfo. Using the wrapper, user only needs to provide the topics, and 
such info can be passed to consumer inside the wrapper without the users 
knowledge. 


> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility: users will not need 
> to change their existing Spark applications when they upgrade to the new Spark 
> version.
> Please review my changes.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Jinxia Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331007#comment-15331007
 ] 

Jinxia Liu commented on SPARK-12177:


[~c...@koeninger.org] thanks for contributing the connectors for Kafka 0.9 and 
Kafka 0.10.
I used your Kafka 0.10 connector and ran into some problems; would you mind 
looking at them?

1. When building with "mvn clean package", a test case in DirectKafkaStreamSuite 
fails:
   offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 196 times 
over 10.031047939 seconds. Last failure message: 55 did not equal 210. 
(DirectKafkaStreamSuite.scala:337)

2. Another problem (with the Kafka 0.9 connector as well): can we add a wrapper, 
something like CreateDirectKafkaStream in the Kafka 0.8 connector, to wrap up 
the DirectKafkaStream constructor? 

The benefit is that the user does not need to know the Kafka consumer APIs in 
order to use the connector. 

E.g.: the Kafka consumer in the connector gets assigned a collection of 
TopicPartition, in most cases all the partitions for a given topic. Without a 
wrapper, the user needs to use the Kafka consumer API to first retrieve the 
partition info. With a wrapper, the user only needs to provide the topics, and 
such info can be passed to the consumer inside the wrapper without the user's 
knowledge. 
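A hypothetical helper of the kind being asked for (the method name and shape are made up for illustration; it is not part of the connector) could fetch the partition info on the caller's behalf:

{code:scala}
import scala.collection.JavaConverters._
import java.{util => ju}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Hypothetical wrapper: callers pass only topics and kafkaParams; the helper
// looks up all partitions via the consumer instead of requiring callers to do it.
// Assumes kafkaParams contains the usual consumer configs (deserializers, etc.).
def topicPartitionsFor(
    topics: Set[String],
    kafkaParams: ju.Map[String, Object]): Set[TopicPartition] = {
  val consumer = new KafkaConsumer[String, String](kafkaParams)
  try {
    topics.flatMap { topic =>
      consumer.partitionsFor(topic).asScala
        .map(pi => new TopicPartition(topic, pi.partition()))
    }
  } finally {
    consumer.close()
  }
}
{code}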


> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility: users will not need 
> to change their existing Spark applications when they upgrade to the new Spark 
> version.
> Please review my changes.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Jinxia Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331005#comment-15331005
 ] 

Jinxia Liu commented on SPARK-12177:


[~c...@koeninger.org] thanks for contributing the connectors for Kafka 0.9 and 
Kafka 0.10.
I used your Kafka 0.10 connector and ran into some problems; would you mind 
looking at them?

1. When building with "mvn clean package", a test case in DirectKafkaStreamSuite 
fails:
   offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 196 times 
over 10.031047939 seconds. Last failure message: 55 did not equal 210. 
(DirectKafkaStreamSuite.scala:337)

2. Another problem (with the Kafka 0.9 connector as well): can we add a wrapper, 
something like CreateDirectKafkaStream in the Kafka 0.8 connector, to wrap up 
the DirectKafkaStream constructor? 

The benefit is that the user does not need to know the Kafka consumer APIs in 
order to use the connector. 

E.g.: the Kafka consumer in the connector gets assigned a collection of 
TopicPartition, in most cases all the partitions for a given topic. Without a 
wrapper, the user needs to use the Kafka consumer API to first retrieve the 
partition info. With a wrapper, the user only needs to provide the topics, and 
such info can be passed to the consumer inside the wrapper without the user's 
knowledge. 


> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility: users will not need 
> to change their existing Spark applications when they upgrade to the new Spark 
> version.
> Please review my changes.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331003#comment-15331003
 ] 

Cody Koeninger commented on SPARK-12177:


I had verified basic functionality with a broker set up to require TLS; all 
that's required is setting kafkaParams appropriately.

I'm a little hesitant to claim that's "secure" for anyone's particular purpose, 
though.  E.g. enabling SSL for spark communication (so that things like the 
truststore password in kafkaParams aren't sent in cleartext from the driver) 
would probably be a good idea as well.
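For reference, a minimal set of kafkaParams for a TLS-enabled broker might look like the following sketch (standard Kafka 0.10 consumer config keys; the broker address, paths, and passwords are placeholders):

{code:scala}
// Placeholder values; the keys are standard Kafka consumer configs.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9093",
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> "example-group",
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/path/to/truststore.jks",
  "ssl.truststore.password" -> "changeit"
)
{code}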

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 already released and it introduce new consumer API that not 
> compatible with old one. So, I added new consumer api. I made separate 
> classes in package org.apache.spark.streaming.kafka.v09 with changed API. I 
> didn't remove old classes for more backward compatibility. User will not need 
> to change his old spark applications when he uprgade to new Spark version.
> Please rewiew my changes






[jira] [Resolved] (SPARK-15952) "show databases" does not get sorted result

2016-06-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15952.
-
   Resolution: Fixed
 Assignee: Bo Meng
Fix Version/s: 2.0.0

> "show databases" does not get sorted result
> ---
>
> Key: SPARK-15952
> URL: https://issues.apache.org/jira/browse/SPARK-15952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Bo Meng
>Assignee: Bo Meng
> Fix For: 2.0.0
>
>
> Two issues I've found with the "show databases" command:
> 1. The returned database name list is not sorted; it is only sorted when "like" 
> is used together with it. (Hive will always return a sorted list.)
> 2. When it is used as sql("show databases").show, it will output a table with a 
> column named "result", but sql("show tables").show will output the column name 
> as "tableName", so I think we should be consistent and use "databaseName" at 
> least.
> I will make a PR shortly.






[jira] [Resolved] (SPARK-15945) Implement conversion utils in Scala/Java

2016-06-14 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-15945.
-
Resolution: Fixed

> Implement conversion utils in Scala/Java
> 
>
> Key: SPARK-15945
> URL: https://issues.apache.org/jira/browse/SPARK-15945
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is to provide conversion utils between old/new vector columns in a 
> DataFrame, so users can use them to migrate their datasets and pipelines 
> manually.






[jira] [Resolved] (SPARK-15065) HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky

2016-06-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15065.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky
> -
>
> Key: SPARK-15065
> URL: https://issues.apache.org/jira/browse/SPARK-15065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Priority: Critical
> Fix For: 2.0.0
>
> Attachments: log.txt
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/861/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/dir/
> There are several WARN messages like {{16/05/02 00:51:06 WARN Master: Got 
> status update for unknown executor app-20160502005054-/3}}, which are 
> suspicious. 






[jira] [Commented] (SPARK-15065) HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky

2016-06-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330999#comment-15330999
 ] 

Yin Huai commented on SPARK-15065:
--

Thanks. I took a look at jenkins' history. Looks like it is good now.

> HiveSparkSubmitSuite's "set spark.sql.warehouse.dir" is flaky
> -
>
> Key: SPARK-15065
> URL: https://issues.apache.org/jira/browse/SPARK-15065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Priority: Critical
> Attachments: log.txt
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/861/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/dir/
> There are several WARN messages like {{16/05/02 00:51:06 WARN Master: Got 
> status update for unknown executor app-20160502005054-/3}}, which are 
> suspicious. 






[jira] [Resolved] (SPARK-15631) Dataset and encoder bug fixes

2016-06-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15631.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Dataset and encoder bug fixes
> -
>
> Key: SPARK-15631
> URL: https://issues.apache.org/jira/browse/SPARK-15631
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
> Fix For: 2.0.0
>
>
> This is an umbrella ticket for various Dataset and encoder bug fixes.






[jira] [Resolved] (SPARK-12323) Don't assign default value for non-nullable columns of a Dataset

2016-06-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12323.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Don't assign default value for non-nullable columns of a Dataset
> 
>
> Key: SPARK-12323
> URL: https://issues.apache.org/jira/browse/SPARK-12323
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> For a field of a Dataset, if it's specified as non-nullable in the schema of 
> the Dataset, we shouldn't assign a default value for it if the input data 
> contains null. Instead, a runtime exception with a nice error message should be 
> thrown, asking the user to use {{Option}} or nullable types (e.g., 
> {{java.lang.Integer}} instead of {{scala.Int}}).
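For example, a minimal sketch of the recommended usage (assumes an existing SparkSession named spark; the file name and field names are placeholders):

{code:scala}
// If the input may contain nulls for `age`, declare it as Option[Int]
// (or java.lang.Integer) instead of Int, so no default value has to be
// invented for a missing value.
case class Person(name: String, age: Option[Int])

import spark.implicits._
val people = spark.read.json("people.json").as[Person]
{code}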






[jira] [Resolved] (SPARK-15011) org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelations fails when hadoop 2.3 or hadoop 2.4 is used

2016-06-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15011.
-
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.0.0

> org.apache.spark.sql.hive.StatisticsSuite.analyze MetastoreRelations fails 
> when hadoop 2.3 or hadoop 2.4 is used
> 
>
> Key: SPARK-15011
> URL: https://issues.apache.org/jira/browse/SPARK-15011
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Reporter: Yin Huai
>Assignee: Herman van Hovell
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.0.0
>
>
> Let's disable it first.
> https://spark-tests.appspot.com/tests/org.apache.spark.sql.hive.StatisticsSuite/analyze%20MetastoreRelations






[jira] [Resolved] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-15933.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13653
[https://github.com/apache/spark/pull/13653]

> Refactor reader-writer interface for streaming DFs to use 
> DataStreamReader/Writer
> -
>
> Key: SPARK-15933
> URL: https://issues.apache.org/jira/browse/SPARK-15933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> Currently, DataFrameReader/Writer has methods that are needed for both 
> streaming and non-streaming DFs. This is quite awkward because each of those 
> methods throws a runtime exception for one case or the other. So rather than 
> having half the methods throw runtime exceptions, it's just better to have a 
> different reader/writer API for streams.
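A sketch of what usage looks like with the separate streaming API (assumes an existing SparkSession named spark; formats and paths are placeholders):

{code:scala}
// Streaming reads/writes go through readStream/writeStream rather than read/write.
val input = spark.readStream
  .format("text")
  .load("/tmp/streaming-input")                        // placeholder path

val query = input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoints")    // placeholder path
  .start("/tmp/streaming-output")                      // placeholder path

query.awaitTermination()
{code}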






[jira] [Created] (SPARK-15958) Make initial buffer size for the Sorter configurable

2016-06-14 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-15958:
---

 Summary: Make initial buffer size for the Sorter configurable
 Key: SPARK-15958
 URL: https://issues.apache.org/jira/browse/SPARK-15958
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Sital Kedia


Currently the initial buffer size in the sorter is hard-coded 
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java#L88)
and is too small for large workloads. As a result, the sorter spends 
significant time expanding the buffer and copying the data. It would be 
useful to make it configurable. 
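A hedged sketch of what making it configurable could look like (the config key below is hypothetical, not an existing Spark setting):

{code:scala}
import org.apache.spark.SparkConf

// "spark.sql.sorter.initialBufferSize" is a made-up key for illustration;
// the actual config name, if added, may differ.
val conf = new SparkConf()
val initialBufferSize = conf.getInt("spark.sql.sorter.initialBufferSize", 4096)

// The sorter would then be constructed with `initialBufferSize` instead of the
// hard-coded constant referenced above.
{code}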






[jira] [Updated] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15957:

Assignee: (was: Yanbo Liang)

> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> RFormula currently indexes the label only when it is of string type. If the 
> label is numeric and we use RFormula to produce a classification model, there 
> are no label attributes in the label column metadata. The label attributes are 
> useful when making predictions for classification, so for classification we 
> can force the label to be indexed by {{StringIndexer}} whether it is numeric or 
> string. Then SparkR wrappers can extract label attributes from the label 
> column metadata successfully. This feature can help us fix bugs similar to 
> SPARK-15153.
> For regression, we will still keep the label as numeric.
> In this PR, we add a param indexLabel to control whether to force label 
> indexing for RFormula.
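A sketch of the proposed usage (assumes a DataFrame trainingDF with columns label, f1, f2; the setter shown for the proposed param is a guess, and the final name may differ):

{code:scala}
import org.apache.spark.ml.feature.RFormula

// With the proposed param (called indexLabel in the description above) enabled,
// a numeric label would also be run through StringIndexer, so classification
// metadata ends up on the label column.
val formula = new RFormula()
  .setFormula("label ~ f1 + f2")
  .setFeaturesCol("features")
  .setLabelCol("label")
  // .setForceIndexLabel(true)   // hypothetical setter for the proposed param

val model = formula.fit(trainingDF)
val indexed = model.transform(trainingDF)
{code}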






[jira] [Updated] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15957:

Description: 
RFormula will index label only when it is string type currently. If the label 
is numeric type and we use RFormula to present a classification model, there is 
no label attributes in label column metadata. The label attributes are useful 
when making prediction for classification, so we can force to index label by 
{{StringIndexer}} whether it is numeric or string type for classification. Then 
SparkR wrappers can extract label attributes from label column metadata 
successfully. This feature can help us to fix bug similar with SPARK-15153.
For regression, we will still to keep label as numeric type.
In this PR, we add a param indexLabel to control whether to force to index 
label for RFormula.

  was:
RFormula will index label only when it is string type. If the label is numeric 
type and we use RFormula to present a classification model, we can not extract 
label attributes from the label column metadata successfully. The label 
attributes are useful when make prediction for classification, so we can force 
to index label by {{StringIndexer}} whether it is numeric or string type for 
classification. Then SparkR wrappers can extract label attributes from the 
column metadata successfully. This feature can help us to fix bug similar with 
SPARK-15153.
For regression, we will still to keep label as numeric type.
We should add a param to control whether to force to index label for RFormula.


> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> RFormula currently indexes the label only when it is of string type. If the 
> label is numeric and we use RFormula to produce a classification model, there 
> are no label attributes in the label column metadata. The label attributes are 
> useful when making predictions for classification, so for classification we 
> can force the label to be indexed by {{StringIndexer}} whether it is numeric or 
> string. Then SparkR wrappers can extract label attributes from the label 
> column metadata successfully. This feature can help us fix bugs similar to 
> SPARK-15153.
> For regression, we will still keep the label as numeric.
> In this PR, we add a param indexLabel to control whether to force label 
> indexing for RFormula.






[jira] [Assigned] (SPARK-15956) When unwrapping ORC avoid pattern matching at runtime

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15956:


Assignee: Apache Spark

> When unwrapping ORC avoid pattern matching at runtime
> -
>
> Key: SPARK-15956
> URL: https://issues.apache.org/jira/browse/SPARK-15956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Apache Spark
>Priority: Minor
>
> When unwrapping ORC values, pattern matching for each data value at runtime 
> hurts performance. This should be avoided.
> Instead, we can run pattern matching once and return a function that is 
> subsequently used to unwrap each data value. This is already implemented for 
> certain primitive types. We should implement it for the remaining types, 
> including complex types (e.g., list, map).






[jira] [Assigned] (SPARK-15956) When unwrapping ORC avoid pattern matching at runtime

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15956:


Assignee: Apache Spark

> When unwrapping ORC avoid pattern matching at runtime
> -
>
> Key: SPARK-15956
> URL: https://issues.apache.org/jira/browse/SPARK-15956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Brian Cho
>Assignee: Apache Spark
>Priority: Minor
>
> When unwrapping ORC values, pattern matching for each data value at runtime 
> hurts performance. This should be avoided.
> Instead, we can run pattern matching once and return a function that is 
> subsequently used to unwrap each data value. This is already implemented for 
> certain primitive types. We should implement it for the remaining types, 
> including complex types (e.g., list, map).






[jira] [Assigned] (SPARK-15956) When unwrapping ORC avoid pattern matching at runtime

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15956:


Assignee: (was: Apache Spark)

> When unwrapping ORC avoid pattern matching at runtime
> -
>
> Key: SPARK-15956
> URL: https://issues.apache.org/jira/browse/SPARK-15956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Brian Cho
>Priority: Minor
>
> When unwrapping ORC values, pattern matching for each data value at runtime 
> hurts performance. This should be avoided.
> Instead, we can run pattern matching once and return a function that is 
> subsequently used to unwrap each data value. This is already implemented for 
> certain primitive types. We should implement it for the remaining types, 
> including complex types (e.g., list, map).






[jira] [Commented] (SPARK-15956) When unwrapping ORC avoid pattern matching at runtime

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330946#comment-15330946
 ] 

Apache Spark commented on SPARK-15956:
--

User 'dafrista' has created a pull request for this issue:
https://github.com/apache/spark/pull/13676

> When unwrapping ORC avoid pattern matching at runtime
> -
>
> Key: SPARK-15956
> URL: https://issues.apache.org/jira/browse/SPARK-15956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Brian Cho
>Priority: Minor
>
> When unwrapping ORC values, pattern matching for each data value at runtime 
> hurts performance. This should be avoided.
> Instead, we can run pattern matching once and return a function that is 
> subsequently used to unwrap each data value. This is already implemented for 
> certain primitive types. We should implement it for the remaining types, 
> including complex types (e.g., list, map).
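The idea in schematic form (deliberately simplified types; this is not the actual ORC code path):

{code:scala}
// Match on the column's type once, and return a function that is then applied
// to every value in that column (instead of re-matching per value).
def unwrapperFor(dataType: String): Any => Any = dataType match {
  case "int"    => (v: Any) => v.asInstanceOf[java.lang.Integer].intValue()
  case "string" => (v: Any) => v.toString
  case _        => (v: Any) => v   // complex types (list, map) would compose element unwrappers
}

// Per column:
val unwrap = unwrapperFor("int")
// Per value (hot path, no pattern matching):
val result = unwrap(java.lang.Integer.valueOf(42))
{code}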






[jira] [Commented] (SPARK-3451) spark-submit should support specifying glob wildcards in the --jars CLI option

2016-06-14 Thread Kun Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330945#comment-15330945
 ] 

Kun Liu commented on SPARK-3451:


Vote for this feature. 
If no one else will, I may take this on as my first contribution to the Spark community.

> spark-submit should support specifying glob wildcards in the --jars CLI option
> --
>
> Key: SPARK-3451
> URL: https://issues.apache.org/jira/browse/SPARK-3451
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Submit
>Affects Versions: 1.0.2
>Reporter: wolfgang hoschek
>Priority: Minor
>
> spark-submit should support specifying glob wildcards in the --jars CLI 
> option, e.g. --jars /opt/myapp/*.jar
> This would simplify usage for enterprise customers, for example in 
> combination with being able to specify --jars multiple times as described in 
> https://issues.apache.org/jira/browse/SPARK-3450, like so:
> {code}
> my-spark-submit.sh:
> spark-submit --jars /opt/myapp/*.jar "$@"
> {code}
> Example usage:
> {code}
> my-spark-submit.sh --jars myUserDefinedFunction.jar 
> {code}
> The relevant enhancement code might go into SparkSubmitArguments.






[jira] [Updated] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15957:

Description: 
RFormula will index label only when it is string type. If the label is numeric 
type and we use RFormula to present a classification model, we can not extract 
label attributes from the label column metadata successfully. The label 
attributes are useful when make prediction for classification, so we can force 
to index label by {{StringIndexer}} whether it is numeric or string type for 
classification. Then SparkR wrappers can extract label attributes from the 
column metadata successfully. This feature can help us to fix bug similar with 
SPARK-15153.
For regression, we will still to keep label as numeric type.
We should add a param to control whether to force to index label for RFormula.

  was:
RFormula will index label only when it is string type. If the label is numeric 
type and we use RFormula to present a classification model, we can not extract 
label attributes from the label column metadata successfully. The label 
attributes are useful when make prediction for classification, so we can force 
to index label by {StringIndexer} whether it is numeric or string type for 
classification. Then SparkR wrappers can extract label attributes from the 
column metadata successfully. This feature can help us to fix bug similar with 
SPARK-15153.
For regression, we will still to keep label as numeric type.
We should add a param to control whether to force to index label for RFormula.


> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> RFormula will index label only when it is string type. If the label is 
> numeric type and we use RFormula to present a classification model, we can 
> not extract label attributes from the label column metadata successfully. The 
> label attributes are useful when make prediction for classification, so we 
> can force to index label by {{StringIndexer}} whether it is numeric or string 
> type for classification. Then SparkR wrappers can extract label attributes 
> from the column metadata successfully. This feature can help us to fix bug 
> similar with SPARK-15153.
> For regression, we will still to keep label as numeric type.
> We should add a param to control whether to force to index label for RFormula.






[jira] [Updated] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15957:

Description: 
RFormula will index label only when it is string type. If the label is numeric 
type and we use RFormula to present a classification model, we can not extract 
label attributes from the label column metadata successfully. The label 
attributes are useful when make prediction for classification, so we can force 
to index label by {StringIndexer} whether it is numeric or string type for 
classification. Then SparkR wrappers can extract label attributes from the 
column metadata successfully. This feature can help us to fix bug similar with 
SPARK-15153.
For regression, we will still to keep label as numeric type.
We should add a param to control whether to force to index label for RFormula.

  was:
RFormula will index label only when it is string type. If the label is numeric 
type and we use RFormula to present a classification model, we can not extract 
label attributes from the label column metadata successfully. The label 
attributes are useful, so we can force to index label whether it is numeric or 
string type for classification. Then SparkR wrappers can extract label 
attributes from the column metadata successfully. This feature can help us to 
fix bug similar with SPARK-15153.
For regression, we will still to keep numeric type.
We should add a param to control whether to force to index label for RFormula.


> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> RFormula will index label only when it is string type. If the label is 
> numeric type and we use RFormula to present a classification model, we can 
> not extract label attributes from the label column metadata successfully. The 
> label attributes are useful when make prediction for classification, so we 
> can force to index label by {StringIndexer} whether it is numeric or string 
> type for classification. Then SparkR wrappers can extract label attributes 
> from the column metadata successfully. This feature can help us to fix bug 
> similar with SPARK-15153.
> For regression, we will still to keep label as numeric type.
> We should add a param to control whether to force to index label for RFormula.






[jira] [Updated] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15957:

Description: 
RFormula will index label only when it is string type. If the label is numeric 
type and we use RFormula to present a classification model, we can not extract 
label attributes from the label column metadata successfully. The label 
attributes are useful, so we can force to index label whether it is numeric or 
string type for classification. Then SparkR wrappers can extract label 
attributes from the column metadata successfully. This feature can help us to 
fix bug similar with SPARK-15153.
For regression, we will still to keep numeric type.
We should add a param to control whether to force to index label for RFormula.

  was:Add param to make users can force to index label whether it is numeric or 
string. For classification algorithms, we force to index label by setting it 
with true.


> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> RFormula will index label only when it is string type. If the label is 
> numeric type and we use RFormula to present a classification model, we can 
> not extract label attributes from the label column metadata successfully. The 
> label attributes are useful, so we can force to index label whether it is 
> numeric or string type for classification. Then SparkR wrappers can extract 
> label attributes from the column metadata successfully. This feature can help 
> us to fix bug similar with SPARK-15153.
> For regression, we will still to keep numeric type.
> We should add a param to control whether to force to index label for RFormula.






[jira] [Resolved] (SPARK-15927) Eliminate redundant code in DAGScheduler's getParentStages and getAncestorShuffleDependencies methods.

2016-06-14 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-15927.

   Resolution: Fixed
Fix Version/s: 2.1.0

Fixed by 
https://github.com/apache/spark/commit/5d50d4f0f9db3e6cc7c51e35cdb2d12daa4fd108

> Eliminate redundant code in DAGScheduler's getParentStages and 
> getAncestorShuffleDependencies methods.
> --
>
> Key: SPARK-15927
> URL: https://issues.apache.org/jira/browse/SPARK-15927
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 2.1.0
>
>
> The getParentStages and getAncestorShuffleDependencies methods have a lot of 
> repeated code to traverse the dependency graph.  We should create a function 
> that they can both call.






[jira] [Assigned] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-15957:
---

Assignee: Yanbo Liang  (was: Apache Spark)

> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add param to make users can force to index label whether it is numeric or 
> string. For classification algorithms, we force to index label by setting it 
> with true.






[jira] [Assigned] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15957:


Assignee: Apache Spark

> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Add a param so users can force label indexing whether the label is numeric or 
> string. For classification algorithms, we force label indexing by setting it 
> to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15957:


Assignee: (was: Apache Spark)

> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> Add a param so users can force label indexing whether the label is numeric or 
> string. For classification algorithms, we force label indexing by setting it 
> to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15957:


Assignee: Apache Spark

> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Add a param so users can force label indexing whether the label is numeric or 
> string. For classification algorithms, we force label indexing by setting it 
> to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15957:


Assignee: Apache Spark

> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Add a param so users can force label indexing whether the label is numeric or 
> string. For classification algorithms, we force label indexing by setting it 
> to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330896#comment-15330896
 ] 

Apache Spark commented on SPARK-15957:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/13675

> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> Add a param so users can force label indexing whether the label is numeric or 
> string. For classification algorithms, we force label indexing by setting it 
> to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15957:


Assignee: (was: Apache Spark)

> RFormula supports forcing to index label
> 
>
> Key: SPARK-15957
> URL: https://issues.apache.org/jira/browse/SPARK-15957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> Add a param so users can force label indexing whether the label is numeric or 
> string. For classification algorithms, we force label indexing by setting it 
> to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15957) RFormula supports forcing to index label

2016-06-14 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-15957:
---

 Summary: RFormula supports forcing to index label
 Key: SPARK-15957
 URL: https://issues.apache.org/jira/browse/SPARK-15957
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang


Add a param so users can force label indexing whether the label is numeric or 
string. For classification algorithms, we force label indexing by setting it 
to true.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15956) When unwrapping ORC avoid pattern matching at runtime

2016-06-14 Thread Brian Cho (JIRA)
Brian Cho created SPARK-15956:
-

 Summary: When unwrapping ORC avoid pattern matching at runtime
 Key: SPARK-15956
 URL: https://issues.apache.org/jira/browse/SPARK-15956
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Brian Cho
Priority: Minor


When unwrapping ORC values, pattern matching for each data value at runtime 
hurts performance. This should be avoided.

Instead, we can run the pattern matching once and return a function that is 
subsequently used to unwrap each data value. This is already implemented for 
certain primitive types. We should implement it for the remaining types, 
including complex types (e.g., list, map).
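A hedged sketch of the technique in Scala, using Catalyst data types rather than the actual Hive ObjectInspector plumbing: the match runs once per column and the returned closure is reused for every value.

{code:scala}
import org.apache.spark.sql.types._

// Resolve the pattern match once per column; reuse the closure per value.
def makeUnwrapper(dataType: DataType): Any => Any = dataType match {
  case IntegerType => (v: Any) => v.asInstanceOf[Number].intValue()
  case StringType  => (v: Any) => v.toString
  case ArrayType(elementType, _) =>
    val unwrapElement = makeUnwrapper(elementType)
    (v: Any) => v.asInstanceOf[Seq[Any]].map(unwrapElement)
  case _ => (v: Any) => v  // fall back to passing the value through
}

// Usage: one match up front, then a tight loop with no per-value matching.
val unwrapInts = makeUnwrapper(ArrayType(IntegerType))
unwrapInts(Seq(1, 2, 3))
{code}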



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15954) TestHive has issues being used in PySpark

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15954:


Assignee: Apache Spark

> TestHive has issues being used in PySpark
> -
>
> Key: SPARK-15954
> URL: https://issues.apache.org/jira/browse/SPARK-15954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: holdenk
>Assignee: Apache Spark
>
> SPARK-15745 made TestHive unreliable from PySpark test cases. To support it, 
> we should allow either resource-based or system-property-based lookup for 
> loading the hive file.
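A minimal sketch of the dual lookup described here; the property name and helper below are illustrative assumptions, not the actual TestHive code.

{code:scala}
import java.io.File
import java.net.URL

// Try a system property first (useful when tests run outside the build's
// classpath, e.g. from PySpark), then fall back to a classpath resource.
def resolveHiveFile(name: String): URL = {
  sys.props.get("spark.test.home") match {
    case Some(home) => new File(home, name).toURI.toURL
    case None =>
      Option(Thread.currentThread().getContextClassLoader.getResource(name))
        .getOrElse(throw new IllegalArgumentException(s"Cannot locate $name"))
  }
}
{code}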



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15954) TestHive has issues being used in PySpark

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330822#comment-15330822
 ] 

Apache Spark commented on SPARK-15954:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/12938

> TestHive has issues being used in PySpark
> -
>
> Key: SPARK-15954
> URL: https://issues.apache.org/jira/browse/SPARK-15954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: holdenk
>
> SPARK-15745 made TestHive unreliable from PySpark test cases. To support it, 
> we should allow either resource-based or system-property-based lookup for 
> loading the hive file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15954) TestHive has issues being used in PySpark

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15954:


Assignee: (was: Apache Spark)

> TestHive has issues being used in PySpark
> -
>
> Key: SPARK-15954
> URL: https://issues.apache.org/jira/browse/SPARK-15954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: holdenk
>
> SPARK-15745 made TestHive unreliable from PySpark test cases. To support it, 
> we should allow either resource-based or system-property-based lookup for 
> loading the hive file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15955) Failed Spark application returns with exitcode equals to zero

2016-06-14 Thread Yesha Vora (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesha Vora updated SPARK-15955:
---
Summary: Failed Spark application returns with exitcode equals to zero  
(was: Failed Spark application returns with client console equals zero)

> Failed Spark application returns with exitcode equals to zero
> -
>
> Key: SPARK-15955
> URL: https://issues.apache.org/jira/browse/SPARK-15955
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> Scenario:
> * Set up cluster with wire-encryption enabled.
> * set 'spark.authenticate.enableSaslEncryption' = 'false' and 
> 'spark.shuffle.service.enabled' = 'true'
> * run sparkPi application.
> {code}
> client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
> diagnostics: Max number of executor failures (3) reached
> ApplicationMaster host: xx.xx.xx.xxx
> ApplicationMaster RPC port: 0
> queue: default
> start time: 1465941051976
> final status: FAILED
> tracking URL: https://xx.xx.xx.xxx:8090/proxy/application_1465925772890_0016/
> user: hrt_qa
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1465925772890_0016 finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1092)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1139)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> INFO ShutdownHookManager: Shutdown hook called{code}
> This Spark application exits with exit code 0. A failed application should 
> not return exit code 0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15954) TestHive has issues being used in PySpark

2016-06-14 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330821#comment-15330821
 ] 

holdenk commented on SPARK-15954:
-

See related PR https://github.com/apache/spark/pull/12938

> TestHive has issues being used in PySpark
> -
>
> Key: SPARK-15954
> URL: https://issues.apache.org/jira/browse/SPARK-15954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: holdenk
>
> SPARK-15745 made TestHive unreliable from PySpark test cases. To support it, 
> we should allow either resource-based or system-property-based lookup for 
> loading the hive file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15955) Failed Spark application returns with client console equals zero

2016-06-14 Thread Yesha Vora (JIRA)
Yesha Vora created SPARK-15955:
--

 Summary: Failed Spark application returns with client console 
equals zero
 Key: SPARK-15955
 URL: https://issues.apache.org/jira/browse/SPARK-15955
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Yesha Vora


Scenario:
* Set up cluster with wire-encryption enabled.
* set 'spark.authenticate.enableSaslEncryption' = 'false' and 
'spark.shuffle.service.enabled' = 'true'
* run sparkPi application.

{code}
client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
diagnostics: Max number of executor failures (3) reached
ApplicationMaster host: xx.xx.xx.xxx
ApplicationMaster RPC port: 0
queue: default
start time: 1465941051976
final status: FAILED
tracking URL: https://xx.xx.xx.xxx:8090/proxy/application_1465925772890_0016/
user: hrt_qa
Exception in thread "main" org.apache.spark.SparkException: Application 
application_1465925772890_0016 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1092)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1139)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
INFO ShutdownHookManager: Shutdown hook called{code}

This Spark application exits with exit code 0. A failed application should not 
return exit code 0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15954) TestHive has issues being used in PySpark

2016-06-14 Thread holdenk (JIRA)
holdenk created SPARK-15954:
---

 Summary: TestHive has issues being used in PySpark
 Key: SPARK-15954
 URL: https://issues.apache.org/jira/browse/SPARK-15954
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: holdenk


SPARK-15745 made TestHive unreliable from PySpark test cases. To support it, we 
should allow either resource-based or system-property-based lookup for loading 
the hive file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15953) Renamed ContinuousQuery to StreamingQuery for simplicity

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15953:


Assignee: Apache Spark  (was: Tathagata Das)

> Renamed ContinuousQuery to StreamingQuery for simplicity
> 
>
> Key: SPARK-15953
> URL: https://issues.apache.org/jira/browse/SPARK-15953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Make the API more intuitive by removing the term "Continuous".
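For reference, a short sketch of how the renamed handle reads in user code; {{df}} is assumed to be a streaming DataFrame.

{code:scala}
import org.apache.spark.sql.streaming.StreamingQuery

// start() now returns a StreamingQuery (previously ContinuousQuery).
val query: StreamingQuery = df.writeStream.format("console").start()
query.awaitTermination()
{code}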



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15953) Renamed ContinuousQuery to StreamingQuery for simplicity

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15953:


Assignee: Tathagata Das  (was: Apache Spark)

> Renamed ContinuousQuery to StreamingQuery for simplicity
> 
>
> Key: SPARK-15953
> URL: https://issues.apache.org/jira/browse/SPARK-15953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Make the API more intuitive by removing the term "Continuous".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15953) Renamed ContinuousQuery to StreamingQuery for simplicity

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330805#comment-15330805
 ] 

Apache Spark commented on SPARK-15953:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13673

> Renamed ContinuousQuery to StreamingQuery for simplicity
> 
>
> Key: SPARK-15953
> URL: https://issues.apache.org/jira/browse/SPARK-15953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Make the API more intuitive by removing the term "Continuous".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15953) Renamed ContinuousQuery to StreamingQuery for simplicity

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15953:


Assignee: Apache Spark  (was: Tathagata Das)

> Renamed ContinuousQuery to StreamingQuery for simplicity
> 
>
> Key: SPARK-15953
> URL: https://issues.apache.org/jira/browse/SPARK-15953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Make the API more intuitive by removing the term "Continuous".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15953) Renamed ContinuousQuery to StreamingQuery for simplicity

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15953:


Assignee: Tathagata Das  (was: Apache Spark)

> Renamed ContinuousQuery to StreamingQuery for simplicity
> 
>
> Key: SPARK-15953
> URL: https://issues.apache.org/jira/browse/SPARK-15953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Make the API more intuitive by removing the term "Continuous".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15953) Renamed ContinuousQuery to StreamingQuery for simplicity

2016-06-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-15953:
--
Target Version/s: 2.0.0

> Renamed ContinuousQuery to StreamingQuery for simplicity
> 
>
> Key: SPARK-15953
> URL: https://issues.apache.org/jira/browse/SPARK-15953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Make the API more intuitive by removing the term "Continuous".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15953) Renamed ContinuousQuery to StreamingQuery for simplicity

2016-06-14 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-15953:
-

 Summary: Renamed ContinuousQuery to StreamingQuery for simplicity
 Key: SPARK-15953
 URL: https://issues.apache.org/jira/browse/SPARK-15953
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


Make the API more intuitive by removing the term "Continuous".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15933) Refactor reader-writer interface for streaming DFs to use DataStreamReader/Writer

2016-06-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-15933:
--
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-8360

> Refactor reader-writer interface for streaming DFs to use 
> DataStreamReader/Writer
> -
>
> Key: SPARK-15933
> URL: https://issues.apache.org/jira/browse/SPARK-15933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Currently, DataFrameReader/Writer has methods that are needed only for 
> streaming DFs and methods that are needed only for non-streaming DFs. This is 
> quite awkward because each of those methods throws a runtime exception for 
> one case or the other. So rather than having half the methods throw runtime 
> exceptions, it is better to have a separate reader/writer API for streams.
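A hedged sketch of the split API, assuming the readStream/writeStream entry points proposed here and an existing SparkSession named {{spark}}.

{code:scala}
// Streaming reads and writes go through dedicated reader/writer objects,
// so the batch DataFrameReader/Writer no longer needs streaming-only methods.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
{code}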



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330723#comment-15330723
 ] 

Apache Spark commented on SPARK-15046:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13669

> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>Priority: Blocker
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}
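The failure reduces to reading a duration that carries a unit suffix with getLong. A hedged sketch of the mismatch; the config key below is only an example, not necessarily the one involved.

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.example.renewal.interval", "86400079ms")

// conf.getLong("spark.example.renewal.interval", 0L)  // NumberFormatException: "86400079ms"
conf.getTimeAsMs("spark.example.renewal.interval")      // 86400079, unit suffix handled
{code}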



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15741:


Assignee: (was: Apache Spark)

> PySpark Cleanup of _setDefault with seed=None
> -
>
> Key: SPARK-15741
> URL: https://issues.apache.org/jira/browse/SPARK-15741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> Several places in PySpark ML have Params._setDefault with a seed param equal 
> to {{None}}.  This is unnecessary as it will translate to a {{0}} even though 
> the param has a fixed value based on the hashed classname by default.  
> Currently, the ALS doc test output depends on this happening and would be 
> clearer and more stable if it were explicitly set to {{0}}.  These should be 
> cleaned up for stability and consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15741:


Assignee: Apache Spark

> PySpark Cleanup of _setDefault with seed=None
> -
>
> Key: SPARK-15741
> URL: https://issues.apache.org/jira/browse/SPARK-15741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
>
> Several places in PySpark ML have Params._setDefault with a seed param equal 
> to {{None}}.  This is unnecessary as it will translate to a {{0}} even though 
> the param has a fixed value based on the hashed classname by default.  
> Currently, the ALS doc test output depends on this happening and would be 
> clearer and more stable if it were explicitly set to {{0}}.  These should be 
> cleaned up for stability and consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330696#comment-15330696
 ] 

Apache Spark commented on SPARK-15741:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/13672

> PySpark Cleanup of _setDefault with seed=None
> -
>
> Key: SPARK-15741
> URL: https://issues.apache.org/jira/browse/SPARK-15741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> Several places in PySpark ML have Params._setDefault with a seed param equal 
> to {{None}}.  This is unnecessary as it will translate to a {{0}} even though 
> the param has a fixed value based on the hashed classname by default.  
> Currently, the ALS doc test output depends on this happening and would be 
> clearer and more stable if it were explicitly set to {{0}}.  These should be 
> cleaned up for stability and consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330675#comment-15330675
 ] 

Mark Grover commented on SPARK-12177:
-

bq. I can rename it to spark-streaming-kafka-0-10 to match the change made
for the 0.8 consumer
Thanks!

bq. Mark, have you (or anyone else) actually tried this PR out using TLS?
No, I haven't, sorry.

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09, with 
> the changed API. I didn't remove the old classes, for backward compatibility: 
> users will not need to change their old Spark applications when they upgrade 
> to a new Spark version.
> Please review my changes.
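For orientation, a hedged sketch of the 0.10 direct stream API as proposed in the PR (package and strategy names could still change before release); {{ssc}} is an existing StreamingContext.

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// The new consumer owns the Kafka connection, so security settings such as
// TLS are passed through kafkaParams rather than Spark-specific options.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
)

stream.map(record => (record.key, record.value)).print()
{code}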



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15952) "show databases" does not get sorted result

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15952:


Assignee: (was: Apache Spark)

> "show databases" does not get sorted result
> ---
>
> Key: SPARK-15952
> URL: https://issues.apache.org/jira/browse/SPARK-15952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Bo Meng
>
> Two issues I've found with the "show databases" command:
> 1. The returned database name list is not sorted; it is only sorted when 
> "like" is used together with it (Hive always returns a sorted list).
> 2. When it is used as sql("show databases").show, it outputs a table with the 
> column named "result", but sql("show tables").show outputs the column name 
> "tableName", so I think we should be consistent and at least use 
> "databaseName".
> I will make a PR shortly.
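A quick user-level check of the expected behavior, assuming a SparkSession named {{spark}} and the column name suggested above.

{code:scala}
// The returned list should be sorted even without a LIKE pattern, and the
// single output column should have a stable name (assumed: "databaseName").
val dbs = spark.sql("SHOW DATABASES").collect().map(_.getString(0))
assert(dbs.sameElements(dbs.sorted))
{code}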



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15952) "show databases" does not get sorted result

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330663#comment-15330663
 ] 

Apache Spark commented on SPARK-15952:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13671

> "show databases" does not get sorted result
> ---
>
> Key: SPARK-15952
> URL: https://issues.apache.org/jira/browse/SPARK-15952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Bo Meng
>
> Two issues I've found with the "show databases" command:
> 1. The returned database name list is not sorted; it is only sorted when 
> "like" is used together with it (Hive always returns a sorted list).
> 2. When it is used as sql("show databases").show, it outputs a table with the 
> column named "result", but sql("show tables").show outputs the column name 
> "tableName", so I think we should be consistent and at least use 
> "databaseName".
> I will make a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15952) "show databases" does not get sorted result

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15952:


Assignee: Apache Spark

> "show databases" does not get sorted result
> ---
>
> Key: SPARK-15952
> URL: https://issues.apache.org/jira/browse/SPARK-15952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Bo Meng
>Assignee: Apache Spark
>
> Two issues I've found with the "show databases" command:
> 1. The returned database name list is not sorted; it is only sorted when 
> "like" is used together with it (Hive always returns a sorted list).
> 2. When it is used as sql("show databases").show, it outputs a table with the 
> column named "result", but sql("show tables").show outputs the column name 
> "tableName", so I think we should be consistent and at least use 
> "databaseName".
> I will make a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None

2016-06-14 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-15741:
-
Description: Several places in PySpark ML have Params._setDefault with a 
seed param equal to {{None}}.  This is unnecessary as it will translate to a 
{{0}} even though the param has a fixed value based on the hashed classname 
by default.  Currently, the ALS doc test output depends on this happening and 
would be clearer and more stable if it were explicitly set to {{0}}.  These 
should be cleaned up for stability and consistency.  (was: Calling 
Params._setDefault with a param equal to {{None}} will be silently ignored 
internally.  There are several cases where this is done with the {{seed}} 
param, making it seem like it might do something.  These cases should be 
removed for the sake of consistency.)

> PySpark Cleanup of _setDefault with seed=None
> -
>
> Key: SPARK-15741
> URL: https://issues.apache.org/jira/browse/SPARK-15741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> Several places in PySpark ML have Params._setDefault with a seed param equal 
> to {{None}}.  This is unnecessary as it will translate to a {{0}} even though 
> the param has a fixed value based on the hashed classname by default.  
> Currently, the ALS doc test output depends on this happening and would be 
> clearer and more stable if it were explicitly set to {{0}}.  These should be 
> cleaned up for stability and consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None

2016-06-14 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reopened SPARK-15741:
--

Reopened as I feel this still should be cleaned up.

> PySpark Cleanup of _setDefault with seed=None
> -
>
> Key: SPARK-15741
> URL: https://issues.apache.org/jira/browse/SPARK-15741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> Several places in PySpark ML have Params._setDefault with a seed param equal 
> to {{None}}.  This is unnecessary as it will translate to a {{0}} even though 
> the param has a fixed value based on the hashed classname by default.  
> Currently, the ALS doc test output depends on this happening and would be 
> clearer and more stable if it were explicitly set to {{0}}.  These should be 
> cleaned up for stability and consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15952) "show databases" does not get sorted result

2016-06-14 Thread Bo Meng (JIRA)
Bo Meng created SPARK-15952:
---

 Summary: "show databases" does not get sorted result
 Key: SPARK-15952
 URL: https://issues.apache.org/jira/browse/SPARK-15952
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Bo Meng


Two issues I've found with the "show databases" command:
1. The returned database name list is not sorted; it is only sorted when 
"like" is used together with it (Hive always returns a sorted list).
2. When it is used as sql("show databases").show, it outputs a table with the 
column named "result", but sql("show tables").show outputs the column name 
"tableName", so I think we should be consistent and at least use 
"databaseName".

I will make a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count

2016-06-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reopened SPARK-15892:
---

> Incorrectly merged AFTAggregator with zero total count
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count

2016-06-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15892:
--
Fix Version/s: (was: 1.6.2)

> Incorrectly merged AFTAggregator with zero total count
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15951:


Assignee: Apache Spark

> Change Executors Page to use datatables to support sorting columns and 
> searching
> 
>
> Key: SPARK-15951
> URL: https://issues.apache.org/jira/browse/SPARK-15951
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
>Assignee: Apache Spark
> Fix For: 2.1.0
>
>
> Support column sorting and search on the Executors page using jQuery 
> DataTables and the REST API. Before this change, the executors page was 
> generated as hard-coded HTML and could not support search; sorting was also 
> disabled if any application had more than one attempt. Supporting search and 
> sort (over all applications rather than the 20 entries on the current page) 
> will greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330629#comment-15330629
 ] 

Apache Spark commented on SPARK-15951:
--

User 'kishorvpatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/13670

> Change Executors Page to use datatables to support sorting columns and 
> searching
> 
>
> Key: SPARK-15951
> URL: https://issues.apache.org/jira/browse/SPARK-15951
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
> Fix For: 2.1.0
>
>
> Support column sorting and search on the Executors page using jQuery 
> DataTables and the REST API. Before this change, the executors page was 
> generated as hard-coded HTML and could not support search; sorting was also 
> disabled if any application had more than one attempt. Supporting search and 
> sort (over all applications rather than the 20 entries on the current page) 
> will greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching

2016-06-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15951:


Assignee: (was: Apache Spark)

> Change Executors Page to use datatables to support sorting columns and 
> searching
> 
>
> Key: SPARK-15951
> URL: https://issues.apache.org/jira/browse/SPARK-15951
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Kishor Patil
> Fix For: 2.1.0
>
>
> Support column sorting and search on the Executors page using jQuery 
> DataTables and the REST API. Before this change, the executors page was 
> generated as hard-coded HTML and could not support search; sorting was also 
> disabled if any application had more than one attempt. Supporting search and 
> sort (over all applications rather than the 20 entries on the current page) 
> will greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330552#comment-15330552
 ] 

Cody Koeninger commented on SPARK-12177:


I can rename it to spark-streaming-kafka-0-10 to match the change made
for the 0.8 consumer

Mark, have you (or anyone else) actually tried this PR out using TLS?



> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API in 
> separate classes in the package org.apache.spark.streaming.kafka.v09, with 
> the changed API. I didn't remove the old classes, for backward compatibility: 
> users will not need to change their old Spark applications when they upgrade 
> to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-06-14 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330550#comment-15330550
 ] 

Russell Alexander Spitzer commented on SPARK-13928:
---

So users (like me ;)) need to write their own Logging trait now? I'm a little 
confused by the description.

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide a compatibility package that adds 
> logging.
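A minimal user-side replacement is straightforward; this is a sketch built on SLF4J (which Spark itself uses), not the Spark-internal trait.

{code:scala}
import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)

  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String): Unit = log.error(msg)
}
{code}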



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-14 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330529#comment-15330529
 ] 

Shixiong Zhu commented on SPARK-15905:
--

Did you happen to block the stdout or stderr? Such as the disk is full and 
log4j cannot flush logs to the disk?

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors 
> and the job being stuck. After looking at the locking dependencies amongst 
> the driver threads in the jstack, this is where the driver seems to be stuck.
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}
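If the console progress bar turns out to be the blocked writer, one possible mitigation while the root cause is investigated is to disable it; spark.ui.showConsoleProgress is an existing setting, and this is only a workaround sketch, not a fix.

{code:scala}
import org.apache.spark.SparkConf

// With the bar disabled, the "refresh progress" timer thread never writes to
// stderr, so it cannot wedge on a blocked output stream.
val conf = new SparkConf().set("spark.ui.showConsoleProgress", "false")
{code}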



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-14 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330526#comment-15330526
 ] 

Yan Chen commented on SPARK-15716:
--

The checkpoint path is on HDFS. The application has already been shut down; I 
will try to get jstack output next time.

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
> >   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
> >   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code. When I ran it, the memory usage of the 
> driver kept going up. There is no file input in our runs. The batch interval 
> is set to 200 milliseconds; the processing time for each batch is below 150 
> milliseconds, and for most batches it is below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The four rightmost red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comments I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15951) Change Executors Page to use datatables to support sorting columns and searching

2016-06-14 Thread Kishor Patil (JIRA)
Kishor Patil created SPARK-15951:


 Summary: Change Executors Page to use datatables to support 
sorting columns and searching
 Key: SPARK-15951
 URL: https://issues.apache.org/jira/browse/SPARK-15951
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Kishor Patil
 Fix For: 2.1.0


Support column sorting and search on the Executors page using jQuery DataTables 
and the REST API. Before this change, the Executors page was generated as 
hard-coded HTML and could not support search; sorting was also disabled if any 
application had more than one attempt. Supporting search and sort (over all 
applications rather than only the 20 entries on the current page) in every case 
will greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15888) Python UDF over aggregate fails

2016-06-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15888:
---
Priority: Blocker  (was: Major)

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Priority: Blocker
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15888) Python UDF over aggregate fails

2016-06-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15888:
--

Assignee: Davies Liu

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>Assignee: Davies Liu
>Priority: Blocker
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330500#comment-15330500
 ] 

Mark Grover commented on SPARK-12177:
-

bq. It's worth mentioning that authentication is also supported via TLS. I am 
aware of a number of people who are using TLS for both authentication and 
encryption. So, the security benefit is available now for some people, at least.
Fair point, thanks.

Ok, so what remains to get this in?
1. The PR (https://github.com/apache/spark/pull/11863) has been reviewed by me, 
so it probably needs to be reviewed by a committer.
2. Sorry for sounding like a broken record, but I don't think kafka-beta as the 
name for the subproject makes much sense, especially now that the new consumer 
API in Kafka 0.10 is not beta. So, some committer buy-in would be valuable 
there too.

Anything else?

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 already released and it introduce new consumer API that not 
> compatible with old one. So, I added new consumer api. I made separate 
> classes in package org.apache.spark.streaming.kafka.v09 with changed API. I 
> didn't remove old classes for more backward compatibility. User will not need 
> to change his old spark applications when he upgrades to a new Spark version.
> Please review my changes.
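
For reference, a minimal sketch of what a direct stream over the new consumer 
API looks like. The package and helper names below 
(org.apache.spark.streaming.kafka010, LocationStrategies, ConsumerStrategies) 
are taken from the integration as it eventually shipped, not from the v09 
package described above, so treat them as assumptions for illustration only.

{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object NewConsumerSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-0-10-sketch"), Seconds(5))

    // Standard new-consumer configuration; broker, group and topic names are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    // Direct stream backed by the new consumer API.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams))

    stream.map(record => (record.key, record.value)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}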



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-14 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330497#comment-15330497
 ] 

Shixiong Zhu commented on SPARK-15716:
--

I saw there were a lot of 
"org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler" threads. 
Where does the checkpoint path point to? It looks like the checkpoint writer is 
pretty slow. Could you also provide the jstack output?

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, but when I run it, the memory usage of the 
> driver keeps going up. There is no file input in our runs. The batch interval 
> is set to 200 milliseconds; processing time for each batch is below 150 
> milliseconds, with most batches below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, triggered manually with the 
> "jcmd <pid> GC.run" command.
> I also did more experiments in the second and third comments I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Resolved] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks

2016-06-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15247.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

This issue has been resolved by https://github.com/apache/spark/pull/13137.

> sqlCtx.read.parquet yields at least n_executors * n_cores tasks
> ---
>
> Key: SPARK-15247
> URL: https://issues.apache.org/jira/browse/SPARK-15247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johnny W.
>Assignee: Takeshi Yamamuro
> Fix For: 2.0.0
>
>
> sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even 
> when there is only one very small file.
> This issue can increase the latency of small jobs.
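
A minimal way to observe the behaviour described above, assuming a Spark 1.6.x 
spark-shell where sqlContext is predefined; the Parquet path below is a 
placeholder for a single very small file.

{code}
// Point this at one very small Parquet file (placeholder path).
val df = sqlContext.read.parquet("/path/to/one-small-file.parquet")

// Before the fix this was reported to be at least n_executors * n_cores,
// even though a single small file only needs one task.
println(df.rdd.partitions.length)
{code}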



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks

2016-06-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15247:
-
Assignee: Takeshi Yamamuro

> sqlCtx.read.parquet yields at least n_executors * n_cores tasks
> ---
>
> Key: SPARK-15247
> URL: https://issues.apache.org/jira/browse/SPARK-15247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johnny W.
>Assignee: Takeshi Yamamuro
>
> sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even 
> when there is only one very small file.
> This issue can increase the latency of small jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-14 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330482#comment-15330482
 ] 

Tejas Patil edited comment on SPARK-15905 at 6/14/16 8:12 PM:
--

[~zsxwing]

>> Do you have the whole jstack output?

I will not be able to share it as is... but looking through the entire 7k-line 
jstack file and removing things like IP addresses or other company-internal 
details seems like a lot of work to me.

>> Could you check your disk? Maybe some bad disks cause the hang.

At the time this happened, I did not notice any problems with the disk on the 
box. However, I will keep an eye on that next time.

>> By the way, how did you use Spark? Did you just run it or call it via some 
>> Process APIs?

We run Spark jobs directly via spark-shell.


was (Author: tejasp):
@zsxwing

>> Do you have the whole jstack output?

I will not be able to share it as is... but looking through the entire 7k-line 
jstack file and removing things like IP addresses or other company-internal 
details seems like a lot of work to me.

>> Could you check your disk? Maybe some bad disks cause the hang.

At the time this happened, I did not notice any problems with the disk on the 
box. However, I will keep an eye on that next time.

>> By the way, how did you use Spark? Did you just run it or call it via some 
>> Process APIs?

We run Spark jobs directly via spark-shell.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors 
> and the job being stuck. Looking at the locking dependencies amongst the 
> driver threads in the jstack, this is where the driver appears to be stuck:
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}
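
One hedged mitigation, not taken from this report: since the stack shows the 
hang inside ConsoleProgressBar, turning the console progress bar off avoids 
that code path entirely. spark.ui.showConsoleProgress is an existing 
configuration; whether it resolves the underlying blocked console write here is 
an assumption.

{code:none}
spark-shell --conf spark.ui.showConsoleProgress=false
{code}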



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15905) Driver hung while writing to console progress bar

2016-06-14 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330482#comment-15330482
 ] 

Tejas Patil commented on SPARK-15905:
-

@zsxwing

>> Do you have the whole jstack output?

I will not be able to share it as is... but looking through the entire 7k-line 
jstack file and removing things like IP addresses or other company-internal 
details seems like a lot of work to me.

>> Could you check your disk? Maybe some bad disks cause the hang.

At the time this happened, I did not notice any problems with the disk on the 
box. However, I will keep an eye on that next time.

>> By the way, how did you use Spark? Did you just run it or call it via some 
>> Process APIs?

We run Spark jobs directly via spark-shell.

> Driver hung while writing to console progress bar
> -
>
> Key: SPARK-15905
> URL: https://issues.apache.org/jira/browse/SPARK-15905
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> This leads to the driver not being able to get heartbeats from its executors 
> and the job being stuck. Looking at the locking dependencies amongst the 
> driver threads in the jstack, this is where the driver appears to be stuck:
> {noformat}
> "refresh progress" #113 daemon prio=5 os_prio=0 tid=0x7f7986cbc800 
> nid=0x7887d runnable [0x7f6d3507a000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> - locked <0x7f6eb81dd290> (a java.io.BufferedOutputStream)
> at java.io.PrintStream.write(PrintStream.java:482)
>- locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
> at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
> at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
> - locked <0x7f6eb81dd400> (a java.io.OutputStreamWriter)
> at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
> at java.io.PrintStream.write(PrintStream.java:527)
> - locked <0x7f6eb81dd258> (a java.io.PrintStream)
> at java.io.PrintStream.print(PrintStream.java:669)
> at 
> org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:99)
> at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:69)
> - locked <0x7f6ed33b48a0> (a 
> org.apache.spark.ui.ConsoleProgressBar)
> at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:53)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15904) High Memory Pressure using MLlib K-means

2016-06-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330480#comment-15330480
 ] 

Sean Owen commented on SPARK-15904:
---

I don't think the problem is your code. You're allocating on the one hand too 
little memory (OOME), and on the other hand too much (swapping).

> High Memory Pressure using MLlib K-means
> 
>
> Key: SPARK-15904
> URL: https://issues.apache.org/jira/browse/SPARK-15904
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Mac OS X 10.11.6beta on Macbook Pro 13" mid-2012. 16GB 
> of RAM.
>Reporter: Alessio
>Priority: Minor
>
> Running MLlib K-Means on a ~400MB dataset (12 partitions), persisted on 
> Memory and Disk.
> Everything is fine, although at the end of K-Means, after the number of 
> iterations, the cost function value and the running time are printed, there 
> is a "Removing RDD  from persistence list" stage. However, during this stage 
> there is high memory pressure, which is weird, since the RDDs are about to be 
> removed. Full log of this stage:
> 16/06/12 20:37:33 INFO clustering.KMeans: Run 0 finished in 14 iterations
> 16/06/12 20:37:33 INFO clustering.KMeans: Iterations took 694.544 seconds.
> 16/06/12 20:37:33 INFO clustering.KMeans: KMeans converged in 14 iterations.
> 16/06/12 20:37:33 INFO clustering.KMeans: The cost for the best run is 
> 49784.87126751288.
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 781 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 781
> 16/06/12 20:37:33 INFO rdd.MapPartitionsRDD: Removing RDD 780 from 
> persistence list
> 16/06/12 20:37:33 INFO storage.BlockManager: Removing RDD 780
> I'm running this K-Means on a 16GB machine, with Spark Context as local[*]. 
> My machine has an i5 hyperthreaded dual-core, thus [*] means 4.
> I'm launching this application through spark-submit with --driver-memory 9G



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15849) FileNotFoundException on _temporary while doing saveAsTable to S3

2016-06-14 Thread Thomas Demoor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330240#comment-15330240
 ] 

Thomas Demoor commented on SPARK-15849:
---

Forgot to mention, it also speeds up your writes 2x

> FileNotFoundException on _temporary while doing saveAsTable to S3
> -
>
> Key: SPARK-15849
> URL: https://issues.apache.org/jira/browse/SPARK-15849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: AWS EC2 with spark on yarn and s3 storage
>Reporter: Sandeep
>
> When submitting spark jobs to yarn cluster, I occasionally see these error 
> messages while doing saveAsTable. I have tried doing this with 
> spark.speculation=false, and get the same error. These errors are similar to 
> SPARK-2984, but my jobs are writing to S3(s3n) :
> Caused by: java.io.FileNotFoundException: File 
> s3n://xxx/_temporary/0/task_201606080516_0004_m_79 does not exist.
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
> ... 42 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15849) FileNotFoundException on _temporary while doing saveAsTable to S3

2016-06-14 Thread Thomas Demoor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330238#comment-15330238
 ] 

Thomas Demoor commented on SPARK-15849:
---

Seems like typical list-after-write inconsistency. However, you can avoid this 
issue. With S3, you should use a direct committer instead of the standard 
Hadoop ones. Googling for DirectParquetOutputCommitter should help you along.

There is no reason to have the "write to _temporary and atomically rename to 
final version" step, as S3 can handle concurrent writers. We are working to 
get this behaviour directly into Hadoop (HADOOP-9565).
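
A minimal sketch of that approach from a Spark 1.6.x spark-shell (sqlContext 
predefined). The spark.sql.parquet.output.committer.class setting exists in 
1.6; the committer class name below is an assumption and has moved between 
packages across Spark versions, and the bucket/path are placeholders. Direct 
committers are generally only safe with speculation disabled, which matches the 
spark.speculation=false setting already mentioned in the report.

{code}
// Route Parquet writes through a direct committer so nothing is staged under
// _temporary and then renamed on commit. Verify the class name on your build.
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

// Placeholder bucket and path.
sqlContext.range(0, 10).write.mode("overwrite").parquet("s3n://some-bucket/some/path")
{code}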

> FileNotFoundException on _temporary while doing saveAsTable to S3
> -
>
> Key: SPARK-15849
> URL: https://issues.apache.org/jira/browse/SPARK-15849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: AWS EC2 with spark on yarn and s3 storage
>Reporter: Sandeep
>
> When submitting spark jobs to yarn cluster, I occasionally see these error 
> messages while doing saveAsTable. I have tried doing this with 
> spark.speculation=false, and get the same error. These errors are similar to 
> SPARK-2984, but my jobs are writing to S3(s3n) :
> Caused by: java.io.FileNotFoundException: File 
> s3n://xxx/_temporary/0/task_201606080516_0004_m_79 does not exist.
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
> ... 42 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15895) _common_metadata and _metadata appearing in the inner partitioning dirs of a partitioned parquet datasets break partitioning discovery

2016-06-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15895.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13623
[https://github.com/apache/spark/pull/13623]

> _common_metadata and _metadata appearing in the inner partitioning dirs of a 
> partitioned parquet datasets break partitioning discovery
> --
>
> Key: SPARK-15895
> URL: https://issues.apache.org/jira/browse/SPARK-15895
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> see 
> https://issues.apache.org/jira/browse/SPARK-13207?focusedCommentId=15305703=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15305703



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-14 Thread Yan Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Chen updated SPARK-15716:
-
Affects Version/s: (was: 1.6.1)
   (was: 1.6.0)
   (was: 1.5.0)
   (was: 2.0.0)

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, but when I run it, the memory usage of the 
> driver keeps going up. There is no file input in our runs. The batch interval 
> is set to 200 milliseconds; processing time for each batch is below 150 
> milliseconds, with most batches below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, triggered manually with the 
> "jcmd <pid> GC.run" command.
> I also did more experiments in the second and third comments I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15915) CacheManager should use canonicalized plan for planToCache.

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330175#comment-15330175
 ] 

Apache Spark commented on SPARK-15915:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13668

> CacheManager should use canonicalized plan for planToCache.
> ---
>
> Key: SPARK-15915
> URL: https://issues.apache.org/jira/browse/SPARK-15915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.0.0
>
>
> A {{DataFrame}} whose plan overrides {{sameResult}} but does not use the 
> canonicalized plan for the comparison cannot be cached via cacheTable.
> The example is like:
> {code}
> val localRelation = Seq(1, 2, 3).toDF()
> localRelation.createOrReplaceTempView("localRelation")
> spark.catalog.cacheTable("localRelation")
> assert(
>   localRelation.queryExecution.withCachedData.collect {
> case i: InMemoryRelation => i
>   }.size == 1)
> {code}
> and this will fail as:
> {noformat}
> ArrayBuffer() had size 0 instead of expected size 1
> {noformat}
> The reason is that when we do {{spark.catalog.cacheTable("localRelation")}}, 
> {{CacheManager}} caches the plan wrapped by {{SubqueryAlias}}, but when 
> planning for the DataFrame {{localRelation}}, {{CacheManager}} looks up the 
> cached table for the un-wrapped plan, because the plan for the DataFrame 
> {{localRelation}} is not wrapped.
> Some plans like {{LocalRelation}}, {{LogicalRDD}}, etc. override the 
> {{sameResult}} method but do not use the canonicalized plan for the 
> comparison, so the {{CacheManager}} can't detect that the plans are the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15934) Return binary mode in ThriftServer

2016-06-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330168#comment-15330168
 ] 

Apache Spark commented on SPARK-15934:
--

User 'epahomov' has created a pull request for this issue:
https://github.com/apache/spark/pull/13667

> Return binary mode in ThriftServer
> --
>
> Key: SPARK-15934
> URL: https://issues.apache.org/jira/browse/SPARK-15934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Critical
>
> In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095).
> That was a greatly irresponsible step, given that binary mode was the default 
> in 1.6.1 and is turned off in 2.0.0.
> Just to describe the magnitude of harm that not fixing this bug would do in 
> my organization:
> * Tableau works only through the Thrift Server and only with the binary 
> format. Tableau would not work with spark-2.0.0 at all!
> * I have a bunch of analysts in my organization with configured SQL clients 
> (DataGrip and Squirrel). I would need to go one by one to change the 
> connection string for them (DataGrip). Squirrel simply does not work with 
> http - some jar hell in my case.
> * Let me not mention all the other stuff that connects to our data 
> infrastructure through the ThriftServer as a gateway.
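
For reference, this is how a 1.6.x-style deployment typically pins the binary 
(TCP) transport when starting the Thrift server. The hiveconf keys below come 
from HiveServer2 and are shown as an illustration only, since whether the 
2.0.0 preview honours them at all is exactly what this issue is about.

{code:none}
sbin/start-thriftserver.sh \
  --hiveconf hive.server2.transport.mode=binary \
  --hiveconf hive.server2.thrift.port=10000
{code}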



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR

2016-06-14 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330166#comment-15330166
 ] 

Kai Jiang commented on SPARK-15767:
---

Thanks [~shivaram]! Yes, as you said, we should shorten the name of the 
function. I will open the PR later.

> Decision Tree Regression wrapper in SparkR
> --
>
> Key: SPARK-15767
> URL: https://issues.apache.org/jira/browse/SPARK-15767
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> Implement a wrapper in SparkR to support decision tree regression. R's native 
> Decision Tree Regression implementation comes from the rpart package, with 
> the signature rpart(formula, dataframe, method="anova"). I propose we 
> implement an API like spark.decisionTreeRegression(dataframe, formula, ...). 
> After having implemented decision tree classification, we could refactor 
> these two into an API more like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15864) Inconsistent Behaviors when Uncaching Non-cached Tables

2016-06-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15864:

Assignee: Xiao Li

> Inconsistent Behaviors when Uncaching Non-cached Tables
> ---
>
> Key: SPARK-15864
> URL: https://issues.apache.org/jira/browse/SPARK-15864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> To uncache a table, we have two different APIs: 
> {{UNCACHE TABLE}} and {{spark.catalog.uncacheTable}}.
> When the table is not cached, the first way reports nothing. However, the 
> second way reports a strange error message:
> {{requirement failed: Table [a: int] is not cached}}
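
A minimal sketch reproducing the inconsistency described above (application and 
table names are arbitrary):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("uncache-sketch").master("local[2]").getOrCreate()
spark.range(3).createOrReplaceTempView("t")   // "t" is never cached

// SQL API: completes silently even though the table is not cached.
spark.sql("UNCACHE TABLE t")

// Catalog API: fails with "requirement failed: Table [...] is not cached".
spark.catalog.uncacheTable("t")
{code}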



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


