[jira] [Commented] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2015-02-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315024#comment-14315024
 ] 

Marcelo Vanzin commented on SPARK-4705:
---

Hi [~twinkle],

I think the UI on the latest screenshot is a little too cluttered. How about:

- Keep the app id as the main link to the application's UI, pointing at the 
last attempt in the case of multiple attempts
- Have the attempt column list the attempt IDs only for those apps that have 
multiple attempts. Those with a single attempt would have an empty cell.

This would result in a redundant link (app id link + link to the last attempt 
pointing at the same place), but I think it looks better. And it's probably 
less confusing for those used to the current UI.

 Driver retries in yarn-cluster mode always fail if event logging is enabled
 ---

 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
 Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png


 yarn-cluster mode will retry running the driver in certain failure modes. If 
 event logging is enabled, the retry will most probably fail, because:
 {noformat}
 Exception in thread "Driver" java.io.IOException: Log directory 
 hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
  already exists!
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
 {noformat}
 The event log path should be made more unique, or perhaps retries of the same app 
 should clean up the old logs first.
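One way to make the path more unique is to key the event log directory on the attempt as well as the application. A minimal Scala sketch of the idea (the forAttempt/cleanStale helpers below are illustrative, not the actual Spark change):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object EventLogDirs {
  // Hypothetical helper: one directory per (application, attempt), so a retried
  // driver never collides with the directory left behind by a failed attempt.
  def forAttempt(baseDir: String, appId: String, attemptId: Int): Path =
    new Path(baseDir, s"${appId}_${attemptId}")

  // The alternative mentioned above: clean up the stale directory before reuse.
  def cleanStale(fs: FileSystem, dir: Path): Unit = {
    if (fs.exists(dir)) fs.delete(dir, true)
  }
}

// Example (paths are illustrative):
// val fs  = FileSystem.get(new Configuration())
// val dir = EventLogDirs.forAttempt("hdfs:///user/spark/applicationHistory",
//                                   "application_1417554558066_0003", attemptId = 2)
{code}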






[jira] [Updated] (SPARK-5592) java.net.URISyntaxException when insert data to a partitioned table

2015-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5592:
--
Assignee: wangfei

 java.net.URISyntaxException when insert data to a partitioned table  
 -

 Key: SPARK-5592
 URL: https://issues.apache.org/jira/browse/SPARK-5592
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei
Assignee: wangfei
 Fix For: 1.3.0


 create table sc as select * 
 from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
   union all 
   select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
   union all 
   select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 
 rows) ) s;
 create table sc_part (key string) partitioned by (ts string) stored as rcfile;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 insert overwrite table sc_part partition(ts) select * from sc;
 java.net.URISyntaxException: Relative path in absolute URI: 
 ts=2011-01-11+15:18:26
 at org.apache.hadoop.fs.Path.initialize(Path.java:206)
 at org.apache.hadoop.fs.Path.<init>(Path.java:172)
 at org.apache.hadoop.fs.Path.<init>(Path.java:94)
 at 
 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230)
 at 
 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
 at 
 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
 at 
 scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
 at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
 at 
 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:243)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:113)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
 ts=2011-01-11+15:18:26
 at java.net.URI.checkPath(URI.java:1804)
 at java.net.URI.<init>(URI.java:752)
 at org.apache.hadoop.fs.Path.initialize(Path.java:203)
 ... 21 more
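The underlying problem is the ':' in the dynamic partition value: Hadoop's Path treats text before the first ':' as a URI scheme, so building a path from the raw value fails. A minimal Scala sketch that reproduces this and shows the escaping idea (the escapePartitionValue helper is illustrative; the actual fix escapes the value the way Hive does):
{code}
import java.net.URLEncoder
import org.apache.hadoop.fs.Path

object PartitionPaths {
  // Throws IllegalArgumentException (wrapping the URISyntaxException above),
  // because the child segment contains an unescaped ':'.
  def broken(): Path =
    new Path(new Path("/user/hive/warehouse/sc_part"), "ts=2011-01-11+15:18:26")

  // Illustrative escaping of the partition value before it becomes a directory
  // name; Hive escapes special characters such as ':' in a similar spirit.
  def escapePartitionValue(value: String): String =
    URLEncoder.encode(value, "UTF-8")

  def fixed(): Path =
    new Path(new Path("/user/hive/warehouse/sc_part"),
      "ts=" + escapePartitionValue("2011-01-11+15:18:26"))
}
{code}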






[jira] [Resolved] (SPARK-5592) java.net.URISyntaxException when insert data to a partitioned table

2015-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-5592.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4368
[https://github.com/apache/spark/pull/4368]

 java.net.URISyntaxException when insert data to a partitioned table  
 -

 Key: SPARK-5592
 URL: https://issues.apache.org/jira/browse/SPARK-5592
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei
 Fix For: 1.3.0


 create table sc as select * 
 from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
   union all 
   select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
   union all 
   select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 
 rows) ) s;
 create table sc_part (key string) partitioned by (ts string) stored as rcfile;
 set hive.exec.dynamic.partition=true;
 set hive.exec.dynamic.partition.mode=nonstrict;
 insert overwrite table sc_part partition(ts) select * from sc;
 java.net.URISyntaxException: Relative path in absolute URI: 
 ts=2011-01-11+15:18:26
 at org.apache.hadoop.fs.Path.initialize(Path.java:206)
 at org.apache.hadoop.fs.Path.<init>(Path.java:172)
 at org.apache.hadoop.fs.Path.<init>(Path.java:94)
 at 
 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230)
 at 
 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
 at 
 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
 at 
 scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
 at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
 at 
 org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:243)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:113)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
 at 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
 ts=2011-01-11+15:18:26
 at java.net.URI.checkPath(URI.java:1804)
 at java.net.URI.<init>(URI.java:752)
 at org.apache.hadoop.fs.Path.initialize(Path.java:203)
 ... 21 more






[jira] [Resolved] (SPARK-5668) spark_ec2.py region parameter could be either mandatory or its value displayed

2015-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5668.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4457
[https://github.com/apache/spark/pull/4457]

 spark_ec2.py region parameter could be either mandatory or its value displayed
 --

 Key: SPARK-5668
 URL: https://issues.apache.org/jira/browse/SPARK-5668
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Miguel Peralvo
Priority: Minor
  Labels: starter
 Fix For: 1.4.0


 If the region parameter is not specified when invoking spark-ec2 
 (spark-ec2.py behind the scenes), it defaults to us-east-1. When the cluster 
 doesn't belong to that region, after showing the "Searching for existing 
 cluster Spark..." message, it fails with an "ERROR: Could not find any existing 
 cluster" exception because it doesn't find your cluster in the default region.
 As it doesn't tell you anything about the region, it can be a small headache 
 for new users.
 In 
 http://stackoverflow.com/questions/21171576/why-does-spark-ec2-fail-with-error-could-not-find-any-existing-cluster,
  Dmitriy Selivanov explains it.
 I propose that:
 1. Either we make the search message a little bit more informative with 
 something like "Searching for existing cluster Spark in region " + 
 opts.region.
 2. Or we remove us-east-1 as the default and make the --region parameter 
 mandatory.






[jira] [Closed] (SPARK-5668) spark_ec2.py region parameter could be either mandatory or its value displayed

2015-02-10 Thread Miguel Peralvo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miguel Peralvo closed SPARK-5668.
-

 spark_ec2.py region parameter could be either mandatory or its value displayed
 --

 Key: SPARK-5668
 URL: https://issues.apache.org/jira/browse/SPARK-5668
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.2.0, 1.3.0, 1.4.0
Reporter: Miguel Peralvo
Priority: Minor
  Labels: starter
 Fix For: 1.4.0


 If the region parameter is not specified when invoking spark-ec2 
 (spark-ec2.py behind the scenes), it defaults to us-east-1. When the cluster 
 doesn't belong to that region, after showing the "Searching for existing 
 cluster Spark..." message, it fails with an "ERROR: Could not find any existing 
 cluster" exception because it doesn't find your cluster in the default region.
 As it doesn't tell you anything about the region, it can be a small headache 
 for new users.
 In 
 http://stackoverflow.com/questions/21171576/why-does-spark-ec2-fail-with-error-could-not-find-any-existing-cluster,
  Dmitriy Selivanov explains it.
 I propose that:
 1. Either we make the search message a little bit more informative with 
 something like "Searching for existing cluster Spark in region " + 
 opts.region.
 2. Or we remove us-east-1 as the default and make the --region parameter 
 mandatory.






[jira] [Commented] (SPARK-5613) YarnClientSchedulerBackend fails to get application report when yarn restarts

2015-02-10 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314821#comment-14314821
 ] 

Patrick Wendell commented on SPARK-5613:


I have cherry picked it into the 1.3 branch.

 YarnClientSchedulerBackend fails to get application report when yarn restarts
 -

 Key: SPARK-5613
 URL: https://issues.apache.org/jira/browse/SPARK-5613
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Kashish Jain
Assignee: Kashish Jain
Priority: Minor
 Fix For: 1.3.0, 1.2.2

   Original Estimate: 24h
  Remaining Estimate: 24h

 Steps to Reproduce
 1) Run any spark job
 2) Stop yarn while the spark job is running (an application id has been 
 generated by now)
 3) Restart yarn now
 4) The AsyncMonitorApplication thread fails with an ApplicationNotFoundException, 
 which terminates the thread.
 Here is the StackTrace
 15/02/05 05:22:37 INFO Client: Retrying connect to server: 
 nn1/192.168.173.176:8032. Already tried 6 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 15/02/05 05:22:38 INFO Client: Retrying connect to server: 
 nn1/192.168.173.176:8032. Already tried 7 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 15/02/05 05:22:39 INFO Client: Retrying connect to server: 
 nn1/192.168.173.176:8032. Already tried 8 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 15/02/05 05:22:40 INFO Client: Retrying connect to server: 
 nn1/192.168.173.176:8032. Already tried 9 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 15/02/05 05:22:40 INFO Client: Retrying connect to server: 
 nn1/192.168.173.176:8032. Already tried 9 time(s); retry policy is 
 RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 Exception in thread "Yarn application state monitor" 
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
 with id 'application_1423113179043_0003' doesn't exist in RM.
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at 
 org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at 
 org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Unknown Source)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
 Source)
   at java.lang.reflect.Constructor.newInstance(Unknown Source)
   at 
 org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
   at 
 org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:166)
   at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
   at java.lang.reflect.Method.invoke(Unknown Source)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
   at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291)
   at 
 org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:116)
   at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:120)
 Caused by: 
 

[jira] [Updated] (SPARK-5592) java.net.URISyntaxException when insert data to a partitioned table

2015-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-5592:
--
Description: 
{code}
create table sc as select * 
from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
  union all 
  select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
  union all 
  select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows) 
) s;

create table sc_part (key string) partitioned by (ts string) stored as rcfile;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table sc_part partition(ts) select * from sc;
{code}
Exception thrown:
{code}
java.net.URISyntaxException: Relative path in absolute URI: 
ts=2011-01-11+15:18:26
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at 
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230)
at 
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
at 
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
at 
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
at 
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:243)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:113)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
ts=2011-01-11+15:18:26
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 21 more
{code}

  was:
create table sc as select * 
from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
  union all 
  select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
  union all 
  select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows) 
) s;

create table sc_part (key string) partitioned by (ts string) stored as rcfile;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table sc_part partition(ts) select * from sc;

java.net.URISyntaxException: Relative path in absolute URI: 
ts=2011-01-11+15:18:26
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at 
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230)
at 
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
at 
org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
at 
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
at 

[jira] [Commented] (SPARK-5645) Track local bytes read for shuffles - update UI

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314811#comment-14314811
 ] 

Apache Spark commented on SPARK-5645:
-

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/4510

 Track local bytes read for shuffles - update UI
 ---

 Key: SPARK-5645
 URL: https://issues.apache.org/jira/browse/SPARK-5645
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis

 Currently we do not track the local bytes read for a shuffle read. The UI 
 only shows the remote bytes read. This is pretty confusing to the user 
 because:
 1) In local mode all shuffle reads are local
 2) the shuffle bytes written from the previous stage might not add up if 
 there are some bytes that are read locally on the shuffle read side
 3) With https://github.com/apache/spark/pull/4067 we display the total number 
 of records so that won't line up with only showing the remote bytes read. 
 I propose we track the remote and local bytes read separately. In the UI show 
 the total bytes read and in brackets show the remote bytes read for a 
 shuffle. 
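A minimal Scala sketch of the proposed accounting (the field names are illustrative, not the actual shuffle read metrics fields): track remote and local bytes separately and derive the displayed total.
{code}
// Illustrative metrics holder; the real change would extend Spark's shuffle
// read metrics, these names are only for the sketch.
case class ShuffleReadBytes(remoteBytesRead: Long, localBytesRead: Long) {
  def totalBytesRead: Long = remoteBytesRead + localBytesRead

  // UI rendering proposed above: the total, with the remote portion in brackets.
  def render: String = s"$totalBytesRead ($remoteBytesRead remote)"
}

// ShuffleReadBytes(remoteBytesRead = 1024, localBytesRead = 512).render
//   => "1536 (1024 remote)"
{code}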






[jira] [Created] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark

2015-02-10 Thread Don Drake (JIRA)
Don Drake created SPARK-5722:


 Summary: Infer_schema_type incorrect for Integers in pyspark
 Key: SPARK-5722
 URL: https://issues.apache.org/jira/browse/SPARK-5722
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Don Drake



The Integers datatype in Python does not match what a Scala/Java integer is 
defined as.   This causes inference of data types and schemas to fail when data 
is larger than 2^32 and it is inferred incorrectly as an Integer.

Since the range of valid Python integers is wider than Java Integers, this 
causes problems when inferring Integer vs. Long datatypes.  This will cause 
problems when attempting to save SchemaRDD as Parquet or JSON.

Here's an example:

>>> sqlCtx = SQLContext(sc)
>>> from pyspark.sql import Row
>>> rdd = sc.parallelize([Row(f1='a', f2=100)])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))

That number is a LongType in Java, but an Integer in Python. We need to check 
the value to see if it should really be a LongType when an IntegerType is 
initially inferred.

More tests:
>>> from pyspark.sql import _infer_type
# OK
>>> print _infer_type(1)
IntegerType
# OK
>>> print _infer_type(2**31-1)
IntegerType
#WRONG
>>> print _infer_type(2**31)
#WRONG
IntegerType
>>> print _infer_type(2**61)
#OK
IntegerType
>>> print _infer_type(2**71)
LongType

Java Primitive Types defined:
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html

Python Built-in Types:
https://docs.python.org/2/library/stdtypes.html#typesnumeric
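On the JVM side the cut-off is Int.MaxValue (2^31 - 1): anything above it has to map to LongType. A minimal Scala sketch of the range check the Python inference would need (inferIntegralType is a hypothetical helper, not pyspark code):
{code}
// Hypothetical range check mirroring what _infer_type would have to do:
// a Python int only fits Spark SQL's IntegerType if it fits in a JVM Int.
def inferIntegralType(value: Long): String =
  if (value >= Int.MinValue && value <= Int.MaxValue) "IntegerType" else "LongType"

// inferIntegralType(100L)       // "IntegerType"
// inferIntegralType(1L << 31)   // "LongType" -- 2**31 no longer fits in a JVM Int
{code}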







[jira] [Updated] (SPARK-5732) Add a option to print the spark version in spark script

2015-02-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-5732:

Description: Naturally, we may need to add a option to 

 Add a option to print the spark version in spark script
 ---

 Key: SPARK-5732
 URL: https://issues.apache.org/jira/browse/SPARK-5732
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: uncleGen
Priority: Minor

 Naturally, we may need to add a option to 






[jira] [Updated] (SPARK-5732) Add a option to print the spark version in spark script

2015-02-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-5732:

Description: Naturally, we may need to add a option to print the spark 
version in spark script. It  (was: Naturally, we may need to add a option to )

 Add a option to print the spark version in spark script
 ---

 Key: SPARK-5732
 URL: https://issues.apache.org/jira/browse/SPARK-5732
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: uncleGen
Priority: Minor

 Naturally, we may need to add a option to print the spark version in spark 
 script. It






[jira] [Updated] (SPARK-5732) Add an option to print the spark version in spark script

2015-02-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-5732:

Summary: Add an option to print the spark version in spark script  (was: 
Add a option to print the spark version in spark script)

 Add an option to print the spark version in spark script
 

 Key: SPARK-5732
 URL: https://issues.apache.org/jira/browse/SPARK-5732
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: uncleGen
Priority: Minor

 Naturally, we may need to add an option to print the spark version in spark 
 script. It is pretty common in many script tools
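A minimal Scala sketch of how such a flag could behave (the argument handling is illustrative, not the actual spark-submit change; it assumes SPARK_VERSION, the version constant in Spark's package object):
{code}
import org.apache.spark.SPARK_VERSION

// Illustrative handling: print the version and exit when --version is passed.
def maybePrintVersion(args: Array[String]): Unit = {
  if (args.contains("--version") || args.contains("-v")) {
    println(s"Spark version $SPARK_VERSION")
    sys.exit(0)
  }
}
{code}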






[jira] [Updated] (SPARK-5732) Add a option to print the spark version in spark script

2015-02-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-5732:

Description: Naturally, we may need to add an option to print the spark 
version in spark script. It is pretty common in many script tools  (was: 
Naturally, we may need to add a option to print the spark version in spark 
script. It is pretty common in many script tools)

 Add a option to print the spark version in spark script
 ---

 Key: SPARK-5732
 URL: https://issues.apache.org/jira/browse/SPARK-5732
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: uncleGen
Priority: Minor

 Naturally, we may need to add an option to print the spark version in spark 
 script. It is pretty common in many script tools






[jira] [Updated] (SPARK-5732) Add a option to print the spark version in spark script

2015-02-10 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-5732:

Description: Naturally, we may need to add a option to print the spark 
version in spark script. It is pretty common in many script tools  (was: 
Naturally, we may need to add a option to print the spark version in spark 
script. It)

 Add a option to print the spark version in spark script
 ---

 Key: SPARK-5732
 URL: https://issues.apache.org/jira/browse/SPARK-5732
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: uncleGen
Priority: Minor

 Naturally, we may need to add a option to print the spark version in spark 
 script. It is pretty common in many script tools






[jira] [Created] (SPARK-5733) Error Link in Pagination of HistroyPage when showing Incomplete Applications

2015-02-10 Thread Mars Gu (JIRA)
Mars Gu created SPARK-5733:
--

 Summary: Error Link in Pagination of HistroyPage when showing 
Incomplete Applications 
 Key: SPARK-5733
 URL: https://issues.apache.org/jira/browse/SPARK-5733
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Mars Gu


The links in the pagination of HistoryPage are wrong when showing Incomplete 
Applications.

If "2" is clicked on the page 
http://history-server:18080/?page=1&showIncomplete=true, it will go to 
http://history-server:18080/?page=2 instead of 
http://history-server:18080/?page=2&showIncomplete=true.
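The fix is to carry the showIncomplete flag through when the page links are built. A minimal Scala sketch (pageLink is an illustrative helper, not the exact HistoryPage code):
{code}
// Illustrative link builder: keep showIncomplete in the query string so that
// paging through incomplete applications stays on the incomplete listing.
def pageLink(page: Int, showIncomplete: Boolean): String =
  s"/?page=$page&showIncomplete=$showIncomplete"

// pageLink(2, showIncomplete = true)  // "/?page=2&showIncomplete=true"
{code}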






[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-02-10 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315605#comment-14315605
 ] 

Jason Dai commented on SPARK-5654:
--

I agree with this proposal. Given all ongoing efforts around data analytics in 
Spark (e.g., DataFrame, ml, etc.), an R frontend for Spark seems to be very 
well aligned with the project's future plans.

 Integrate SparkR into Apache Spark
 --

 Key: SPARK-5654
 URL: https://issues.apache.org/jira/browse/SPARK-5654
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Reporter: Shivaram Venkataraman

 The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
 from R. The project was started at the AMPLab around a year ago and has been 
 incubated as its own project to make sure it can be easily merged into 
 upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s 
 goals are similar to PySpark's, and it shares a similar design pattern, as described 
 in our meetup talk [2] and Spark Summit presentation [3].
 Integrating SparkR into the Apache project will enable R users to use Spark 
 out of the box and given R’s large user base, it will help the Spark project 
 reach more users.  Additionally, work in progress features like providing R 
 integration with ML Pipelines and Dataframes can be better achieved by 
 development in a unified code base.
 SparkR is available under the Apache 2.0 License and does not have any 
 external dependencies other than requiring users to have R and Java installed 
 on their machines. SparkR's developers come from many organizations, 
 including UC Berkeley, Alteryx, and Intel, and we will support future development 
 and maintenance after the integration.
 [1] https://github.com/amplab-extras/SparkR-pkg
 [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
 [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2






[jira] [Resolved] (SPARK-5702) Allow short names for built-in data sources

2015-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5702.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Allow short names for built-in data sources
 ---

 Key: SPARK-5702
 URL: https://issues.apache.org/jira/browse/SPARK-5702
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.3.0


 e.g. json, parquet, jdbc.






[jira] [Updated] (SPARK-5677) Python DataFrame API remaining tasks

2015-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5677:
---
Description: 
- DataFrame.renameColumn
- DataFrame.show (also we should override __repr__ or __str__)
- dtypes should use simpleString rather than jsonValue


  was:
- DataFrame.renameColumn
- DataFrame.show (also we should override __repr__ or __str__)



 Python DataFrame API remaining tasks
 

 Key: SPARK-5677
 URL: https://issues.apache.org/jira/browse/SPARK-5677
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 - DataFrame.renameColumn
 - DataFrame.show (also we should override __repr__ or __str__)
 - dtypes should use simpleString rather than jsonValue






[jira] [Updated] (SPARK-4382) Add locations parameter to Twitter Stream

2015-02-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4382:
---
Component/s: Streaming

 Add locations parameter to Twitter Stream
 -

 Key: SPARK-4382
 URL: https://issues.apache.org/jira/browse/SPARK-4382
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Liang-Chi Hsieh

 When we request the Tweet stream, geo-location is one of the most important 
 parameters. In addition to the track parameter, the locations parameter is 
 widely used to ask for the Tweets falling within the requested bounding 
 boxes. This PR adds the locations parameter to the existing APIs.






[jira] [Updated] (SPARK-3688) LogicalPlan can't resolve column correctlly

2015-02-10 Thread Yi Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Tian updated SPARK-3688:
---
Description: 
How to reproduce this problem:
{code}
CREATE TABLE t1(x INT);
CREATE TABLE t2(a STRUCT<x: INT>, k INT);
SELECT a.x FROM t1 a JOIN t2 b ON a.x = b.k;
{code}

  was:
How to reproduce this problem:
create a table:
{code}
create table test (a string, b string);
{code}
execute sql:
{code}
select a.b ,count(1) from test a join test t group by a.b;
{code}


 LogicalPlan can't resolve column correctlly
 ---

 Key: SPARK-3688
 URL: https://issues.apache.org/jira/browse/SPARK-3688
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yi Tian

 How to reproduce this problem:
 {code}
 CREATE TABLE t1(x INT);
 CREATE TABLE t2(a STRUCT<x: INT>, k INT);
 SELECT a.x FROM t1 a JOIN t2 b ON a.x = b.k;
 {code}






[jira] [Created] (SPARK-5734) Allow creating a DataFrame from local Python data

2015-02-10 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5734:
--

 Summary: Allow creating a DataFrame from local Python data
 Key: SPARK-5734
 URL: https://issues.apache.org/jira/browse/SPARK-5734
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Davies Liu


Maybe a local Python list and a Pandas DataFrame.






[jira] [Updated] (SPARK-5677) Python DataFrame API remaining tasks

2015-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5677:
---
Description: 
- DataFrame.renameColumn
- DataFrame.show (also we should override __repr__ or __str__)


  was:
- DataFrame.renameColumn
- DataFrame.show
- Create data frame from local data collection (does this work?)
- Move all data types into a types package
- load/saveAsTable/createExternalTable, etc ( see 
https://github.com/apache/spark/pull/4446 )


 Python DataFrame API remaining tasks
 

 Key: SPARK-5677
 URL: https://issues.apache.org/jira/browse/SPARK-5677
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 - DataFrame.renameColumn
 - DataFrame.show (also we should override __repr__ or __str__)






[jira] [Resolved] (SPARK-5714) Refactor initial step of LDA to remove redundant operations

2015-02-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5714.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4501
[https://github.com/apache/spark/pull/4501]

 Refactor initial step of LDA to remove redundant operations
 ---

 Key: SPARK-5714
 URL: https://issues.apache.org/jira/browse/SPARK-5714
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.3.0


 The initialState of LDA performs several RDD operations that look redundant. 
 This PR tries to simplify these operations.






[jira] [Updated] (SPARK-5714) Refactor initial step of LDA to remove redundant operations

2015-02-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5714:
-
Assignee: Liang-Chi Hsieh

 Refactor initial step of LDA to remove redundant operations
 ---

 Key: SPARK-5714
 URL: https://issues.apache.org/jira/browse/SPARK-5714
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.3.0


 The initialState of LDA performs several RDD operations that look redundant. 
 This PR tries to simplify these operations.






[jira] [Updated] (SPARK-5522) Accelerate the Histroty Server start

2015-02-10 Thread Mars Gu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars Gu updated SPARK-5522:
---
Description: 
When starting the history server, all the log files are fetched and parsed 
in order to get the applications' metadata, e.g. App Name, Start Time, 
Duration, etc. In our production cluster, there are 2600 log files (160 GB) in 
HDFS and it takes 3 hours to restart the history server, which is a little bit 
too long for us.

It would be better if the history server could show applications with missing 
information during start-up and fill in the missing information after fetching and 
parsing each log file.

  was:
When starting the history server, all the log files will be fetched and parsed 
in order to get the applications' meta data e.g. App Name, Start Time, 
Duration, etc. In our production cluster, there exist 2600 log files (160G) in 
HDFS and it costs 3 hours to restart the history server, which is a little bit 
too long for us.

It would be better, if the history server does not fetch all the log files but 
only the meta data during start-up.



 Accelerate the Histroty Server start
 

 Key: SPARK-5522
 URL: https://issues.apache.org/jira/browse/SPARK-5522
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Mars Gu

 When starting the history server, all the log files are fetched and 
 parsed in order to get the applications' metadata, e.g. App Name, Start Time, 
 Duration, etc. In our production cluster, there are 2600 log files (160 GB) 
 in HDFS and it takes 3 hours to restart the history server, which is a little 
 bit too long for us.
 It would be better if the history server could show applications with missing 
 information during start-up and fill in the missing information after fetching 
 and parsing each log file.
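A minimal Scala sketch of that idea, under the assumption of a hypothetical listing type: register a placeholder per log file at start-up and fill in the parsed metadata from a background thread.
{code}
import java.util.concurrent.{ConcurrentHashMap, Executors}
import scala.collection.JavaConverters._

// Hypothetical application entry: fields stay None until the log is parsed.
case class AppEntry(logPath: String, name: Option[String], startTime: Option[Long])

class LazyHistoryListing(logPaths: Seq[String], parse: String => AppEntry) {
  private val apps = new ConcurrentHashMap[String, AppEntry]()
  private val pool = Executors.newSingleThreadExecutor()

  // Start-up is cheap: register a placeholder for every log file immediately...
  logPaths.foreach(p => apps.put(p, AppEntry(p, None, None)))

  // ...and fill in the missing information asynchronously, one file at a time.
  logPaths.foreach { p =>
    pool.submit(new Runnable { def run(): Unit = apps.put(p, parse(p)) })
  }

  // The listing can be served right away; entries get richer as parsing catches up.
  def listing: Seq[AppEntry] = apps.values.asScala.toSeq
}
{code}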






[jira] [Commented] (SPARK-5522) Accelerate the Histroty Server start

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315686#comment-14315686
 ] 

Apache Spark commented on SPARK-5522:
-

User 'marsishandsome' has created a pull request for this issue:
https://github.com/apache/spark/pull/4525

 Accelerate the Histroty Server start
 

 Key: SPARK-5522
 URL: https://issues.apache.org/jira/browse/SPARK-5522
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Mars Gu

 When starting the history server, all the log files are fetched and 
 parsed in order to get the applications' metadata, e.g. App Name, Start Time, 
 Duration, etc. In our production cluster, there are 2600 log files (160 GB) 
 in HDFS and it takes 3 hours to restart the history server, which is a little 
 bit too long for us.
 It would be better if the history server could show applications with missing 
 information during start-up and fill in the missing information after fetching 
 and parsing each log file.






[jira] [Commented] (SPARK-5522) Accelerate the Histroty Server start

2015-02-10 Thread Mars Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315683#comment-14315683
 ] 

Mars Gu commented on SPARK-5522:


https://github.com/apache/spark/pull/4525

 Accelerate the Histroty Server start
 

 Key: SPARK-5522
 URL: https://issues.apache.org/jira/browse/SPARK-5522
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Mars Gu

 When starting the history server, all the log files are fetched and 
 parsed in order to get the applications' metadata, e.g. App Name, Start Time, 
 Duration, etc. In our production cluster, there are 2600 log files (160 GB) 
 in HDFS and it takes 3 hours to restart the history server, which is a little 
 bit too long for us.
 It would be better if the history server could show applications with missing 
 information during start-up and fill in the missing information after fetching 
 and parsing each log file.






[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315569#comment-14315569
 ] 

Apache Spark commented on SPARK-5722:
-

User 'dondrake' has created a pull request for this issue:
https://github.com/apache/spark/pull/4521

 Infer_schema_type incorrect for Integers in pyspark
 ---

 Key: SPARK-5722
 URL: https://issues.apache.org/jira/browse/SPARK-5722
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Don Drake

 The Integers datatype in Python does not match what a Scala/Java integer is 
 defined as.   This causes inference of data types and schemas to fail when 
 data is larger than 2^32 and it is inferred incorrectly as an Integer.
 Since the range of valid Python integers is wider than Java Integers, this 
 causes problems when inferring Integer vs. Long datatypes.  This will cause 
 problems when attempting to save SchemaRDD as Parquet or JSON.
 Here's an example:
 {code}
 >>> sqlCtx = SQLContext(sc)
 >>> from pyspark.sql import Row
 >>> rdd = sc.parallelize([Row(f1='a', f2=100)])
 >>> srdd = sqlCtx.inferSchema(rdd)
 >>> srdd.schema()
 StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
 {code}
 That number is a LongType in Java, but an Integer in Python. We need to 
 check the value to see if it should really be a LongType when an IntegerType 
 is initially inferred.
 More tests:
 {code}
 >>> from pyspark.sql import _infer_type
 # OK
 >>> print _infer_type(1)
 IntegerType
 # OK
 >>> print _infer_type(2**31-1)
 IntegerType
 #WRONG
 >>> print _infer_type(2**31)
 #WRONG
 IntegerType
 >>> print _infer_type(2**61)
 #OK
 IntegerType
 >>> print _infer_type(2**71)
 LongType
 {code}
 Java Primitive Types defined:
 http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
 Python Built-in Types:
 https://docs.python.org/2/library/stdtypes.html#typesnumeric






[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-02-10 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315695#comment-14315695
 ] 

Manoj Kumar commented on SPARK-5016:


[~tgaloppo] How about a method setParallelGaussianUpdate(bool) (defaulting to 
False) which would allow the user to decide whether to use this feature or not?

[~mengxr] I would like to know your thoughts on this as well.
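For illustration, a minimal Scala sketch of what such a switch could look like (the class and method names are hypothetical, following the proposal above, not MLlib's actual API):
{code}
// Hypothetical builder-style flag: when enabled, the per-Gaussian covariance
// and matrix-inverse updates would be computed on the cluster instead of the driver.
class GaussianMixtureEMSketch {
  private var parallelGaussianUpdate: Boolean = false

  def setParallelGaussianUpdate(parallel: Boolean): this.type = {
    this.parallelGaussianUpdate = parallel
    this
  }

  def getParallelGaussianUpdate: Boolean = parallelGaussianUpdate
}
{code}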

 GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
 ---

 Key: SPARK-5016
 URL: https://issues.apache.org/jira/browse/SPARK-5016
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 If numFeatures or k are large, GMM EM should distribute the matrix inverse 
 computation for Gaussian initialization.






[jira] [Created] (SPARK-5735) Replace uses of EasyMock with Mockito

2015-02-10 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5735:
--

 Summary: Replace uses of EasyMock with Mockito
 Key: SPARK-5735
 URL: https://issues.apache.org/jira/browse/SPARK-5735
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Patrick Wendell


There are a few reasons we should drop EasyMock. First, we should have a single 
mocking framework in our tests in general to keep things consistent. Second, 
EasyMock has caused us some dependency pain in our tests due to objenesis. We 
aren't totally sure, but we suspect such conflicts might be causing 
non-deterministic test failures.
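For reference, the Mockito style being moved to is compact; a minimal sketch of a stub-and-verify test in Scala (plain Mockito and ScalaTest, no Spark-specific code):
{code}
import org.mockito.Mockito.{mock, verify, when}
import org.scalatest.FunSuite

class MockitoStyleSuite extends FunSuite {
  test("stub and verify with Mockito") {
    // Stub a collaborator directly; no EasyMock-style expect/replay/verify cycle.
    val files = mock(classOf[java.util.List[String]])
    when(files.get(0)).thenReturn("event.log")

    assert(files.get(0) === "event.log")
    verify(files).get(0)
  }
}
{code}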






[jira] [Updated] (SPARK-5735) Replace uses of EasyMock with Mockito

2015-02-10 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5735:
---
Assignee: Josh Rosen

 Replace uses of EasyMock with Mockito
 -

 Key: SPARK-5735
 URL: https://issues.apache.org/jira/browse/SPARK-5735
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Patrick Wendell
Assignee: Josh Rosen

 There are a few reasons we should drop EasyMock. First, we should have a 
 single mocking framework in our tests in general to keep things consistent. 
 Second, EasyMock has caused us some dependency pain in our tests due to 
 objenesis. We aren't totally sure, but we suspect such conflicts might be causing 
 non-deterministic test failures.






[jira] [Created] (SPARK-5732) Add a option to print the spark version in spark script

2015-02-10 Thread uncleGen (JIRA)
uncleGen created SPARK-5732:
---

 Summary: Add a option to print the spark version in spark script
 Key: SPARK-5732
 URL: https://issues.apache.org/jira/browse/SPARK-5732
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: uncleGen
Priority: Minor









[jira] [Commented] (SPARK-5732) Add an option to print the spark version in spark script

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315585#comment-14315585
 ] 

Apache Spark commented on SPARK-5732:
-

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4522

 Add an option to print the spark version in spark script
 

 Key: SPARK-5732
 URL: https://issues.apache.org/jira/browse/SPARK-5732
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: uncleGen
Priority: Minor

 Naturally, we may need to add an option to print the spark version in spark 
 script. It is pretty common in many script tools






[jira] [Commented] (SPARK-5733) Error Link in Pagination of HistroyPage when showing Incomplete Applications

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315614#comment-14315614
 ] 

Apache Spark commented on SPARK-5733:
-

User 'marsishandsome' has created a pull request for this issue:
https://github.com/apache/spark/pull/4523

 Error Link in Pagination of HistroyPage when showing Incomplete Applications 
 -

 Key: SPARK-5733
 URL: https://issues.apache.org/jira/browse/SPARK-5733
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Mars Gu

 The links in the pagination of HistoryPage are wrong when showing Incomplete 
 Applications. 
 If "2" is clicked on the page 
 http://history-server:18080/?page=1&showIncomplete=true, it will go to 
 http://history-server:18080/?page=2 instead of 
 http://history-server:18080/?page=2&showIncomplete=true.






[jira] [Commented] (SPARK-5733) Error Link in Pagination of HistroyPage when showing Incomplete Applications

2015-02-10 Thread Mars Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315611#comment-14315611
 ] 

Mars Gu commented on SPARK-5733:


https://github.com/apache/spark/pull/4523

 Error Link in Pagination of HistroyPage when showing Incomplete Applications 
 -

 Key: SPARK-5733
 URL: https://issues.apache.org/jira/browse/SPARK-5733
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Mars Gu

 The links in the pagination of HistoryPage are wrong when showing Incomplete 
 Applications. 
 If "2" is clicked on the page 
 http://history-server:18080/?page=1&showIncomplete=true, it will go to 
 http://history-server:18080/?page=2 instead of 
 http://history-server:18080/?page=2&showIncomplete=true.






[jira] [Closed] (SPARK-4945) Add overwrite option support for SchemaRDD.saveAsParquetFile

2015-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-4945.
--
Resolution: Implemented

 Add overwrite option support for SchemaRDD.saveAsParquetFile
 

 Key: SPARK-4945
 URL: https://issues.apache.org/jira/browse/SPARK-4945
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor








[jira] [Commented] (SPARK-3688) LogicalPlan can't resolve column correctlly

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315625#comment-14315625
 ] 

Apache Spark commented on SPARK-3688:
-

User 'tianyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/4524

 LogicalPlan can't resolve column correctlly
 ---

 Key: SPARK-3688
 URL: https://issues.apache.org/jira/browse/SPARK-3688
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yi Tian

 How to reproduce this problem:
 {code}
 CREATE TABLE t1(x INT);
 CREATE TABLE t2(a STRUCT<x: INT>, k INT);
 SELECT a.x FROM t1 a JOIN t2 b ON a.x = b.k;
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5183) Document data source API

2015-02-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5183:
---
Description: We need to document the data types the caller needs to support.

 Document data source API
 

 Key: SPARK-5183
 URL: https://issues.apache.org/jira/browse/SPARK-5183
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Reporter: Yin Huai
Priority: Blocker

 We need to document the data types the caller needs to support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5568) Python API for the write support of the data source API

2015-02-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-5568.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

It has been resolved by https://github.com/apache/spark/pull/4446.

 Python API for the write support of the data source API
 ---

 Key: SPARK-5568
 URL: https://issues.apache.org/jira/browse/SPARK-5568
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5706) Support inference schema from a single json string

2015-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5706.
--
Resolution: Duplicate

 Support inference schema from a single json string
 --

 Key: SPARK-5706
 URL: https://issues.apache.org/jira/browse/SPARK-5706
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao

 We noticed some developers complaining that the JSON parsing is very slow, 
 particularly the schema inference. Some of them suggest providing 
 a simple interface for inferring the schema from a single complete 
 JSON string record, instead of sampling.
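
 As an illustration only (not the proposed API): with the existing interfaces 
 one can already infer a schema from a single complete JSON record by wrapping 
 it in a one-element RDD, which avoids sampling a large dataset.
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.sql.SQLContext

 val sc = new SparkContext("local[*]", "schema-from-one-record")  // assumed local setup
 val sqlContext = new SQLContext(sc)

 val oneRecord = """{"name": "alice", "age": 30, "tags": ["a", "b"]}"""
 // Infer the schema from just this one record and print it.
 sqlContext.jsonRDD(sc.parallelize(Seq(oneRecord))).printSchema()
 {code}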



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4336) auto detect type from json string

2015-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4336.
--
Resolution: Won't Fix

Per PR discussion, WontFix due to concerns over speed

 auto detect type from json string
 -

 Key: SPARK-4336
 URL: https://issues.apache.org/jira/browse/SPARK-4336
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5724) misconfiguration in Akka system

2015-02-10 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-5724:
--

 Summary: misconfiguration in Akka system
 Key: SPARK-5724
 URL: https://issues.apache.org/jira/browse/SPARK-5724
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.1.0
Reporter: Nan Zhu


In AkkaUtil, we set several failure-detector-related parameters as follows:

{code:title=AkkaUtil.scala|borderStyle=solid}
val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
  .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
  s"""
  |akka.daemonic = on
  |akka.loggers = ["akka.event.slf4j.Slf4jLogger"]
  |akka.stdout-loglevel = ERROR
  |akka.jvm-exit-on-fatal-error = off
  |akka.remote.require-cookie = $requireCookie
  |akka.remote.secure-cookie = $secureCookie
  |akka.remote.transport-failure-detector.heartbeat-interval = 
$akkaHeartBeatInterval s
  |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = 
$akkaHeartBeatPauses s
  |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
  |akka.actor.provider = akka.remote.RemoteActorRefProvider
  |akka.remote.netty.tcp.transport-class = 
akka.remote.transport.netty.NettyTransport
  |akka.remote.netty.tcp.hostname = $host
  |akka.remote.netty.tcp.port = $port
  |akka.remote.netty.tcp.tcp-nodelay = on
  |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
  |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
  |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
  |akka.actor.default-dispatcher.throughput = $akkaBatchSize
  |akka.log-config-on-start = $logAkkaConfig
  |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
  |akka.log-dead-letters = $lifecycleEvents
  |akka.log-dead-letters-during-shutdown = $lifecycleEvents
  """.stripMargin))

{code}

Actually, there is no parameter named 
akka.remote.transport-failure-detector.threshold

see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html

What we have is akka.remote.watch-failure-detector.threshold.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5343) ShortestPaths traverses backwards

2015-02-10 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-5343.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4478
https://github.com/apache/spark/pull/4478

 ShortestPaths traverses backwards
 -

 Key: SPARK-5343
 URL: https://issues.apache.org/jira/browse/SPARK-5343
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0
Reporter: Michael Malak
 Fix For: 1.3.0


 GraphX ShortestPaths seems to be following edges backwards instead of 
 forwards:
 import org.apache.spark.graphx._
 val g = Graph(sc.makeRDD(Array((1L, ""), (2L, ""), (3L, ""))), 
 sc.makeRDD(Array(Edge(1L, 2L, ""), Edge(2L, 3L, ""))))
 lib.ShortestPaths.run(g,Array(3)).vertices.collect
 res1: Array[(org.apache.spark.graphx.VertexId, 
 org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 
 -> 0)), (2,Map()))
 lib.ShortestPaths.run(g,Array(1)).vertices.collect
 res2: Array[(org.apache.spark.graphx.VertexId, 
 org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), 
 (3,Map(1 -> 2)), (2,Map(1 -> 1)))
 The following changes may be what will make it run forward:
 Change one occurrence of src to dst in
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64
 Change three occurrences of dst to src in
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-02-10 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315087#comment-14315087
 ] 

Andrew Ash commented on SPARK-4879:
---

This is really great work [~joshrosen]!  I really appreciate the effort you're 
putting into getting this one figured out since these kind of non-deterministic 
bugs are the most painful for both users and devs to figure out.

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run really slow
     if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
       Thread.sleep(20 * 1000)
     }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 console:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 console:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
 org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}
 One interesting thing to note about this stack trace: if we look at 
 {{FileOutputCommitter.java:160}} 
 ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]),
  this point in the execution seems to correspond to a case where a task 
 completes, attempts to commit its output, fails for some reason, then deletes 
 the destination file, tries again, and fails:
 {code}
  if (fs.isFile(taskOutput)) {
 152  Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, 
 153  

[jira] [Resolved] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5021.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4459
[https://github.com/apache/spark/pull/4459]

 GaussianMixtureEM should be faster for SparseVector input
 -

 Key: SPARK-5021
 URL: https://issues.apache.org/jira/browse/SPARK-5021
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar
 Fix For: 1.3.0


 GaussianMixtureEM currently converts everything to dense vectors.  It would 
 be nice if it were faster for SparseVectors (running in time linear in the 
 number of non-zero values).
 However, this may not be too important since clustering should rarely be done 
 in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-02-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315077#comment-14315077
 ] 

Josh Rosen commented on SPARK-4879:
---

This issue is _really_ hard to reproduce, but I managed to trigger the original 
bug as part of the testing for my patch.  Here's what I ran:

{code}
~/spark-1.3.0-SNAPSHOT-bin-1.0.4/bin/spark-shell --conf 
spark.speculation.multiplier=1 --conf spark.speculation.quantile=0.01 --conf 
spark.speculation=true --conf  
spark.hadoop.outputCommitCoordination.enabled=false
{code}

{code}
val numTasks = 100
val numTrials = 100
val outputPath = "/output-committer-bug-"
val sleepDuration = 1000

for (trial <- 0 to (numTrials - 1)) {
  val outputLocation = outputPath + trial
  sc.parallelize(1 to numTasks, numTasks).mapPartitionsWithContext { case (ctx, iter) =>
    if (ctx.partitionId % 5 == 0) {
      if (ctx.attemptNumber == 0) {  // If this is the first attempt, run slow
        Thread.sleep(sleepDuration)
      }
    }
    iter
  }.map(identity).saveAsTextFile(outputLocation)
  Thread.sleep(sleepDuration * 2)
  println("TESTING OUTPUT OF TRIAL " + trial)
  val savedData = sc.textFile(outputLocation).map(_.toInt).collect()
  if (savedData.toSet != (1 to numTasks).toSet) {
    println("MISSING: " + ((1 to numTasks).toSet -- savedData.toSet))
    assert(false)
  }
  println("-" * 80)
}
{code}

It took 22 runs until I actually observed missing output partitions (several of 
the earlier runs threw spurious exceptions and didn't have missing outputs):

{code}
[...]
15/02/10 22:17:21 INFO scheduler.DAGScheduler: Job 66 finished: saveAsTextFile 
at console:39, took 2.479592 s
15/02/10 22:17:21 WARN scheduler.TaskSetManager: Lost task 75.0 in stage 66.0 
(TID 6861, ip-172-31-1-124.us-west-2.compute.internal): java.io.IOException: 
Failed to save output of task: attempt_201502102217_0066_m_75_6861
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
at 
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
at 
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
at 
org.apache.spark.SparkHadoopWriter.performCommit$1(SparkHadoopWriter.scala:113)
at 
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:150)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/02/10 22:17:21 WARN scheduler.TaskSetManager: Lost task 80.0 in stage 66.0 
(TID 6866, ip-172-31-11-151.us-west-2.compute.internal): java.io.IOException: 
The temporary job-output directory 
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
 doesn't exist!
at 
org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
at 
org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 85.0 in stage 66.0 
(TID 6871) on executor ip-172-31-1-124.us-west-2.compute.internal: 
java.io.IOException (The temporary job-output directory 
hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary
 doesn't exist!) [duplicate 1]
15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 90.0 in stage 66.0 
(TID 6876) on executor ip-172-31-1-124.us-west-2.compute.internal: 
java.io.IOException (The temporary job-output directory 

[jira] [Commented] (SPARK-5724) misconfiguration in Akka system

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315135#comment-14315135
 ] 

Apache Spark commented on SPARK-5724:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/4512

 misconfiguration in Akka system
 ---

 Key: SPARK-5724
 URL: https://issues.apache.org/jira/browse/SPARK-5724
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Nan Zhu

 In AkkaUtil, we set several failure-detector-related parameters as follows:
 {code:title=AkkaUtil.scala|borderStyle=solid}
 val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
   .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
   s"""
   |akka.daemonic = on
   |akka.loggers = ["akka.event.slf4j.Slf4jLogger"]
   |akka.stdout-loglevel = ERROR
   |akka.jvm-exit-on-fatal-error = off
   |akka.remote.require-cookie = $requireCookie
   |akka.remote.secure-cookie = $secureCookie
   |akka.remote.transport-failure-detector.heartbeat-interval = 
 $akkaHeartBeatInterval s
   |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = 
 $akkaHeartBeatPauses s
   |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
   |akka.actor.provider = akka.remote.RemoteActorRefProvider
   |akka.remote.netty.tcp.transport-class = 
 akka.remote.transport.netty.NettyTransport
   |akka.remote.netty.tcp.hostname = $host
   |akka.remote.netty.tcp.port = $port
   |akka.remote.netty.tcp.tcp-nodelay = on
   |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
   |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
   |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
   |akka.actor.default-dispatcher.throughput = $akkaBatchSize
   |akka.log-config-on-start = $logAkkaConfig
   |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
   |akka.log-dead-letters = $lifecycleEvents
   |akka.log-dead-letters-during-shutdown = $lifecycleEvents
   """.stripMargin))
 {code}
 Actually, there is no parameter named 
 akka.remote.transport-failure-detector.threshold
 see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html
 What we have is akka.remote.watch-failure-detector.threshold.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4261) make right version info for beeline

2015-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4261:
-
Priority: Trivial  (was: Major)

 make right version info for beeline
 ---

 Key: SPARK-4261
 URL: https://issues.apache.org/jira/browse/SPARK-4261
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 1.1.0
Reporter: wangfei
Priority: Trivial

 Running the Spark SQL JDBC/ODBC client, the output is:
 JackydeMacBook-Pro:spark1 jackylee$ bin/beeline 
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Beeline version ??? by Apache Hive
 We should report the right version info for Beeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4261) make right version info for beeline

2015-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4261.
--
Resolution: Won't Fix

Per PR discussion, this is a cosmetic issue, and can't really be resolved in 
the context of a Spark assembly since the code is looking at the JAR's Manifest 
version, which will be Spark's at best.

 make right version info for beeline
 ---

 Key: SPARK-4261
 URL: https://issues.apache.org/jira/browse/SPARK-4261
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 1.1.0
Reporter: wangfei
Priority: Trivial

 Running the Spark SQL JDBC/ODBC client, the output is:
 JackydeMacBook-Pro:spark1 jackylee$ bin/beeline 
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Beeline version ??? by Apache Hive
 We should report the right version info for Beeline.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2015-02-10 Thread Twinkle Sachdeva (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Twinkle Sachdeva updated SPARK-4705:

Attachment: Screen Shot 2015-02-10 at 6.27.49 pm.png

UI-2

 Driver retries in yarn-cluster mode always fail if event logging is enabled
 ---

 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
 Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png


 yarn-cluster mode will retry to run the driver in certain failure modes. If 
 event logging is enabled, this will most probably fail, because:
 {noformat}
 Exception in thread Driver java.io.IOException: Log directory 
 hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
  already exists!
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.init(SparkContext.scala:353)
 {noformat}
 The event log path should be more unique. Or perhaps retries of the same app 
 should clean up the old logs first.
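
 A minimal sketch of the first suggestion, assuming the attempt id is known when 
 the log path is built (names are illustrative, not the actual patch):
 {code}
 // Make the event log directory unique per attempt so a retried driver does not
 // collide with the directory left behind by the previous attempt.
 def eventLogDir(baseDir: String, appId: String, attemptId: Option[String]): String = {
   val name = attemptId.map(a => s"$appId-$a").getOrElse(appId)
   s"$baseDir/$name"
 }

 // eventLogDir("hdfs:///user/spark/applicationHistory",
 //             "application_1417554558066_0003", Some("2"))
 // => hdfs:///user/spark/applicationHistory/application_1417554558066_0003-2
 {code}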



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5718) Change to native offset management for ReliableKafkaReceiver

2015-02-10 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-5718:
--

 Summary: Change to native offset management for 
ReliableKafkaReceiver
 Key: SPARK-5718
 URL: https://issues.apache.org/jira/browse/SPARK-5718
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao


Kafka 0.8.2 supports native offset management instead of ZK, which gives 
better performance. For now, ReliableKafkaReceiver relies on ZK to manage 
the offsets, which can potentially become a bottleneck if the injection rate is 
high (once per 200ms by default). So, in order to get better performance as 
well as to stay consistent with Kafka, add native offset management for 
ReliableKafkaReceiver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2015-02-10 Thread Twinkle Sachdeva (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Twinkle Sachdeva updated SPARK-4705:

Attachment: Screen Shot 2015-02-10 at 6.27.49 pm.png

UI - 2

 Driver retries in yarn-cluster mode always fail if event logging is enabled
 ---

 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
 Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, multi-attempts 
 with no attempt based UI.png


 yarn-cluster mode will retry to run the driver in certain failure modes. If 
 event logging is enabled, this will most probably fail, because:
 {noformat}
 Exception in thread Driver java.io.IOException: Log directory 
 hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
  already exists!
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.init(SparkContext.scala:353)
 {noformat}
 The event log path should be more unique. Or perhaps retries of the same app 
 should clean up the old logs first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2015-02-10 Thread Twinkle Sachdeva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14314140#comment-14314140
 ] 

Twinkle Sachdeva commented on SPARK-4705:
-

Hi,

So here is the final approach I have taken regarding UI.

If there is no application whose events are logged per attempt, the previous UI 
will continue to appear. As soon as there is at least one application whose 
events have been logged per attempt (even if there is only one attempt), the UI 
will change to the per-attempt UI (please see the attachment).

By logging per attempt, I mean the changed folder structure.

Please note that in the UI without attempt-specific entries, the anchor was on 
the application id value. In the new UI (UI - 2), the anchor will appear on the 
attempt ID.

Thanks,

 Driver retries in yarn-cluster mode always fail if event logging is enabled
 ---

 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
 Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, multi-attempts 
 with no attempt based UI.png


 yarn-cluster mode will retry to run the driver in certain failure modes. If 
 event logging is enabled, this will most probably fail, because:
 {noformat}
 Exception in thread Driver java.io.IOException: Log directory 
 hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
  already exists!
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.init(SparkContext.scala:353)
 {noformat}
 The event log path should be more unique. Or perhaps retries of the same app 
 should clean up the old logs first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2015-02-10 Thread Twinkle Sachdeva (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Twinkle Sachdeva updated SPARK-4705:

Attachment: (was: multi-attempts with no attempt based UI.png)

 Driver retries in yarn-cluster mode always fail if event logging is enabled
 ---

 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin

 yarn-cluster mode will retry to run the driver in certain failure modes. If 
 event logging is enabled, this will most probably fail, because:
 {noformat}
 Exception in thread Driver java.io.IOException: Log directory 
 hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
  already exists!
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.init(SparkContext.scala:353)
 {noformat}
 The event log path should be more unique. Or perhaps retries of the same app 
 should clean up the old logs first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2015-02-10 Thread Twinkle Sachdeva (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Twinkle Sachdeva updated SPARK-4705:

Attachment: (was: Screen Shot 2015-02-10 at 6.27.49 pm.png)

 Driver retries in yarn-cluster mode always fail if event logging is enabled
 ---

 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin
 Attachments: multi-attempts with no attempt based UI.png


 yarn-cluster mode will retry to run the driver in certain failure modes. If 
 event logging is enabled, this will most probably fail, because:
 {noformat}
 Exception in thread Driver java.io.IOException: Log directory 
 hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
  already exists!
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.init(SparkContext.scala:353)
 {noformat}
 The event log path should be more unique. Or perhaps retries of the same app 
 should clean up the old logs first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5726) Hadamard Vector Product Transformer

2015-02-10 Thread Octavian Geagla (JIRA)
Octavian Geagla created SPARK-5726:
--

 Summary: Hadamard Vector Product Transformer
 Key: SPARK-5726
 URL: https://issues.apache.org/jira/browse/SPARK-5726
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Octavian Geagla


I originally posted my idea here: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html

A draft of this feature is implemented, documented, and tested already.  Code 
is on a branch on my fork here: 
https://github.com/ogeagla/spark/compare/spark-mllib-weighting

I'm curious if there is any interest in this feature, in which case I'd 
appreciate some feedback.  One thing that might be useful is an example/test 
case using the transformer within the ML pipeline, since there are not any 
examples which use Vectors.
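
A rough sketch of a component-wise (Hadamard) product transformer, assuming the 
MLlib Vector types; this is not the code on the linked branch.
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

class ElementwiseProduct(val scalingVector: Vector) extends Serializable {
  def transform(v: Vector): Vector = {
    require(v.size == scalingVector.size, "vector sizes must match")
    val values = v.toArray.clone()
    var i = 0
    while (i < values.length) {
      values(i) *= scalingVector(i)   // component-wise multiply
      i += 1
    }
    Vectors.dense(values)
  }
}

// new ElementwiseProduct(Vectors.dense(0.0, 1.0, 2.0))
//   .transform(Vectors.dense(3.0, 4.0, 5.0))   // => [0.0, 4.0, 10.0]
{code}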



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5725) ParquetRelation2.equals throws when compared with non-Parquet relations

2015-02-10 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-5725:
-

 Summary: ParquetRelation2.equals throws when compared with 
non-Parquet relations
 Key: SPARK-5725
 URL: https://issues.apache.org/jira/browse/SPARK-5725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian


it's an apparent mistake, [forgot to return {{false}} in other 
cases|https://github.com/apache/spark/blob/5820961289eb98e45eb467efa316c7592b8d619c/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L150-L155].
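
The class of fix, sketched with simplified stand-in names (not the actual 
ParquetRelation2 code): a pattern-match based equals needs a catch-all case 
returning false, otherwise comparing against an unrelated type throws a MatchError.
{code}
class ParquetishRelation(val paths: Seq[String]) {
  override def equals(other: Any): Boolean = other match {
    case that: ParquetishRelation => paths == that.paths
    case _ => false  // the missing default case
  }
  override def hashCode(): Int = paths.hashCode()
}
{code}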



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5725) ParquetRelation2.equals throws when compared with non-Parquet relations

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315155#comment-14315155
 ] 

Apache Spark commented on SPARK-5725:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/4513

 ParquetRelation2.equals throws when compared with non-Parquet relations
 ---

 Key: SPARK-5725
 URL: https://issues.apache.org/jira/browse/SPARK-5725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 it's an apparent mistake, [forgot to return {{false}} in other 
 cases|https://github.com/apache/spark/blob/5820961289eb98e45eb467efa316c7592b8d619c/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L150-L155].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4682) Consolidate various 'Clock' classes

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315162#comment-14315162
 ] 

Apache Spark commented on SPARK-4682:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4514

 Consolidate various 'Clock' classes
 ---

 Key: SPARK-4682
 URL: https://issues.apache.org/jira/browse/SPARK-4682
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Streaming
Reporter: Josh Rosen

 Spark currently has at least four different {{Clock}} classes for mocking out 
 wall-clock time, most of which are nearly identical.  We should replace all 
 of these with one Clock class that lives in the utilities package.
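
 One possible shape of the consolidated class, as a sketch (names are 
 illustrative): a single Clock trait in the utilities package, with a real 
 implementation and a manually-advanced one for tests.
 {code}
 trait Clock {
   def getTimeMillis(): Long
 }

 class SystemClock extends Clock {
   override def getTimeMillis(): Long = System.currentTimeMillis()
 }

 class ManualClock(private var time: Long = 0L) extends Clock {
   override def getTimeMillis(): Long = synchronized { time }
   def advance(ms: Long): Unit = synchronized { time += ms }   // for tests
 }
 {code}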



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5644) Delete tmp dir when sc is stop

2015-02-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5644.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4412
[https://github.com/apache/spark/pull/4412]

 Delete tmp dir when sc is stop
 --

 Key: SPARK-5644
 URL: https://issues.apache.org/jira/browse/SPARK-5644
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Weizhong
Priority: Minor
 Fix For: 1.4.0


 We run the driver as a service that does not stop. In this service process 
 we create a SparkContext, run a job, and then stop the context. Because we 
 only call sc.stop but do not exit the service process, the tmp dirs created 
 by HttpFileServer and SparkEnv are not deleted after the SparkContext is 
 stopped. This leads to too many tmp dirs being created if we create many 
 SparkContexts to run jobs in this service process.
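
 A hedged sketch of the idea (helper names are illustrative, not the merged 
 patch): track the temporary directories the context creates and delete them 
 from stop(), instead of relying only on JVM shutdown hooks.
 {code}
 import java.io.File
 import scala.collection.mutable.ArrayBuffer

 def deleteRecursively(f: File): Unit = {
   if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
   f.delete()
 }

 class TempDirTracker {
   private val dirs = ArrayBuffer[File]()
   def register(dir: File): Unit = synchronized { dirs += dir }
   def cleanUp(): Unit = synchronized {   // call this from sc.stop()
     dirs.foreach(deleteRecursively)
     dirs.clear()
   }
 }
 {code}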



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5727) Deprecate, remove Debian packaging

2015-02-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315294#comment-14315294
 ] 

Apache Spark commented on SPARK-5727:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4516

 Deprecate, remove Debian packaging
 --

 Key: SPARK-5727
 URL: https://issues.apache.org/jira/browse/SPARK-5727
 Project: Spark
  Issue Type: Task
  Components: Build, Deploy
Affects Versions: 1.2.1
Reporter: Sean Owen
Assignee: Sean Owen

 Per discussion on the mailing list 
 (https://www.mail-archive.com/dev@spark.apache.org/msg07598.html), this JIRA 
 proposes:
 - For 1.3.x, deprecate the Debian packaging (the {{deb}} profile) by adding a 
 warning message of some kind when invoking the profile
 - For 1.4.x, remove the packaging
 Two PRs coming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5081) Shuffle write increases

2015-02-10 Thread Kevin Jung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308620#comment-14308620
 ] 

Kevin Jung edited comment on SPARK-5081 at 2/11/15 12:56 AM:
-

To test under the same conditions, I set this to snappy for all Spark versions, 
but the problem still occurs. AFAIK, lz4 needs more CPU time than snappy but it 
has a better compression ratio.


was (Author: kallsu):
To test under the same condition, I set this to snappy for all spark version 
but this problem occurs. AFA I know, lz4 needs more CPU time than snappy but it 
has better compression ratio.

 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung

 The shuffle write size shown in the Spark web UI is very different when I 
 execute the same Spark job with the same input data on Spark 1.1 and Spark 1.2. 
 At the sortBy stage, the shuffle write size is 98.1MB in Spark 1.1 but 146.9MB 
 in Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 It can increase disk I/O overhead exponentially as the input file gets bigger 
 and causes the jobs to take more time to complete. 
 With about 100GB of input, for example, the shuffle write size is 
 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
 spark 1.1
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |9|saveAsTextFile| |1169.4KB| |
 |12|combineByKey| |1265.4KB|1275.0KB|
 |6|sortByKey| |1276.5KB| |
 |8|mapPartitions| |91.0MB|1383.1KB|
 |4|apply| |89.4MB| |
 |5|sortBy|155.6MB| |98.1MB|
 |3|sortBy|155.6MB| | |
 |1|collect| |2.1MB| |
 |2|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
 spark 1.2
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |12|saveAsTextFile| |1170.2KB| |
 |11|combineByKey| |1264.5KB|1275.0KB|
 |8|sortByKey| |1273.6KB| |
 |7|mapPartitions| |134.5MB|1383.1KB|
 |5|zipWithIndex| |132.5MB| |
 |4|sortBy|155.6MB| |146.9MB|
 |3|sortBy|155.6MB| | |
 |2|collect| |2.0MB| |
 |1|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster

2015-02-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-5243:
--
Description: 
Spark will hang if calling spark-submit under the conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens during development for beginners.
There should be some exit mechanism or at least a warning message in the output 
of the spark-submit.

I would like to know your opinions about if a fix is needed and better fix 
options.



  was:
Spark will hang if calling spark-submit under the conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens during development for beginners.
There should be some exit mechanism or at least a warning message in the output 
of the spark-submit.

I am preparing PR for the case. And I would like to know your opinions about if 
a fix is needed and better fix options.




 Spark will hang if (driver memory + executor memory) exceeds limit on a 
 1-worker cluster
 

 Key: SPARK-5243
 URL: https://issues.apache.org/jira/browse/SPARK-5243
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor

 Spark will hang if calling spark-submit under the conditions:
 1. the cluster has only one worker.
 2. driver memory + executor memory > worker memory
 3. deploy-mode = cluster
 This usually happens during development for beginners.
 There should be some exit mechanism or at least a warning message in the 
 output of the spark-submit.
 I would like to know your opinions about if a fix is needed and better fix 
 options.
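
 A minimal sketch of the kind of up-front check being suggested (names and 
 numbers are hypothetical, not the actual Master code):
 {code}
 def canEverSchedule(driverMemMb: Int, executorMemMb: Int, largestWorkerMemMb: Int): Boolean =
   driverMemMb + executorMemMb <= largestWorkerMemMb

 val (driverMem, executorMem, workerMem) = (2048, 4096, 4096)
 if (!canEverSchedule(driverMem, executorMem, workerMem)) {
   println(s"WARNING: requested ${driverMem + executorMem} MB exceeds the largest " +
     s"worker ($workerMem MB); the application can never be scheduled.")
 }
 {code}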



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle

2015-02-10 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated SPARK-5682:

Comment: was deleted

(was: encrypted_shuffle.patch.4 is how to reuse hadoop encrypted class to 
enable spark encrypted shuffle.
How to use:
patch -p1 < encrypted_shuffle.patch.4)

 Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
 --

 Key: SPARK-5682
 URL: https://issues.apache.org/jira/browse/SPARK-5682
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Reporter: liyunzhang_intel
 Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx


 Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data 
 transfer safer. This feature is necessary in Spark. We reuse the Hadoop 
 encrypted shuffle feature in Spark, and because UGI credential info is 
 necessary for encrypted shuffle, we first enable encrypted shuffle on the 
 Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster

2015-02-10 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-5243:
--
Description: 
Spark will hang if calling spark-submit under the conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens during development for beginners.
There should be some exit mechanism or at least a warning message in the output 
of the spark-submit.

I would like to know your opinions about if a fix is needed (is this by 
design?) and better fix options.



  was:
Spark will hang if calling spark-submit under the conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens during development for beginners.
There should be some exit mechanism or at least a warning message in the output 
of the spark-submit.

I would like to know your opinions about if a fix is needed and better fix 
options.




 Spark will hang if (driver memory + executor memory) exceeds limit on a 
 1-worker cluster
 

 Key: SPARK-5243
 URL: https://issues.apache.org/jira/browse/SPARK-5243
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor

 Spark will hang if calling spark-submit under the conditions:
 1. the cluster has only one worker.
 2. driver memory + executor memory > worker memory
 3. deploy-mode = cluster
 This usually happens during development for beginners.
 There should be some exit mechanism or at least a warning message in the 
 output of the spark-submit.
 I would like to know your opinions about if a fix is needed (is this by 
 design?) and better fix options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5725) ParquetRelation2.equals throws when compared with non-Parquet relations

2015-02-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-5725.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4513
[https://github.com/apache/spark/pull/4513]

 ParquetRelation2.equals throws when compared with non-Parquet relations
 ---

 Key: SPARK-5725
 URL: https://issues.apache.org/jira/browse/SPARK-5725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.3.0


 it's an apparent mistake, [forgot to return {{false}} in other 
 cases|https://github.com/apache/spark/blob/5820961289eb98e45eb467efa316c7592b8d619c/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L150-L155].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5718) Add native offset management for ReliableKafkaReceiver

2015-02-10 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-5718:
---
Summary: Add native offset management for ReliableKafkaReceiver  (was: 
Change to native offset management for ReliableKafkaReceiver)

 Add native offset management for ReliableKafkaReceiver
 --

 Key: SPARK-5718
 URL: https://issues.apache.org/jira/browse/SPARK-5718
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao

 Kafka 0.8.2 supports native offset management instead of ZK, which gives 
 better performance. For now, ReliableKafkaReceiver relies on ZK to manage 
 the offsets, which can potentially become a bottleneck if the injection rate 
 is high (once per 200ms by default). So, in order to get better performance 
 as well as to stay consistent with Kafka, add native offset management for 
 ReliableKafkaReceiver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


