[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3288:
---
Labels: starter  (was: )

> All fields in TaskMetrics should be private and use getters/setters
> ---
>
> Key: SPARK-3288
> URL: https://issues.apache.org/jira/browse/SPARK-3288
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Patrick Wendell
>  Labels: starter
>
> This is particularly bad because we expose this as a developer API. 
> Technically a library could create a TaskMetrics object and then change the 
> values inside of it and pass it onto someone else. It can be written pretty 
> compactly like below:
> {code}
>   /**
>    * Number of bytes written for the shuffle by this task
>    */
>   @volatile private var _shuffleBytesWritten: Long = _
>   def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
>   def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
>   def shuffleBytesWritten = _shuffleBytesWritten
> {code}
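For concreteness, here is a minimal, self-contained sketch of the same pattern applied to another counter (the class and field names are illustrative, not the actual TaskMetrics members):

{code}
// Illustrative only; not the actual TaskMetrics source.
class ExampleMetrics {
  @volatile private var _diskBytesSpilled: Long = _

  def diskBytesSpilled: Long = _diskBytesSpilled
  def incrementDiskBytesSpilled(value: Long): Unit = _diskBytesSpilled += value
  def decrementDiskBytesSpilled(value: Long): Unit = _diskBytesSpilled -= value
}
{code}

Callers can still read each metric, but can no longer assign to it from outside the class.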






[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3288:
---
Assignee: (was: Andrew Or)

> All fields in TaskMetrics should be private and use getters/setters
> ---
>
> Key: SPARK-3288
> URL: https://issues.apache.org/jira/browse/SPARK-3288
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Patrick Wendell
>  Labels: starter
>
> This is particularly bad because we expose this as a developer API. 
> Technically a library could create a TaskMetrics object and then change the 
> values inside of it and pass it onto someone else. It can be written pretty 
> compactly like below:
> {code}
>   /**
>    * Number of bytes written for the shuffle by this task
>    */
>   @volatile private var _shuffleBytesWritten: Long = _
>   def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
>   def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
>   def shuffleBytesWritten = _shuffleBytesWritten
> {code}






[jira] [Resolved] (SPARK-3576) Provide script for creating the Spark AMI from scratch

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3576.

Resolution: Fixed

This was fixed in spark-ec2 itself

> Provide script for creating the Spark AMI from scratch
> --
>
> Key: SPARK-3576
> URL: https://issues.apache.org/jira/browse/SPARK-3576
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>







[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147465#comment-14147465
 ] 

Patrick Wendell commented on SPARK-3687:


Can you perform a jstack on the executor when it is hanging? We usually only 
file JIRA issues like this once the specific problem has been debugged a bit 
more, but if you can produce a jstack of the hung executor we can keep it open 
:)

> Spark hang while processing more than 100 sequence files
> 
>
> Key: SPARK-3687
> URL: https://issues.apache.org/jira/browse/SPARK-3687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Ziv Huang
>
> In my application, I read more than 100 sequence files into a JavaPairRDD, 
> perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
> result.
> Quite often (but not always), Spark hangs while executing some of the 
> 110th-130th tasks.
> The job can hang for several hours, maybe forever (I can't wait for its 
> completion).
> When the Spark job hangs, I can't find any error message anywhere, and I 
> can't kill the job from the web UI.
> The current workaround is to use coalesce to reduce the number of partitions 
> to be processed.
> The job never hangs if the number of partitions to be processed is no 
> greater than 80.
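For illustration, a rough Scala sketch of the reported pipeline and the coalesce workaround (the path, value type, and flatMap body are assumptions, not the reporter's actual code):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object SequenceFileHangRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seqfile-hang-repro"))
    // Reading 100+ sequence files yields one task per input split.
    val pairs = sc.sequenceFile[String, String]("hdfs:///data/seqfiles/*")
    val top = pairs
      .coalesce(80)                                   // workaround: keep partitions <= 80
      .flatMap { case (_, value) => value.split("\\s+") }
      .takeOrdered(10)
    top.foreach(println)
    sc.stop()
  }
}
{code}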






[jira] [Resolved] (SPARK-2778) Add unit tests for Yarn integration

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2778.

   Resolution: Fixed
Fix Version/s: 1.2.0

Fixed by:
https://github.com/apache/spark/pull/2257

> Add unit tests for Yarn integration
> ---
>
> Key: SPARK-2778
> URL: https://issues.apache.org/jira/browse/SPARK-2778
> Project: Spark
>  Issue Type: Test
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.2.0
>
>
> It would be nice to add some Yarn integration tests to the unit tests in 
> Spark; Yarn provides a "MiniYARNCluster" class that can be used to spawn a 
> cluster locally.
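For reference, a minimal sketch of how MiniYARNCluster is typically brought up in a test, assuming the hadoop-yarn-server-tests test artifact is on the classpath (this is not the eventual Spark test code):

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.server.MiniYARNCluster

val yarnConf = new YarnConfiguration()
// One NodeManager, with one local dir and one log dir.
val cluster = new MiniYARNCluster("spark-yarn-suite", 1, 1, 1)
cluster.init(yarnConf)
cluster.start()
try {
  // The populated config carries the RM address a Spark application would submit to.
  println(cluster.getConfig.get(YarnConfiguration.RM_ADDRESS))
} finally {
  cluster.stop()
}
{code}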






[jira] [Commented] (SPARK-2647) DAGScheduler plugs others when processing one JobSubmitted event

2014-09-24 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147378#comment-14147378
 ] 

Nan Zhu commented on SPARK-2647:


Isn't this the expected behaviour, given that we keep the DAGScheduler single-threaded?
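For context, a hypothetical workload sketch (none of this is the reporter's code): two independent jobs submitted from separate threads both become JobSubmitted events on the single DAGScheduler event loop, so, per the report, the second job waits while partitions for the first are computed (which can involve many HDFS calls, as in the stack trace below).

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dagscheduler-serialization-demo"))

// Computing this RDD's partitions requires many namenode calls.
val slowInput  = sc.textFile("hdfs:///warehouse/with/many/files/*")
val quickInput = sc.parallelize(1 to 1000, 4)

val slowJob = new Thread(new Runnable {
  def run(): Unit = println("slow count: " + slowInput.count())
})
val quickJob = new Thread(new Runnable {
  def run(): Unit = println("quick count: " + quickInput.count())
})
slowJob.start(); quickJob.start()
slowJob.join(); quickJob.join()
{code}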

> DAGScheduler plugs others when processing one JobSubmitted event
> 
>
> Key: SPARK-2647
> URL: https://issues.apache.org/jira/browse/SPARK-2647
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: YanTang Zhai
>
> If a few jobs are submitted, the DAGScheduler blocks the others while 
> processing one JobSubmitted event.
> For example, one JobSubmitted event is processed as follows and takes a long time:
> "spark-akka.actor.default-dispatcher-67" daemon prio=10 
> tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   at java.lang.Object.wait(Object.java:503)
>   at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130)
>   - locked <0x000783b17330> (a org.apache.hadoopcdh3.ipc.Client$Call)
>   at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241)
>   at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
>   at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
>   at 
> org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60)
>   at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
>   at 
> org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472)
>   at 
> org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204)
>   at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812)
>   at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233)
>   at 
> StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartition

[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Description: 
In my application, I read more than 100 sequence files into a JavaPairRDD, 
perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
result.
Quite often (but not always), Spark hangs while executing some of the 
110th-130th tasks.
The job can hang for several hours, maybe forever (I can't wait for its 
completion).
When the Spark job hangs, I can't find any error message anywhere, and I 
can't kill the job from the web UI.

The current workaround is to use coalesce to reduce the number of partitions 
to be processed.
The job never hangs if the number of partitions to be processed is no greater 
than 80.

  was:
In my application, I read more than 100 sequence files to a JavaPairRDD, 
perform flatmap to get another JavaRDD, and then use takeOrdered to get the 
result.
It is quite often (but not always) that the spark hangs while the executing 
some of 110th-130th tasks.
The job can hang for several hours, maybe forever (I can't wait for its 
completion).
When the spark job hangs, I can't find any error message in anywhere, and I 
can't kill the job from web UI.

The current workaround is to use coalesce to reduce the number of partitions to 
be processed.
I never get job hanged if the number of partitions to be processed is no 
greater than 80.


> Spark hang while processing more than 100 sequence files
> 
>
> Key: SPARK-3687
> URL: https://issues.apache.org/jira/browse/SPARK-3687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Ziv Huang
>
> In my application, I read more than 100 sequence files into a JavaPairRDD, 
> perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
> result.
> Quite often (but not always), Spark hangs while executing some of the 
> 110th-130th tasks.
> The job can hang for several hours, maybe forever (I can't wait for its 
> completion).
> When the Spark job hangs, I can't find any error message anywhere, and I 
> can't kill the job from the web UI.
> The current workaround is to use coalesce to reduce the number of partitions 
> to be processed.
> The job never hangs if the number of partitions to be processed is no 
> greater than 80.






[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Description: 
In my application, I read more than 100 sequence files into a JavaPairRDD, 
perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
result.
Quite often (but not always), Spark hangs while executing some of the 
110th-130th tasks.
The job can hang for several hours, maybe forever (I can't wait for its 
completion).
When the Spark job hangs, I can't find any error message anywhere, and I 
can't kill the job from the web UI.

The current workaround is to use coalesce to reduce the number of partitions 
to be processed.
The job never hangs if the number of partitions to be processed is no greater 
than 80.

  was:In my application, I read more than 100 sequence files to a JavaPairRDD, 
perform flatmap to get another JavaRDD, and then use takeOrdered


> Spark hang while processing more than 100 sequence files
> 
>
> Key: SPARK-3687
> URL: https://issues.apache.org/jira/browse/SPARK-3687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Ziv Huang
>
> In my application, I read more than 100 sequence files into a JavaPairRDD, 
> perform a flatMap to get another JavaRDD, and then use takeOrdered to get the 
> result.
> Quite often (but not always), Spark hangs while executing some of the 
> 110th-130th tasks.
> The job can hang for several hours, maybe forever (I can't wait for its 
> completion).
> When the Spark job hangs, I can't find any error message anywhere, and I 
> can't kill the job from the web UI.
> The current workaround is to use coalesce to reduce the number of partitions 
> to be processed.
> The job never hangs if the number of partitions to be processed is no 
> greater than 80.






[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Description: In my application, I read more than 100 sequence files to a 
JavaPairRDD, perform flatmap to get another JavaRDD, and then use takeOrdered  
(was: In my application, I read more than 100 sequence files, )

> Spark hang while processing more than 100 sequence files
> 
>
> Key: SPARK-3687
> URL: https://issues.apache.org/jira/browse/SPARK-3687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Ziv Huang
>
> In my application, I read more than 100 sequence files to a JavaPairRDD, 
> perform flatmap to get another JavaRDD, and then use takeOrdered






[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Description: In my application, I read more than 100 sequence files,   
(was: I use spark )

> Spark hang while processing more than 100 sequence files
> 
>
> Key: SPARK-3687
> URL: https://issues.apache.org/jira/browse/SPARK-3687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Ziv Huang
>
> In my application, I read more than 100 sequence files, 






[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Description: I use spark 

> Spark hang while processing more than 100 sequence files
> 
>
> Key: SPARK-3687
> URL: https://issues.apache.org/jira/browse/SPARK-3687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Ziv Huang
>
> I use spark 






[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Component/s: Spark Core

> Spark hang while processing more than 100 sequence files
> 
>
> Key: SPARK-3687
> URL: https://issues.apache.org/jira/browse/SPARK-3687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Ziv Huang
>







[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Affects Version/s: 1.0.2
   1.1.0

> Spark hang while processing more than 100 sequence files
> 
>
> Key: SPARK-3687
> URL: https://issues.apache.org/jira/browse/SPARK-3687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Ziv Huang
>







[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files

2014-09-24 Thread Ziv Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ziv Huang updated SPARK-3687:
-
Summary: Spark hang while processing more than 100 sequence files  (was: 
Spark hang while )

> Spark hang while processing more than 100 sequence files
> 
>
> Key: SPARK-3687
> URL: https://issues.apache.org/jira/browse/SPARK-3687
> Project: Spark
>  Issue Type: Bug
>Reporter: Ziv Huang
>







[jira] [Created] (SPARK-3687) Spark hang while

2014-09-24 Thread Ziv Huang (JIRA)
Ziv Huang created SPARK-3687:


 Summary: Spark hang while 
 Key: SPARK-3687
 URL: https://issues.apache.org/jira/browse/SPARK-3687
 Project: Spark
  Issue Type: Bug
Reporter: Ziv Huang









[jira] [Commented] (SPARK-3686) flume.SparkSinkSuite.Success is flaky

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147331#comment-14147331
 ] 

Apache Spark commented on SPARK-3686:
-

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/2531

> flume.SparkSinkSuite.Success is flaky
> -
>
> Key: SPARK-3686
> URL: https://issues.apache.org/jira/browse/SPARK-3686
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>Assignee: Hari Shreedharan
>Priority: Blocker
>
> {code}
> Error Message
> 4000 did not equal 5000
> Stacktrace
> sbt.ForkMain$ForkError: 4000 did not equal 5000
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
>   at org.scalatest.Suite$class.run(Suite.scala:1423)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
>   at org.scalatest.FunSuite.run(FunSuite.scala:1559)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Example test result (this will stop working in a few days):
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/




[jira] [Resolved] (SPARK-546) Support full outer join and multiple join in a single shuffle

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-546.
---
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Aaron Staple

Fixed by:
https://github.com/apache/spark/pull/1395

> Support full outer join and multiple join in a single shuffle
> -
>
> Key: SPARK-546
> URL: https://issues.apache.org/jira/browse/SPARK-546
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Reynold Xin
>Assignee: Aaron Staple
> Fix For: 1.2.0
>
>
> RDD[(K,V)] now supports left/right outer join but not full outer join.
> Also it'd be nice to provide a way for users to join multiple RDDs on the 
> same key in a single shuffle.
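For illustration, a hedged sketch of how both requests can be approximated today with cogroup (sample data only, run e.g. in spark-shell where sc is provided; this is not the implementation from the linked PR):

{code}
import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.x

val a = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
val b = sc.parallelize(Seq(2 -> "x", 3 -> "y"))
val c = sc.parallelize(Seq(3 -> "z"))

// Several RDDs joined on the same key in one shuffle:
val grouped = a.cogroup(b, c)   // RDD[(Int, (Iterable[String], Iterable[String], Iterable[String]))]

// A full outer join emulated on top of a two-way cogroup:
val fullOuter = a.cogroup(b).flatMapValues { case (ls, rs) =>
  if (ls.isEmpty)      rs.map(r => (None: Option[String], Some(r): Option[String]))
  else if (rs.isEmpty) ls.map(l => (Some(l): Option[String], None: Option[String]))
  else                 for (l <- ls; r <- rs) yield (Some(l): Option[String], Some(r): Option[String])
}
{code}

A first-class fullOuterJoin, the subject of this issue, saves users from building these option pairs by hand.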






[jira] [Commented] (SPARK-3686) flume.SparkSinkSuite.Success is flaky

2014-09-24 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147316#comment-14147316
 ] 

Hari Shreedharan commented on SPARK-3686:
-

Unlike the other tests in this suite, this one does not have a sleep to let the 
sink commit the transactions back to the channel, so the channel does not get 
enough time to actually become empty. Let me add a sleep - I'll send a PR and 
run the pre-commit hook a bunch of times to ensure that it fixes it.
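A minimal sketch of the shape such a fix could take (the channel reference and the sleep duration are assumptions; assertChannelIsEmpty is the suite's existing helper visible in the stack trace below):

{code}
// Give the sink time to commit its transactions back to the channel
// before asserting that the channel has been drained.
Thread.sleep(500)
assertChannelIsEmpty(channel)
{code}

Whether a fixed sleep or a polling wait is used, the point is the same: the assertion must not run before the commit has reached the channel.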

> flume.SparkSinkSuite.Success is flaky
> -
>
> Key: SPARK-3686
> URL: https://issues.apache.org/jira/browse/SPARK-3686
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>Assignee: Hari Shreedharan
>Priority: Blocker
>
> {code}
> Error Message
> 4000 did not equal 5000
> Stacktrace
> sbt.ForkMain$ForkError: 4000 did not equal 5000
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
>   at org.scalatest.Suite$class.run(Suite.scala:1423)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
>   at org.scalatest.FunSuite.run(FunSuite.scala:1559)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Example test result (this will stop working in a 

[jira] [Commented] (SPARK-3686) flume.SparkSinkSuite.Success is flaky

2014-09-24 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147314#comment-14147314
 ] 

Hari Shreedharan commented on SPARK-3686:
-

Looking into this.

> flume.SparkSinkSuite.Success is flaky
> -
>
> Key: SPARK-3686
> URL: https://issues.apache.org/jira/browse/SPARK-3686
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>Assignee: Hari Shreedharan
>Priority: Blocker
>
> {code}
> Error Message
> 4000 did not equal 5000
> Stacktrace
> sbt.ForkMain$ForkError: 4000 did not equal 5000
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
>   at 
> org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
>   at org.scalatest.Suite$class.run(Suite.scala:1423)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
>   at org.scalatest.FunSuite.run(FunSuite.scala:1559)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Example test result (this will stop working in a few days):
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/





[jira] [Commented] (SPARK-3666) Extract interfaces for EdgeRDD and VertexRDD

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147280#comment-14147280
 ] 

Apache Spark commented on SPARK-3666:
-

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/2530

> Extract interfaces for EdgeRDD and VertexRDD
> 
>
> Key: SPARK-3666
> URL: https://issues.apache.org/jira/browse/SPARK-3666
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>







[jira] [Updated] (SPARK-3665) Java API for GraphX

2014-09-24 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3665:
--
Description: 
The Java API will wrap the Scala API in a similar manner as JavaRDD. Components 
will include:
# JavaGraph
#- removes optional param from persist, subgraph, mapReduceTriplets, 
Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
#- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
#- merges multiple parameters lists
#- incorporates GraphOps
# JavaVertexRDD
# JavaEdgeRDD
# JavaGraphLoader
#- removes optional params, or uses builder pattern

  was:
The Java API will wrap the Scala API in a similar manner as JavaRDD. Components 
will include:
1. JavaGraph
-- removes optional param from persist, subgraph, mapReduceTriplets, 
Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
-- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
-- merges multiple parameters lists
-- incorporates GraphOps
2. JavaVertexRDD
3. JavaEdgeRDD
4. JavaGraphLoader
-- removes optional params, or uses builder pattern


> Java API for GraphX
> ---
>
> Key: SPARK-3665
> URL: https://issues.apache.org/jira/browse/SPARK-3665
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, Java API
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> The Java API will wrap the Scala API in a similar manner as JavaRDD. 
> Components will include:
> # JavaGraph
> #- removes optional param from persist, subgraph, mapReduceTriplets, 
> Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
> #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
> #- merges multiple parameters lists
> #- incorporates GraphOps
> # JavaVertexRDD
> # JavaEdgeRDD
> # JavaGraphLoader
> #- removes optional params, or uses builder pattern






[jira] [Comment Edited] (SPARK-3610) History server log name should not be based on user input

2014-09-24 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147274#comment-14147274
 ] 

Kousuke Saruta edited comment on SPARK-3610 at 9/25/14 2:35 AM:


Hi [~SK], I'm trying to resolve a similar issue, and I think I can resolve this 
one using the Application ID.
See https://github.com/apache/spark/pull/2432



was (Author: sarutak):
Hi [~skrishna...@gmail.com], I'm trying to resolve similar issue and I think I 
can resolve this issue using Application ID.
See https://github.com/apache/spark/pull/2432


> History server log name should not be based on user input
> -
>
> Key: SPARK-3610
> URL: https://issues.apache.org/jira/browse/SPARK-3610
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: SK
>Priority: Critical
>
> Right now we use the user-defined application name when creating the logging 
> file for the history server. We should use some type of GUID generated from 
> inside of Spark instead of allowing user input here. It can cause errors if 
> users provide characters that are not valid in filesystem paths.
> Original bug report:
> {quote}
> The default log files for the MLlib examples use a rather long naming 
> convention that includes special characters like parentheses and commas. For 
> example, one of my log files is named 
> "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032".
> When I click on the program on the history server page (at port 18080) to 
> view the detailed application logs, the history server crashes and I need to 
> restart it. I am using Spark 1.1 on a Mesos cluster.
> I renamed the log file by removing the special characters and then it loaded 
> up correctly. I am not sure which program is creating the log files. Can it 
> be changed so that the default log file naming convention does not include 
> special characters?
> {quote}
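A tiny, hypothetical sketch of the direction suggested in the comment above (names and format are illustrative; the linked PR uses the application ID rather than this exact scheme):

{code}
import java.util.UUID

// Build the event log file name from a Spark-generated identifier instead of
// the raw, user-supplied application name (which may contain '(', ')' or ',').
def eventLogFileName(appName: String): String = {
  val sanitized = appName.toLowerCase.replaceAll("[^a-z0-9._-]", "-")
  s"$sanitized-${UUID.randomUUID().toString.take(8)}"
}
{code}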






[jira] [Commented] (SPARK-3610) History server log name should not be based on user input

2014-09-24 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147274#comment-14147274
 ] 

Kousuke Saruta commented on SPARK-3610:
---

Hi [~skrishna...@gmail.com], I'm trying to resolve a similar issue, and I think 
I can resolve this one using the Application ID.
See https://github.com/apache/spark/pull/2432


> History server log name should not be based on user input
> -
>
> Key: SPARK-3610
> URL: https://issues.apache.org/jira/browse/SPARK-3610
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: SK
>Priority: Critical
>
> Right now we use the user-defined application name when creating the logging 
> file for the history server. We should use some type of GUID generated from 
> inside of Spark instead of allowing user input here. It can cause errors if 
> users provide characters that are not valid in filesystem paths.
> Original bug report:
> {quote}
> The default log files for the MLlib examples use a rather long naming 
> convention that includes special characters like parentheses and commas. For 
> example, one of my log files is named 
> "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032".
> When I click on the program on the history server page (at port 18080) to 
> view the detailed application logs, the history server crashes and I need to 
> restart it. I am using Spark 1.1 on a Mesos cluster.
> I renamed the log file by removing the special characters and then it loaded 
> up correctly. I am not sure which program is creating the log files. Can it 
> be changed so that the default log file naming convention does not include 
> special characters?
> {quote}






[jira] [Commented] (SPARK-3412) Add Missing Types for Row API

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147261#comment-14147261
 ] 

Apache Spark commented on SPARK-3412:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/2529

> Add Missing Types for Row API
> -
>
> Key: SPARK-3412
> URL: https://issues.apache.org/jira/browse/SPARK-3412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
>







[jira] [Commented] (SPARK-3685) Spark's local dir scheme is not configurable

2014-09-24 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147206#comment-14147206
 ] 

Andrew Or commented on SPARK-3685:
--

Note that this is not meaningful unless we also change the usages of this to 
use the Hadoop FileSystem. This requires a non-trivial refactor of shuffle and 
spill code to use the Hadoop API.

> Spark's local dir scheme is not configurable
> 
>
> Key: SPARK-3685
> URL: https://issues.apache.org/jira/browse/SPARK-3685
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it 
> will try to do is create a folder called "hdfs:" and put "tmp" inside it. 
> This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
> of Hadoop's file system to parse this path. We also need to resolve the path 
> appropriately.
> This may not have an urgent use case, but it fails silently and does what is 
> least expected.






[jira] [Created] (SPARK-3686) flume.SparkSinkSuite.Success is flaky

2014-09-24 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3686:
--

 Summary: flume.SparkSinkSuite.Success is flaky
 Key: SPARK-3686
 URL: https://issues.apache.org/jira/browse/SPARK-3686
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Hari Shreedharan
Priority: Blocker


{code}
Error Message

4000 did not equal 5000
Stacktrace

sbt.ForkMain$ForkError: 4000 did not equal 5000
at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416)
at 
org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195)
at 
org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54)
at 
org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
at 
org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
at org.scalatest.Suite$class.run(Suite.scala:1423)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
at org.scalatest.FunSuite.run(FunSuite.scala:1559)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651)
at sbt.ForkMain$Run$2.call(ForkMain.java:294)
at sbt.ForkMain$Run$2.call(ForkMain.java:284)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Example test result (this will stop working in a few days):
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/






[jira] [Commented] (SPARK-3476) Yarn ClientBase.validateArgs memory checks wrong

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147180#comment-14147180
 ] 

Apache Spark commented on SPARK-3476:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2528

> Yarn ClientBase.validateArgs memory checks wrong
> 
>
> Key: SPARK-3476
> URL: https://issues.apache.org/jira/browse/SPARK-3476
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>
> The YARN ClientBase.validateArgs memory checks are no longer valid. It used 
> to be that the overhead was taken out of what the user specified; now we add 
> it on top of what the user specifies. We can probably just remove these:
> (args.amMemory <= memoryOverhead) -> ("Error: AM memory size must be" +
>   "greater than: " + memoryOverhead),
> (args.executorMemory <= memoryOverhead) -> ("Error: Executor memory size" +
>   "must be greater than: " + memoryOverhead.toString)






[jira] [Created] (SPARK-3685) Spark's local dir scheme is not configurable

2014-09-24 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3685:


 Summary: Spark's local dir scheme is not configurable
 Key: SPARK-3685
 URL: https://issues.apache.org/jira/browse/SPARK-3685
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or


When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it will 
try to do is create a folder called "hdfs:" and put "tmp" inside it. This is 
because in Util#getOrCreateLocalRootDirs we use java.io.File instead of 
Hadoop's file system to parse this path. We also need to resolve the path 
appropriately.

This may not have an urgent use case, but it fails silently and does what is 
least expected.
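A rough sketch of the difference being described, for illustration only (Util#getOrCreateLocalRootDirs is not reproduced here, and the Hadoop call assumes an HDFS configuration is available on the classpath):

{code}
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val dir = "hdfs:/tmp/foo"

// Today's behaviour: java.io.File treats the scheme as a plain directory name,
// silently creating a local directory literally called "hdfs:" with tmp/foo inside.
new File(dir).mkdirs()

// Scheme-aware alternative: let Hadoop resolve the path against the right FileSystem.
val path = new Path(dir)
val fs: FileSystem = path.getFileSystem(new Configuration())
fs.mkdirs(path)   // creates /tmp/foo on HDFS (or wherever the scheme points)
{code}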






[jira] [Resolved] (SPARK-3615) Kafka test should not hard code Zookeeper port

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3615.

Resolution: Fixed

https://github.com/apache/spark/pull/2483

> Kafka test should not hard code Zookeeper port
> --
>
> Key: SPARK-3615
> URL: https://issues.apache.org/jira/browse/SPARK-3615
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>Assignee: Saisai Shao
>Priority: Blocker
>
> This is causing failures in our master build if port 2181 is contended. 
> Instead of binding to a static port, we should refactor this so that it 
> opens a socket on port 0 and then reads back the port, so we can never have 
> contention.
> {code}
> sbt.ForkMain$ForkError: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:444)
>   at sun.nio.ch.Net.bind(Net.java:436)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:67)
>   at 
> org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95)
>   at 
> org.apache.spark.streaming.kafka.KafkaTestUtils$EmbeddedZookeeper.(KafkaStreamSuite.scala:200)
>   at 
> org.apache.spark.streaming.kafka.KafkaStreamSuite.beforeFunction(KafkaStreamSuite.scala:62)
>   at 
> org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.setUp(JavaKafkaStreamSuite.java:51)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:27)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:24)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:136)
>   at com.novocode.junit.JUnitRunner.run(JUnitRunner.java:90)
>   at sbt.RunnerWrapper$1.runRunner2(FrameworkWrapper.java:223)
>   at sbt.RunnerWrapper$1.execute(FrameworkWrapper.java:236)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
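> A minimal sketch of the port-0 idea (illustrative only; the real fix would 
> apply the same pattern to the embedded ZooKeeper/Kafka servers in the test):
> {code}
> import java.net.ServerSocket
>
> // Bind to port 0 so the OS assigns a free ephemeral port, then read it back.
> // The test would hand this port to the embedded server instead of 2181.
> val socket = new ServerSocket(0)
> val boundPort = socket.getLocalPort
> socket.close()
> {code}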



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3681:
---
Component/s: PySpark

> Failed to serialize ArrayType or MapType after accessing them in Python
> -
>
> Key: SPARK-3681
> URL: https://issues.apache.org/jira/browse/SPARK-3681
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> {code}
> files_schema_rdd.map(lambda x: x.files).take(1)
> {code}
> Also, it will lose the schema after iterating over an ArrayType.
> {code}
> files_schema_rdd.map(lambda x: [f.batch for f in x.files]).take(1)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3663) Document SPARK_LOG_DIR and SPARK_PID_DIR

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3663:
---
Component/s: Documentation

> Document SPARK_LOG_DIR and SPARK_PID_DIR
> 
>
> Key: SPARK-3663
> URL: https://issues.apache.org/jira/browse/SPARK-3663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>
> I'm using these two parameters in some puppet scripts for standalone 
> deployment and realized that they're not documented anywhere. We should 
> document them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3610) History server log name should not be based on user input

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3610:
---
Component/s: Spark Core

> History server log name should not be based on user input
> -
>
> Key: SPARK-3610
> URL: https://issues.apache.org/jira/browse/SPARK-3610
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: SK
>Priority: Critical
>
> Right now we use the user-defined application name when creating the logging 
> file for the history server. We should use some type of GUID generated from 
> inside of Spark instead of allowing user input here. It can cause errors if 
> users provide characters that are not valid in filesystem paths.
> Original bug report:
> {quote}
> The default log files for the MLlib examples use a rather long naming 
> convention that includes special characters like parentheses and commas. For 
> example, one of my log files is named 
> "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032".
> When I click on the program on the history server page (at port 18080) to 
> view the detailed application logs, the history server crashes and I need to 
> restart it. I am using Spark 1.1 on a Mesos cluster.
> I renamed the log file by removing the special characters and then it loaded 
> up correctly. I am not sure which program is creating the log files. Can it 
> be changed so that the default log file naming convention does not include 
> special characters?
> {quote}
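> A rough sketch of the proposed direction (illustrative only, not actual Spark 
> code): build the log file name from a sanitized app name plus a generated id, 
> so user input can never produce an invalid filesystem path.
> {code}
> import java.util.UUID
>
> // Illustrative sketch: strip anything that is not path-safe and append a GUID.
> def logFileName(appName: String): String = {
>   val sanitized = appName.replaceAll("[^a-zA-Z0-9._-]", "-").toLowerCase
>   s"$sanitized-${UUID.randomUUID()}"
> }
> {code}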



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3604.

Resolution: Not a Problem

> unbounded recursion in getNumPartitions triggers stack overflow for large 
> UnionRDD
> --
>
> Key: SPARK-3604
> URL: https://issues.apache.org/jira/browse/SPARK-3604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: linux.  Used python, but error is in Scala land.
>Reporter: Eric Friedman
>Priority: Critical
>
> I have a large number of parquet files all with the same schema and attempted 
> to make a UnionRDD out of them.
> When I call getNumPartitions(), I get a stack overflow error
> that looks like this:
> Py4JJavaError: An error occurred while calling o3275.partitions.
> : java.lang.StackOverflowError
>   at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3684) Can't configure local dirs in Yarn mode

2014-09-24 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3684:


 Summary: Can't configure local dirs in Yarn mode
 Key: SPARK-3684
 URL: https://issues.apache.org/jira/browse/SPARK-3684
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or


We can't set SPARK_LOCAL_DIRS or spark.local.dirs because they're not picked up 
in Yarn mode. However, we can't set YARN_LOCAL_DIRS or LOCAL_DIRS either 
because these are overridden by Yarn.

I'm trying to set these through SPARK_YARN_USER_ENV. I'm aware that the default 
behavior is for Spark to use Yarn's local dirs, but right now there's no way to 
change it even if the user wants to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2691:
---
Assignee: Tim Chen  (was: Timothy Hunter)

> Allow Spark on Mesos to be launched with Docker
> ---
>
> Key: SPARK-2691
> URL: https://issues.apache.org/jira/browse/SPARK-2691
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Tim Chen
>  Labels: mesos
>
> Currently, to launch Spark with Mesos one must upload a tarball and specify 
> the executor URI to be passed in, which is downloaded on each slave or even 
> on each execution, depending on whether coarse-grained mode is used or not.
> We want to make Spark able to support launching executors via a Docker image 
> that utilizes the recent Docker and Mesos integration work. 
> With the recent integration, Spark can simply specify a Docker image and the 
> options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2691:
---
Assignee: Timothy Hunter

> Allow Spark on Mesos to be launched with Docker
> ---
>
> Key: SPARK-2691
> URL: https://issues.apache.org/jira/browse/SPARK-2691
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Timothy Hunter
>  Labels: mesos
>
> Currently, to launch Spark with Mesos one must upload a tarball and specify 
> the executor URI to be passed in, which is downloaded on each slave or even 
> on each execution, depending on whether coarse-grained mode is used or not.
> We want to make Spark able to support launching executors via a Docker image 
> that utilizes the recent Docker and Mesos integration work. 
> With the recent integration, Spark can simply specify a Docker image and the 
> options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2691:
---
Assignee: Timothy Chen  (was: Tim Chen)

> Allow Spark on Mesos to be launched with Docker
> ---
>
> Key: SPARK-2691
> URL: https://issues.apache.org/jira/browse/SPARK-2691
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Timothy Chen
>  Labels: mesos
>
> Currently, to launch Spark with Mesos one must upload a tarball and specify 
> the executor URI to be passed in, which is downloaded on each slave or even 
> on each execution, depending on whether coarse-grained mode is used or not.
> We want to make Spark able to support launching executors via a Docker image 
> that utilizes the recent Docker and Mesos integration work. 
> With the recent integration, Spark can simply specify a Docker image and the 
> options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode

2014-09-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3678:
-
Affects Version/s: (was: 1.2.0)
   1.1.0

> Yarn app name reported in RM is different between cluster and client mode
> -
>
> Key: SPARK-3678
> URL: https://issues.apache.org/jira/browse/SPARK-3678
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>
> If you launch an application in yarn cluster mode the name of the application 
> in the ResourceManager generally shows up as the full name 
> org.apache.spark.examples.SparkHdfsLR.  If you start the same app in client 
> mode it shows up as SparkHdfsLR.
> We should be consistent between them.  
> I haven't looked at it in detail; perhaps it's only the examples, but I think 
> I've seen this with customer apps also.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-09-24 Thread Tamas Jambor (JIRA)
Tamas Jambor created SPARK-3683:
---

 Summary: PySpark Hive query generates "NULL" instead of None
 Key: SPARK-3683
 URL: https://issues.apache.org/jira/browse/SPARK-3683
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: Tamas Jambor


When I run a Hive query in Spark SQL, I get the new Row object, which does 
not convert Hive NULL into Python None; instead it keeps the string 'NULL'. 

It's only an issue with the String type; it works with other types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-889) Bring back DFS broadcast

2014-09-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147047#comment-14147047
 ] 

Reynold Xin commented on SPARK-889:
---

Yea I think we should close this as won't fix for now.

> Bring back DFS broadcast
> 
>
> Key: SPARK-889
> URL: https://issues.apache.org/jira/browse/SPARK-889
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>Priority: Minor
>
> DFS broadcast was a simple way to get better-than-single-master performance 
> for broadcast, so we should add it back for people who have HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-889) Bring back DFS broadcast

2014-09-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-889.
---
Resolution: Won't Fix

> Bring back DFS broadcast
> 
>
> Key: SPARK-889
> URL: https://issues.apache.org/jira/browse/SPARK-889
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>Priority: Minor
>
> DFS broadcast was a simple way to get better-than-single-master performance 
> for broadcast, so we should add it back for people who have HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3639) Kinesis examples set master as local

2014-09-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147042#comment-14147042
 ] 

Josh Rosen commented on SPARK-3639:
---

This sounds reasonable to me; feel free to open a PR.  If you look at most of 
the other Spark examples, they only set the appName when creating the 
SparkContext and leave the master unspecified in order to allow it to be set 
when passing the script to {{spark-submit}}.
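
For example, the usual pattern looks roughly like this (sketch; the app name is 
arbitrary):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Set only the app name; leave the master unset so it can be supplied with
// spark-submit --master <url>, which lets the example also run on a cluster.
val sparkConf = new SparkConf().setAppName("KinesisWordCount")
val sc = new SparkContext(sparkConf)
{code}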

> Kinesis examples set master as local
> 
>
> Key: SPARK-3639
> URL: https://issues.apache.org/jira/browse/SPARK-3639
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Aniket Bhatnagar
>Priority: Minor
>  Labels: examples
>
> Kinesis examples set the master as local, thus not allowing the example to be 
> tested on a cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-889) Bring back DFS broadcast

2014-09-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147032#comment-14147032
 ] 

Josh Rosen commented on SPARK-889:
--

In fact, I think [~rxin] has some JIRAs and PRs to make TorrentBroadcast _even_ 
better than it is now (it was greatly improved from 1.0.2 to 1.1.0), so it's 
probably safe to close this.

> Bring back DFS broadcast
> 
>
> Key: SPARK-889
> URL: https://issues.apache.org/jira/browse/SPARK-889
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>Priority: Minor
>
> DFS broadcast was a simple way to get better-than-single-master performance 
> for broadcast, so we should add it back for people who have HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3682) Add helpful warnings to the UI

2014-09-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3682:
--
Target Version/s: 1.3.0  (was: 1.2.0)

> Add helpful warnings to the UI
> --
>
> Key: SPARK-3682
> URL: https://issues.apache.org/jira/browse/SPARK-3682
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>
> Spark has a zillion configuration options and a zillion different things that 
> can go wrong with a job.  Improvements like incremental and better metrics 
> and the proposed spark replay debugger provide more insight into what's going 
> on under the covers.  However, it's difficult for non-advanced users to 
> synthesize this information and understand where to direct their attention. 
> It would be helpful to have some sort of central location on the UI users 
> could go to that would provide indications about why an app/job is failing or 
> performing poorly.
> Some helpful messages that we could provide:
> * Warn that the tasks in a particular stage are spending a long time in GC.
> * Warn that spark.shuffle.memoryFraction does not fit inside the young 
> generation.
> * Warn that tasks in a particular stage are very short, and that the number 
> of partitions should probably be decreased.
> * Warn that tasks in a particular stage are spilling a lot, and that the 
> number of partitions should probably be decreased.
> * Warn that a cached RDD that gets a lot of use does not fit in memory, and a 
> lot of time is being spent recomputing it.
> To start, probably two kinds of warnings would be most helpful.
> * Warnings at the app level that report on misconfigurations, issues with the 
> general health of executors.
> * Warnings at the job level that indicate why a job might be performing 
> slowly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3682) Add helpful warnings to the UI

2014-09-24 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3682:
-
 Target Version/s: 1.2.0
Affects Version/s: 1.1.0

> Add helpful warnings to the UI
> --
>
> Key: SPARK-3682
> URL: https://issues.apache.org/jira/browse/SPARK-3682
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>
> Spark has a zillion configuration options and a zillion different things that 
> can go wrong with a job.  Improvements like incremental and better metrics 
> and the proposed spark replay debugger provide more insight into what's going 
> on under the covers.  However, it's difficult for non-advanced users to 
> synthesize this information and understand where to direct their attention. 
> It would be helpful to have some sort of central location on the UI users 
> could go to that would provide indications about why an app/job is failing or 
> performing poorly.
> Some helpful messages that we could provide:
> * Warn that the tasks in a particular stage are spending a long time in GC.
> * Warn that spark.shuffle.memoryFraction does not fit inside the young 
> generation.
> * Warn that tasks in a particular stage are very short, and that the number 
> of partitions should probably be decreased.
> * Warn that tasks in a particular stage are spilling a lot, and that the 
> number of partitions should probably be decreased.
> * Warn that a cached RDD that gets a lot of use does not fit in memory, and a 
> lot of time is being spent recomputing it.
> To start, probably two kinds of warnings would be most helpful.
> * Warnings at the app level that report on misconfigurations, issues with the 
> general health of executors.
> * Warnings at the job level that indicate why a job might be performing 
> slowly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2131) Collect per-task filesystem-bytes-read/written metrics

2014-09-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-2131.
---
Resolution: Duplicate

> Collect per-task filesystem-bytes-read/written metrics
> --
>
> Key: SPARK-2131
> URL: https://issues.apache.org/jira/browse/SPARK-2131
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3682) Add helpful warnings to the UI

2014-09-24 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3682:
-

 Summary: Add helpful warnings to the UI
 Key: SPARK-3682
 URL: https://issues.apache.org/jira/browse/SPARK-3682
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Sandy Ryza


Spark has a zillion configuration options and a zillion different things that 
can go wrong with a job.  Improvements like incremental and better metrics and 
the proposed spark replay debugger provide more insight into what's going on 
under the covers.  However, it's difficult for non-advanced users to synthesize 
this information and understand where to direct their attention. It would be 
helpful to have some sort of central location on the UI users could go to that 
would provide indications about why an app/job is failing or performing poorly.

Some helpful messages that we could provide:
* Warn that the tasks in a particular stage are spending a long time in GC.
* Warn that spark.shuffle.memoryFraction does not fit inside the young 
generation.
* Warn that tasks in a particular stage are very short, and that the number of 
partitions should probably be decreased.
* Warn that tasks in a particular stage are spilling a lot, and that the number 
of partitions should probably be decreased.
* Warn that a cached RDD that gets a lot of use does not fit in memory, and a 
lot of time is being spent recomputing it.

To start, probably two kinds of warnings would be most helpful.
* Warnings at the app level that report on misconfigurations, issues with the 
general health of executors.
* Warnings at the job level that indicate why a job might be performing slowly.
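
As a rough illustration of the kind of check involved (hypothetical names and 
threshold, not an existing Spark API), the GC warning might look like:

{code}
// Hypothetical sketch: flag a stage whose tasks spend more than 20% of their
// run time in GC. StageTimes and the 0.2 threshold are illustrative only.
case class StageTimes(runTimeMs: Long, gcTimeMs: Long)

def gcWarning(stage: StageTimes): Option[String] = {
  val gcFraction = stage.gcTimeMs.toDouble / math.max(stage.runTimeMs, 1L)
  if (gcFraction > 0.2) {
    Some(f"Tasks spent ${gcFraction * 100}%.1f%% of their time in GC")
  } else {
    None
  }
}
{code}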



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146903#comment-14146903
 ] 

Apache Spark commented on SPARK-3681:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2526

> Failed to serialize ArrayType or MapType after accessing them in Python
> -
>
> Key: SPARK-3681
> URL: https://issues.apache.org/jira/browse/SPARK-3681
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> {code}
> files_schema_rdd.map(lambda x: x.files).take(1)
> {code}
> Also, it will lose the schema after iterating over an ArrayType.
> {code}
> files_schema_rdd.map(lambda x: [f.batch for f in x.files]).take(1)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3681) Failed to serialize ArrayType or MapType after accessing them in Python

2014-09-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3681:
-

 Summary: Failed to serialize ArrayType or MapType after 
accessing them in Python
 Key: SPARK-3681
 URL: https://issues.apache.org/jira/browse/SPARK-3681
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


{code}
files_schema_rdd.map(lambda x: x.files).take(1)
{code}

Also, it will lose the schema after iterating over an ArrayType.

{code}
files_schema_rdd.map(lambda x: [f.batch for f in x.files]).take(1)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3679) pickle the exact globals of functions

2014-09-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3679.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2522
[https://github.com/apache/spark/pull/2522]

> pickle the exact globals of functions
> -
>
> Key: SPARK-3679
> URL: https://issues.apache.org/jira/browse/SPARK-3679
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.0
>
>
> function.func_code.co_names has all the names used in the function, including 
> the names of attributes. It will pickle some unnecessary globals if there is a 
> global with the same name as an attribute (in co_names).
> There is a regression introduced by PR 2114 
> https://github.com/apache/spark/pull/2144/files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-889) Bring back DFS broadcast

2014-09-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146804#comment-14146804
 ] 

Andrew Ash commented on SPARK-889:
--

[~matei] should we close this ticket as Won't Fix then, since the effort is better 
spent making TorrentBroadcast better?

> Bring back DFS broadcast
> 
>
> Key: SPARK-889
> URL: https://issues.apache.org/jira/browse/SPARK-889
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>Priority: Minor
>
> DFS broadcast was a simple way to get better-than-single-master performance 
> for broadcast, so we should add it back for people who have HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3634) Python modules added through addPyFile should take precedence over system modules

2014-09-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3634.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2492
[https://github.com/apache/spark/pull/2492]

> Python modules added through addPyFile should take precedence over system 
> modules
> -
>
> Key: SPARK-3634
> URL: https://issues.apache.org/jira/browse/SPARK-3634
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Josh Rosen
> Fix For: 1.2.0
>
>
> Python modules added through {{SparkContext.addPyFile()}} are currently 
> _appended_ to {{sys.path}}; this is probably the opposite of the behavior 
> that we want, since it causes system versions of modules to take precedence 
> over versions explicitly added by users.
> To fix this, we should change the {{sys.path}} manipulation code in 
> {{context.py}} and {{worker.py}} to prepend files to {{sys.path}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3680) java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted Parquet Metastore tables

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146738#comment-14146738
 ] 

Apache Spark commented on SPARK-3680:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/2525

> java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted 
> Parquet Metastore tables
> ---
>
> Key: SPARK-3680
> URL: https://issues.apache.org/jira/browse/SPARK-3680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3680) java.lang.Exception: makeCopy when using HiveGeneric UDFs on Converted Parquet Metastore tables

2014-09-24 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3680:
---

 Summary: java.lang.Exception: makeCopy when using HiveGeneric UDFs 
on Converted Parquet Metastore tables
 Key: SPARK-3680
 URL: https://issues.apache.org/jira/browse/SPARK-3680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146718#comment-14146718
 ] 

Apache Spark commented on SPARK-3628:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/2524

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>Assignee: Nan Zhu
>Priority: Blocker
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.
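> A conceptual sketch of the intended semantics (illustrative only, not the 
> actual DAGScheduler code): apply an accumulator update only the first time a 
> given result partition completes, so reruns are not double-counted.
> {code}
> import scala.collection.mutable
>
> // Track which (stageId, partitionId) pairs have already contributed updates.
> val appliedPartitions = mutable.HashSet[(Int, Int)]()
>
> def maybeApplyUpdate(stageId: Int, partitionId: Int, apply: () => Unit): Unit = {
>   if (appliedPartitions.add((stageId, partitionId))) {
>     apply()  // first completion of this partition: apply the update
>   }          // otherwise it is a resubmitted or speculative copy: skip it
> }
> {code}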



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-09-24 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146714#comment-14146714
 ] 

Nan Zhu commented on SPARK-3628:


https://github.com/apache/spark/pull/2524

> Don't apply accumulator updates multiple times for tasks in result stages
> -
>
> Key: SPARK-3628
> URL: https://issues.apache.org/jira/browse/SPARK-3628
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>Priority: Blocker
>
> In previous versions of Spark, accumulator updates only got applied once for 
> accumulators that are only used in actions (i.e. result stages), letting you 
> use them to deterministically compute a result. Unfortunately, this got 
> broken in some recent refactorings.
> This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
> issue is about applying the same semantics to intermediate stages too, which 
> is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3659) Set EC2 version to 1.1.0 in master branch

2014-09-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3659.

   Resolution: Fixed
Fix Version/s: 1.2.0

https://github.com/apache/spark/pull/2510

> Set EC2 version to 1.1.0 in master branch
> -
>
> Key: SPARK-3659
> URL: https://issues.apache.org/jira/browse/SPARK-3659
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Minor
> Fix For: 1.2.0
>
>
> Master branch should be in sync with branch-1.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3679) pickle the exact globals of functions

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146691#comment-14146691
 ] 

Apache Spark commented on SPARK-3679:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2522

> pickle the exact globals of functions
> -
>
> Key: SPARK-3679
> URL: https://issues.apache.org/jira/browse/SPARK-3679
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> function.func_code.co_names has all the names used in the function, including 
> the names of attributes. It will pickle some unnecessary globals if there is a 
> global with the same name as an attribute (in co_names).
> There is a regression introduced by PR 2114 
> https://github.com/apache/spark/pull/2144/files



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3679) pickle the exact globals of functions

2014-09-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3679:
-

 Summary: pickle the exact globals of functions
 Key: SPARK-3679
 URL: https://issues.apache.org/jira/browse/SPARK-3679
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Davies Liu
Priority: Critical


function.func_code.co_names has all the names used in the function, including 
the names of attributes. It will pickle some unnecessary globals if there is a 
global with the same name as an attribute (in co_names).

There is a regression introduced by PR 2114 
https://github.com/apache/spark/pull/2144/files





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3466) Limit size of results that a driver collects for each action

2014-09-24 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-3466:
--
Description: Right now, operations like {{collect()}} and {{take()}} can 
crash the driver with an OOM if they bring back too much data. We should add a 
{{spark.driver.maxResultSize}} setting (or something like that) that will make 
the driver abort a job if its result is too big. We can set it to some fraction 
of the driver's memory by default, or to something like 100 MB.  (was: Right 
now, operations like collect() and take() can crash the driver if they bring 
back too many data. We should add a spark.driver.maxResultSize setting (or 
something like that) that will make the driver abort a job if its result is too 
big. We can set it to some fraction of the driver's memory by default, or to 
something like 100 MB.)

> Limit size of results that a driver collects for each action
> 
>
> Key: SPARK-3466
> URL: https://issues.apache.org/jira/browse/SPARK-3466
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>
> Right now, operations like {{collect()}} and {{take()}} can crash the driver 
> with an OOM if they bring back too much data. We should add a 
> {{spark.driver.maxResultSize}} setting (or something like that) that will 
> make the driver abort a job if its result is too big. We can set it to some 
> fraction of the driver's memory by default, or to something like 100 MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example

2014-09-24 Thread Evan Samanas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146639#comment-14146639
 ] 

Evan Samanas commented on SPARK-3662:
-

I wouldn't focus on the example, that I modified it, or whether I should be 
importing a small portion of pandas.  The issue here is that Spark breaks in 
this case because of a name collision.  Modifying the example is simply the one 
reproducer I've found.

I was modifying the example to learn about how Spark ships Python code to the 
cluster.  In this case, I expected pandas to only be imported in the driver 
program and not to be imported by any workers.  The workers do not have pandas 
installed, so expected behavior means the example would run to completion, and 
an ImportError would mean that the workers are importing things they don't need 
for the task at hand.

The way I expected Spark to work IS actually how Spark works...modules will 
only be imported by workers if a function passed to them uses the modules, but 
this error showed me false evidence to the contrary.  I'm assuming the error is 
in Spark's modifications to CloudPickle...not in the way the example is set up.

> Importing pandas breaks included pi.py example
> --
>
> Key: SPARK-3662
> URL: https://issues.apache.org/jira/browse/SPARK-3662
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.1.0
> Environment: Xubuntu 14.04.  Yarn cluster running on Ubuntu 12.04.
>Reporter: Evan Samanas
>
> If I add "import pandas" at the top of the included pi.py example and submit 
> using "spark-submit --master yarn-client", I get this stack trace:
> {code}
> Traceback (most recent call last):
>   File "/home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py", line 
> 39, in 
> count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
>   File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 759, in reduce
> vals = self.mapPartitions(func).collect()
>   File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 723, in collect
> bytesInJava = self._jrdd.collect().iterator()
>   File 
> "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 
> 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: 
> org.apache.spark.api.python.PythonException (Traceback (most recent call 
> last):
>   File 
> "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py",
>  line 75, in main
> command = pickleSer._read_with_length(infile)
>   File 
> "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
>  line 150, in _read_with_length
> return self.loads(obj)
> ImportError: No module named algos
> {code}
> The example works fine if I move the statement "from random import random" 
> from the top and into the function (def f(_)) defined in the example.  Near 
> as I can tell, "random" is getting confused with a function of the same name 
> within pandas.algos.  
> Submitting the same script using --master local works, but gives a 
> distressing amount of random characters to stdout or stderr and messes up my 
> terminal:
> {code}
> ...
> [thousands of garbled control characters omitted]
> 14/09/23 15:42:09 INFO SparkContext: Job finished: reduce at 
> /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 
> 11.276879779 s
> [more garbled output follows]
> {code}

[jira] [Commented] (SPARK-3466) Limit size of results that a driver collects for each action

2014-09-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146640#comment-14146640
 ] 

Andrew Ash commented on SPARK-3466:
---

How would you design this feature?

I can imagine measuring the size of partitions / RDD elements while they are 
held in memory across the cluster, sending those sizes back to the driver, and 
having the driver throw an exception if the requested size exceeds the 
threshold.  Otherwise proceed as normal.

Is that how you were envisioning implementation?
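
Something along these lines, perhaps (very rough sketch; the property name is 
the one proposed in the description, and the size-reporting hook is assumed):

{code}
import org.apache.spark.SparkConf

// Rough sketch of the driver-side check: each task reports the serialized size
// of its result, and the driver aborts the job once the running total exceeds
// the configured limit (default here: 100 MB).
val conf = new SparkConf()
val maxResultSize = conf.getLong("spark.driver.maxResultSize", 100L * 1024 * 1024)
var totalResultSize = 0L

def onTaskResult(resultSizeBytes: Long): Unit = {
  totalResultSize += resultSizeBytes
  if (totalResultSize > maxResultSize) {
    throw new IllegalStateException(
      s"Total result size $totalResultSize bytes exceeds spark.driver.maxResultSize")
  }
}
{code}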

> Limit size of results that a driver collects for each action
> 
>
> Key: SPARK-3466
> URL: https://issues.apache.org/jira/browse/SPARK-3466
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>
> Right now, operations like collect() and take() can crash the driver if they 
> bring back too much data. We should add a spark.driver.maxResultSize setting 
> (or something like that) that will make the driver abort a job if its result 
> is too big. We can set it to some fraction of the driver's memory by default, 
> or to something like 100 MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode

2014-09-24 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3678:


 Summary: Yarn app name reported in RM is different between cluster 
and client mode
 Key: SPARK-3678
 URL: https://issues.apache.org/jira/browse/SPARK-3678
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves


If you launch an application in yarn cluster mode the name of the application 
in the ResourceManager generally shows up as the full name 
org.apache.spark.examples.SparkHdfsLR.  If you start the same app in client 
mode it shows up as SparkHdfsLR.

We should be consistent between them.  

I haven't looked at it in detail; perhaps it's only the examples, but I think 
I've seen this with customer apps also.
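
One possible direction (sketch only, not a decided fix): have cluster mode 
derive the same short display name that client mode ends up showing, e.g. by 
stripping the package prefix when no explicit app name was given.

{code}
// Illustrative only: normalize a fully qualified main class to a short name.
def displayName(mainClass: String, explicitName: Option[String]): String =
  explicitName.getOrElse(mainClass.split('.').last)

// displayName("org.apache.spark.examples.SparkHdfsLR", None) == "SparkHdfsLR"
{code}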



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker

2014-09-24 Thread Ryan D Braley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146246#comment-14146246
 ] 

Ryan D Braley commented on SPARK-2691:
--

+1. Spark typically lags behind Mesos in version numbers, so if you run Mesos 
today you have to choose between Spark and Docker. With this we could have our 
cake and eat it too :) 

> Allow Spark on Mesos to be launched with Docker
> ---
>
> Key: SPARK-2691
> URL: https://issues.apache.org/jira/browse/SPARK-2691
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>  Labels: mesos
>
> Currently, to launch Spark with Mesos one must upload a tarball and specify 
> the executor URI to be passed in, which is downloaded on each slave or even 
> on each execution, depending on whether coarse-grained mode is used or not.
> We want to make Spark able to support launching executors via a Docker image 
> that utilizes the recent Docker and Mesos integration work. 
> With the recent integration, Spark can simply specify a Docker image and the 
> options that are needed, and it should continue to work as-is.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3639) Kinesis examples set master as local

2014-09-24 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146245#comment-14146245
 ] 

Matthew Farrellee commented on SPARK-3639:
--

seems reasonable to me

> Kinesis examples set master as local
> 
>
> Key: SPARK-3639
> URL: https://issues.apache.org/jira/browse/SPARK-3639
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Aniket Bhatnagar
>Priority: Minor
>  Labels: examples
>
> Kinesis examples set the master as local, thus not allowing the example to be 
> tested on a cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3677) Scalastyle is never applied to the sources under yarn/common

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146171#comment-14146171
 ] 

Apache Spark commented on SPARK-3677:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2520

> Scalastyle is never applied to the sources under yarn/common
> 
>
> Key: SPARK-3677
> URL: https://issues.apache.org/jira/browse/SPARK-3677
> Project: Spark
>  Issue Type: Bug
>  Components: Build, YARN
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>
> When we run "sbt -Pyarn scalastyle", scalastyle is not applied to the sources 
> under yarn/common.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3677) Scalastyle is never applied to the sources under yarn/common

2014-09-24 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3677:
-

 Summary: Scalastyle is never applied to the sources under 
yarn/common
 Key: SPARK-3677
 URL: https://issues.apache.org/jira/browse/SPARK-3677
 Project: Spark
  Issue Type: Bug
  Components: Build, YARN
Affects Versions: 1.2.0
Reporter: Kousuke Saruta


When we run "sbt -Pyarn scalastyle", scalastyle is not applied to the sources 
under yarn/common.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3526) Docs section on data locality

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146094#comment-14146094
 ] 

Apache Spark commented on SPARK-3526:
-

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/2519

> Docs section on data locality
> -
>
> Key: SPARK-3526
> URL: https://issues.apache.org/jira/browse/SPARK-3526
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.0.2
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>
> Several threads on the mailing list have been about data locality and how to 
> interpret PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, etc.  Let's get some more 
> details in the docs on this concept so we can point future questions there.
> A couple people appreciated the below description of locality so it could be 
> a good starting point:
> {quote}
> The locality is how close the data is to the code that's processing it.  
> PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
> it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
> node, or in another executor on the same node, so is a little slower because 
> the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
> -- data is on a different server so needs to be sent over the network.
> Spark switches to lower locality levels when there's no unprocessed data on a 
> node that has idle CPUs.  In that situation you have two options: wait until 
> the busy CPUs free up so you can start another task that uses data on that 
> server, or start a new task on a farther away server that needs to bring data 
> from that remote place.  What Spark typically does is wait a bit in the hopes 
> that a busy CPU frees up.  Once that timeout expires, it starts moving the 
> data from far away to the free CPU.
> {quote}
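> The wait before falling back to a less-local level is configurable; a short 
> sketch using the existing spark.locality.wait setting (value in milliseconds 
> for Spark 1.x):
> {code}
> import org.apache.spark.SparkConf
>
> // Wait up to 10 seconds for a more local slot before shipping data remotely.
> val conf = new SparkConf().set("spark.locality.wait", "10000")
> {code}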



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error

2014-09-24 Thread wangfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146050#comment-14146050
 ] 

wangfei commented on SPARK-3676:


Hmm, I see, thanks for that.

> jdk version lead to spark sql test suite error
> --
>
> Key: SPARK-3676
> URL: https://issues.apache.org/jira/browse/SPARK-3676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> System.out.println(1/500d) gets a different result in different JDK versions:
> jdk 1.6.0(_31)  0.0020
> jdk 1.7.0(_05)  0.002
> This will lead to Spark SQL Hive test suite failures (replay by setting the 
> JDK version to 1.6.0_31):
> [info] - division *** FAILED ***
> [info]   Results do not match for division:
> [info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
> [info]   == Parsed Logical Plan ==
> [info]   Limit 1
> [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS 
> c_2#694,(1 / COUNT(1)) AS c_3#695]
> [info] UnresolvedRelation None, src, None
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   Limit 1
> [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
> c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
> DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
> leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   Limit 1
> [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
> c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
> [info] Project []
> [info]  MetastoreRelation default, src, None
> [info]   
> [info]   == Physical Plan ==
> [info]   Limit 1
> [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
> c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
> DoubleType)) AS c_3#695]
> [info] Exchange SinglePartition
> [info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
> [info]   HiveTableScan [], (MetastoreRelation default, src, None), None
> [info]   
> [info]   Code Generation: false
> [info]   == RDD ==
> [info]   c_0c_1 c_2 c_3
> [info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
> [info]   !2.0   0.5 0.  0.002   2.0 0.5 
> 0.  0.0020 (HiveComparisonTest.scala:370)
> [info] - timestamp cast #1 *** FAILED ***
> [info]   Results do not match for timestamp cast #1:
> [info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
> [info]   == Parsed Logical Plan ==
> [info]   Limit 1
> [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
> [info] UnresolvedRelation None, src, None
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   Limit 1
> [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   Limit 1
> [info]Project [0.0010 AS c_0#995]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Physical Plan ==
> [info]   Limit 1
> [info]Project [0.0010 AS c_0#995]
> [info] HiveTableScan [], (MetastoreRelation default, src, None), None
> [info]   
> [info]   Code Generation: false
> [info]   == RDD ==
> [info]   c_0
> [info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
> [info]   !0.001   0.0010 (HiveComparisonTest.scala:370)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3663) Document SPARK_LOG_DIR and SPARK_PID_DIR

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146041#comment-14146041
 ] 

Apache Spark commented on SPARK-3663:
-

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/2518

> Document SPARK_LOG_DIR and SPARK_PID_DIR
> 
>
> Key: SPARK-3663
> URL: https://issues.apache.org/jira/browse/SPARK-3663
> Project: Spark
>  Issue Type: Documentation
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>
> I'm using these two parameters in some Puppet scripts for standalone 
> deployment and realized that they're not documented anywhere. We should 
> document them.
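
A minimal sketch of what the documentation could show for these two variables in 
conf/spark-env.sh (the paths below are illustrative assumptions, not shipped 
defaults):

{code}
# conf/spark-env.sh -- illustrative values only
export SPARK_LOG_DIR=/var/log/spark   # where the standalone daemons write their logs
export SPARK_PID_DIR=/var/run/spark   # where the daemon scripts write their PID files
{code}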



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error

2014-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146040#comment-14146040
 ] 

Sean Owen commented on SPARK-3676:
--

(For the interested, I looked it up, since the behavior change sounds 
surprising. This is in fact a bug in Java 6 that was fixed in Java 7 
(http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4428022). It may even be 
fixed in later versions of Java 6, but it is not fixed in the fairly recent 
build I have.)
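
A tiny Scala sketch of the difference being described (the values in the 
comments are the ones quoted in the issue description; the numeric comparison 
at the end is just one illustrative way a test could sidestep the formatting 
difference, not necessarily what the eventual fix does):

{code}
object DoubleToStringCheck extends App {
  // Double-to-string formatting differs across JDKs for this value:
  // JDK 1.6.0_31 prints 0.0020, JDK 1.7.0_05 prints 0.002.
  println(1 / 500d)

  // Comparing parsed doubles instead of their string forms makes the two
  // representations compare equal.
  println("0.002".toDouble == "0.0020".toDouble)   // true
}
{code}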

> jdk version lead to spark sql test suite error
> --
>
> Key: SPARK-3676
> URL: https://issues.apache.org/jira/browse/SPARK-3676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> System.out.println(1/500d) gives different results in different JDK versions:
> jdk 1.6.0(_31)  0.0020
> jdk 1.7.0(_05)  0.002
> This causes the Spark SQL Hive test suite to fail (reproduce by setting the 
> JDK version to 1.6.0_31):
> [info] - division *** FAILED ***
> [info]   Results do not match for division:
> [info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
> [info]   == Parsed Logical Plan ==
> [info]   Limit 1
> [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS 
> c_2#694,(1 / COUNT(1)) AS c_3#695]
> [info] UnresolvedRelation None, src, None
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   Limit 1
> [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
> c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
> DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
> leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   Limit 1
> [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
> c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
> [info] Project []
> [info]  MetastoreRelation default, src, None
> [info]   
> [info]   == Physical Plan ==
> [info]   Limit 1
> [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
> c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
> DoubleType)) AS c_3#695]
> [info] Exchange SinglePartition
> [info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
> [info]   HiveTableScan [], (MetastoreRelation default, src, None), None
> [info]   
> [info]   Code Generation: false
> [info]   == RDD ==
> [info]   c_0c_1 c_2 c_3
> [info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
> [info]   !2.0   0.5 0.  0.002   2.0 0.5 
> 0.  0.0020 (HiveComparisonTest.scala:370)
> [info] - timestamp cast #1 *** FAILED ***
> [info]   Results do not match for timestamp cast #1:
> [info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
> [info]   == Parsed Logical Plan ==
> [info]   Limit 1
> [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
> [info] UnresolvedRelation None, src, None
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   Limit 1
> [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   Limit 1
> [info]Project [0.0010 AS c_0#995]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Physical Plan ==
> [info]   Limit 1
> [info]Project [0.0010 AS c_0#995]
> [info] HiveTableScan [], (MetastoreRelation default, src, None), None
> [info]   
> [info]   Code Generation: false
> [info]   == RDD ==
> [info]   c_0
> [info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
> [info]   !0.001   0.0010 (HiveComparisonTest.scala:370)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization

2014-09-24 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146039#comment-14146039
 ] 

Aaron Davidson commented on SPARK-3267:
---

I don't have it anymore, unfortunately. Michael and I did a little digging at 
the time, and I think we found the reason for the deadlock, shown in the stack 
traces above, but decided it was a very unlikely scenario. Indeed, the query 
did not consistently deadlock; this only occurred a single time.
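
For readers trying to picture the scenario, here is a generic Scala sketch (not 
Spark code, and only one plausible shape of the problem) of the kind of 
lock-ordering deadlock the stack traces suggest: one thread holds an explicit 
lock and then triggers an object's initialization, while another thread is 
inside that object's initializer waiting for the same explicit lock.

{code}
object ReflectionLock                       // stands in for ScalaReflectionLock

object Types {
  // Holding Types' class-initialization lock, wait for ReflectionLock.
  ReflectionLock.synchronized { /* ... */ }
  val defaultValue = 0L
}

object DeadlockSketch extends App {
  val t1 = new Thread(new Runnable {
    def run(): Unit =
      // Takes ReflectionLock first, then forces Types' initialization.
      ReflectionLock.synchronized { Thread.sleep(100); Types.defaultValue }
  })
  val t2 = new Thread(new Runnable {
    // Forces Types' initialization, which blocks on ReflectionLock above.
    def run(): Unit = Types.defaultValue
  })
  t1.start(); t2.start()
  t1.join(); t2.join()                      // never returns: the two threads deadlock
}
{code}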

> Deadlock between ScalaReflectionLock and Data type initialization
> -
>
> Key: SPARK-3267
> URL: https://issues.apache.org/jira/browse/SPARK-3267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Aaron Davidson
>Priority: Critical
>
> Deadlock here:
> {code}
> "Executor task launch worker-0" daemon prio=10 tid=0x7fab50036000 
> nid=0x27a in Object.wait() [0x7fab60c2e000
> ]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:202)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:195)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
> 93)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scal
> a:175)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:304)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:195)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
> 93)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:314)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:195)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
> 93)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:313)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:195)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
> ...
> {code}
> and
> {code}
> "Executor task launch worker-2" daemon prio=10 tid=0x7fab100f0800 
> nid=0x27e in Object.wait() [0x7fab0eeec000
> ]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
> - locked <0x00064e5d9a48> (a 
> org.apache.spark.sql.catalyst.expressions.Cast)
> at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
> at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
> scala:139)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
> scala:139)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfu

[jira] [Commented] (SPARK-3676) jdk version lead to spark sql test suite error

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146035#comment-14146035
 ] 

Apache Spark commented on SPARK-3676:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2517

> jdk version lead to spark sql test suite error
> --
>
> Key: SPARK-3676
> URL: https://issues.apache.org/jira/browse/SPARK-3676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> System.out.println(1/500d) gives different results in different JDK versions:
> jdk 1.6.0(_31)  0.0020
> jdk 1.7.0(_05)  0.002
> This causes the Spark SQL Hive test suite to fail (reproduce by setting the 
> JDK version to 1.6.0_31):
> [info] - division *** FAILED ***
> [info]   Results do not match for division:
> [info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
> [info]   == Parsed Logical Plan ==
> [info]   Limit 1
> [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS 
> c_2#694,(1 / COUNT(1)) AS c_3#695]
> [info] UnresolvedRelation None, src, None
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   Limit 1
> [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
> c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
> DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
> leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   Limit 1
> [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
> c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
> [info] Project []
> [info]  MetastoreRelation default, src, None
> [info]   
> [info]   == Physical Plan ==
> [info]   Limit 1
> [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
> c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
> DoubleType)) AS c_3#695]
> [info] Exchange SinglePartition
> [info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
> [info]   HiveTableScan [], (MetastoreRelation default, src, None), None
> [info]   
> [info]   Code Generation: false
> [info]   == RDD ==
> [info]   c_0c_1 c_2 c_3
> [info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
> [info]   !2.0   0.5 0.  0.002   2.0 0.5 
> 0.  0.0020 (HiveComparisonTest.scala:370)
> [info] - timestamp cast #1 *** FAILED ***
> [info]   Results do not match for timestamp cast #1:
> [info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
> [info]   == Parsed Logical Plan ==
> [info]   Limit 1
> [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
> [info] UnresolvedRelation None, src, None
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   Limit 1
> [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   Limit 1
> [info]Project [0.0010 AS c_0#995]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Physical Plan ==
> [info]   Limit 1
> [info]Project [0.0010 AS c_0#995]
> [info] HiveTableScan [], (MetastoreRelation default, src, None), None
> [info]   
> [info]   Code Generation: false
> [info]   == RDD ==
> [info]   c_0
> [info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
> [info]   !0.001   0.0010 (HiveComparisonTest.scala:370)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3662) Importing pandas breaks included pi.py example

2014-09-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146017#comment-14146017
 ] 

Sean Owen commented on SPARK-3662:
--

Maybe I'm missing something, but does this just mean you can't "import pandas" 
at all? If you're modifying the example, you should import only what you need 
from pandas. Or it may indeed be that you need to modify the "import random" 
to accommodate the other changes you want to make.

But what is the problem with the included example? It runs fine without 
modifications, no?

> Importing pandas breaks included pi.py example
> --
>
> Key: SPARK-3662
> URL: https://issues.apache.org/jira/browse/SPARK-3662
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.1.0
> Environment: Xubuntu 14.04.  Yarn cluster running on Ubuntu 12.04.
>Reporter: Evan Samanas
>
> If I add "import pandas" at the top of the included pi.py example and submit 
> using "spark-submit --master yarn-client", I get this stack trace:
> {code}
> Traceback (most recent call last):
>   File "/home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py", line 
> 39, in 
> count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
>   File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 759, in reduce
> vals = self.mapPartitions(func).collect()
>   File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 723, in collect
> bytesInJava = self._jrdd.collect().iterator()
>   File 
> "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 
> 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: 
> org.apache.spark.api.python.PythonException (Traceback (most recent call 
> last):
>   File 
> "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py",
>  line 75, in main
> command = pickleSer._read_with_length(infile)
>   File 
> "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
>  line 150, in _read_with_length
> return self.loads(obj)
> ImportError: No module named algos
> {code}
> The example works fine if I move the statement "from random import random" 
> from the top into the function (def f(_)) defined in the example. As near 
> as I can tell, "random" is getting confused with a function of the same name 
> within pandas.algos.
> Submitting the same script using --master local works, but gives a 
> distressing amount of random characters to stdout or stderr and messes up my 
> terminal:
> {code}
> ...
> @J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J 
> @J!@J"@J#@J$@J%@J&@J'@J(@J)@J*@J+@J,@J-@J.@J/@J0@J1@J2@J3@J4@J5@J6@J7@J8@J9@J:@J;@J<@J=@J>@J?@J@@JA@JB@JC@JD@JE@JF@JG@JH@JI@JJ@JK@JL@JM@JN@JO@JP@JQ@JR@JS@JT@JU@JV@JW@JX@JY@JZ@J[@J\@J]@J^@J_@J`@Ja@Jb@Jc@Jd@Je@Jf@Jg@Jh@Ji@Jj@Jk@Jl@Jm@Jn@Jo@Jp@Jq@Jr@Js@Jt@Ju@Jv@Jw@Jx@Jy@Jz@J{@J|@J}@J~@J@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JJJ�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JAJAJAJAJAJAJAJAAJ
>AJ
> AJ
>   AJ
> AJAJAJAJAJAJAJAJAJAJAJAJAJAJJAJAJAJAJ 
> AJ!AJ"AJ#AJ$AJ%AJ&AJ'AJ(AJ)AJ*AJ+AJ,AJ-AJ.AJ/AJ0AJ1AJ2AJ3AJ4AJ5AJ6AJ7AJ8AJ9AJ:AJ;AJAJ?AJ@AJAAJBAJCAJDAJEAJFAJGAJHAJIAJJAJKAJLAJMAJNAJOAJPAJQAJRAJSAJTAJUAJVAJWAJXAJYAJZAJ[AJ\AJ]AJ^AJ_AJ`AJaAJbAJcAJdAJeAJfAJgAJhAJiAJjAJkAJlAJmAJnAJoAJpAJqAJrAJsAJtAJuAJvAJwAJxAJyAJzAJ{AJ|AJ}AJ~AJAJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJJJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�A14/09/23
>  15:42:09 INFO SparkContext: Job finished: reduce at 
> /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 
> 11.276879779 s
> J�AJ�AJ�AJ�AJ�AJ�AJ�A�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJBJBJBJBJBJBJBJBBJ
>  BJ
> BJ
>   BJ
> BJBJBJBJBJBJBJBJBJBJBJBJBJBJJBJBJBJBJ 
> BJ!BJ"BJ#BJ$BJ%BJ&BJ'BJ(BJ)BJ*BJ+BJ,BJ-BJ.BJ/BJ0BJ1BJ2BJ3BJ4BJ5BJ6BJ7BJ8BJ9BJ:BJ;BJBJ?BJ@Be.
> �]qJ#1a.
> �]qJX4a.
> �]qJX4a.
> �]qJ#1a.
> �]qJX4a.
> �]qJX4a.
> �]qJ#1a.
> �]qJX4a.
> �]qJX4a.
> �]qJa.
> Pi is roughly 3.146136
> {code}
> No idea if that's related, but thought I'd include it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3620) Refactor config option handling code for spark-submit

2014-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146011#comment-14146011
 ] 

Apache Spark commented on SPARK-3620:
-

User 'tigerquoll' has created a pull request for this issue:
https://github.com/apache/spark/pull/2516

> Refactor config option handling code for spark-submit
> -
>
> Key: SPARK-3620
> URL: https://issues.apache.org/jira/browse/SPARK-3620
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Dale Richardson
>Assignee: Dale Richardson
>Priority: Minor
>
> I'm proposing it's time to refactor the configuration argument handling code 
> in spark-submit. The code has grown organically in a short period of time, 
> handles a pretty complicated logic flow, and is now pretty fragile. Some 
> issues that have been identified:
> 1. Hand-crafted property file readers that do not support the property file 
> format as specified in 
> http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
> 2. ResolveURI is not called on paths read from conf/properties files
> 3. Inconsistent means of merging/overriding values from different sources 
> (some are overridden by the file, others by manually setting a field on the 
> object, some by properties)
> 4. Argument validation should be done after combining config files, system 
> properties and command line arguments
> 5. The alternate conf file location is not handled in the shell scripts
> 6. Some options can only be passed as command line arguments
> 7. Defaults for options are hard-coded (and sometimes overridden multiple 
> times) in many places throughout the code, e.g. master = local[*]
> The initial proposal is to use Typesafe Config to read in the configuration 
> and merge the various config sources.
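
As a rough illustration of the layered merge that Typesafe Config enables (a 
sketch only, with made-up keys and values, not the proposed implementation):

{code}
import com.typesafe.config.{Config, ConfigFactory}

object ConfLayeringSketch extends App {
  // Three example layers, lowest priority first (contents are made up).
  val defaults: Config    = ConfigFactory.parseString("""spark.master = "local[*]" """)
  val propsFile: Config   = ConfigFactory.parseString("""spark.executor.memory = 2g""")
  val commandLine: Config = ConfigFactory.parseString("""spark.master = "yarn-client" """)

  // withFallback yields a single merged view in which earlier layers win,
  // so validation and defaulting can happen in one place after the merge.
  val merged = commandLine.withFallback(propsFile).withFallback(defaults).resolve()

  println(merged.getString("spark.master"))           // yarn-client
  println(merged.getString("spark.executor.memory"))  // 2g
}
{code}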



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3676) jdk version lead to spark sql test suite error

2014-09-24 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-3676:
---
Summary: jdk version lead to spark sql test suite error  (was: jdk version 
lead to spark hql test suite error)

> jdk version lead to spark sql test suite error
> --
>
> Key: SPARK-3676
> URL: https://issues.apache.org/jira/browse/SPARK-3676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> System.out.println(1/500d) gives different results in different JDK versions:
> jdk 1.6.0(_31)  0.0020
> jdk 1.7.0(_05)  0.002
> This causes the Spark SQL Hive test suite to fail (reproduce by setting the 
> JDK version to 1.6.0_31):
> [info] - division *** FAILED ***
> [info]   Results do not match for division:
> [info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
> [info]   == Parsed Logical Plan ==
> [info]   Limit 1
> [info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS 
> c_2#694,(1 / COUNT(1)) AS c_3#695]
> [info] UnresolvedRelation None, src, None
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   Limit 1
> [info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
> c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
> DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
> leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   Limit 1
> [info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
> c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
> [info] Project []
> [info]  MetastoreRelation default, src, None
> [info]   
> [info]   == Physical Plan ==
> [info]   Limit 1
> [info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
> c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
> DoubleType)) AS c_3#695]
> [info] Exchange SinglePartition
> [info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
> [info]   HiveTableScan [], (MetastoreRelation default, src, None), None
> [info]   
> [info]   Code Generation: false
> [info]   == RDD ==
> [info]   c_0c_1 c_2 c_3
> [info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
> [info]   !2.0   0.5 0.  0.002   2.0 0.5 
> 0.  0.0020 (HiveComparisonTest.scala:370)
> [info] - timestamp cast #1 *** FAILED ***
> [info]   Results do not match for timestamp cast #1:
> [info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
> [info]   == Parsed Logical Plan ==
> [info]   Limit 1
> [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
> [info] UnresolvedRelation None, src, None
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   Limit 1
> [info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   Limit 1
> [info]Project [0.0010 AS c_0#995]
> [info] MetastoreRelation default, src, None
> [info]   
> [info]   == Physical Plan ==
> [info]   Limit 1
> [info]Project [0.0010 AS c_0#995]
> [info] HiveTableScan [], (MetastoreRelation default, src, None), None
> [info]   
> [info]   Code Generation: false
> [info]   == RDD ==
> [info]   c_0
> [info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
> [info]   !0.001   0.0010 (HiveComparisonTest.scala:370)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3676) jdk version lead to spark hql test suite error

2014-09-24 Thread wangfei (JIRA)
wangfei created SPARK-3676:
--

 Summary: jdk version lead to spark hql test suite error
 Key: SPARK-3676
 URL: https://issues.apache.org/jira/browse/SPARK-3676
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


System.out.println(1/500d) gives different results in different JDK versions:
jdk 1.6.0(_31)  0.0020
jdk 1.7.0(_05)  0.002

This causes the Spark SQL Hive test suite to fail (reproduce by setting the JDK 
version to 1.6.0_31):
[info] - division *** FAILED ***
[info]   Results do not match for division:
[info]   SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1
[info]   == Parsed Logical Plan ==
[info]   Limit 1
[info]Project [(2 / 1) AS c_0#692,(1 / 2) AS c_1#693,(1 / 3) AS c_2#694,(1 
/ COUNT(1)) AS c_3#695]
[info] UnresolvedRelation None, src, None
[info]   
[info]   == Analyzed Logical Plan ==
[info]   Limit 1
[info]Aggregate [], [(CAST(2, DoubleType) / CAST(1, DoubleType)) AS 
c_0#692,(CAST(1, DoubleType) / CAST(2, DoubleType)) AS c_1#693,(CAST(1, 
DoubleType) / CAST(3, DoubleType)) AS c_2#694,(CAST(CAST(1, LongType), Doub
leType) / CAST(COUNT(1), DoubleType)) AS c_3#695]
[info] MetastoreRelation default, src, None
[info]   
[info]   == Optimized Logical Plan ==
[info]   Limit 1
[info]Aggregate [], [2.0 AS c_0#692,0.5 AS c_1#693,0. AS 
c_2#694,(1.0 / CAST(COUNT(1), DoubleType)) AS c_3#695]
[info] Project []
[info]  MetastoreRelation default, src, None
[info]   
[info]   == Physical Plan ==
[info]   Limit 1
[info]Aggregate false, [], [2.0 AS c_0#692,0.5 AS 
c_1#693,0. AS c_2#694,(1.0 / CAST(SUM(PartialCount#699L), 
DoubleType)) AS c_3#695]
[info] Exchange SinglePartition
[info]  Aggregate true, [], [COUNT(1) AS PartialCount#699L]
[info]   HiveTableScan [], (MetastoreRelation default, src, None), None
[info]   
[info]   Code Generation: false
[info]   == RDD ==
[info]   c_0c_1 c_2 c_3
[info]   !== HIVE - 1 row(s) ==  == CATALYST - 1 row(s) ==
[info]   !2.0   0.5 0.  0.002   2.0 0.5 
0.  0.0020 (HiveComparisonTest.scala:370)


[info] - timestamp cast #1 *** FAILED ***
[info]   Results do not match for timestamp cast #1:
[info]   SELECT CAST(CAST(1 AS TIMESTAMP) AS DOUBLE) FROM src LIMIT 1
[info]   == Parsed Logical Plan ==
[info]   Limit 1
[info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
[info] UnresolvedRelation None, src, None
[info]   
[info]   == Analyzed Logical Plan ==
[info]   Limit 1
[info]Project [CAST(CAST(1, TimestampType), DoubleType) AS c_0#995]
[info] MetastoreRelation default, src, None
[info]   
[info]   == Optimized Logical Plan ==
[info]   Limit 1
[info]Project [0.0010 AS c_0#995]
[info] MetastoreRelation default, src, None
[info]   
[info]   == Physical Plan ==
[info]   Limit 1
[info]Project [0.0010 AS c_0#995]
[info] HiveTableScan [], (MetastoreRelation default, src, None), None
[info]   
[info]   Code Generation: false
[info]   == RDD ==
[info]   c_0
[info]   !== HIVE - 1 row(s) ==   == CATALYST - 1 row(s) ==
[info]   !0.001   0.0010 (HiveComparisonTest.scala:370)






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org