[jira] [Commented] (SPARK-3876) Doing a RDD map/reduce within a DStream map fails with a high enough input rate

2014-10-12 Thread Andrei Filip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169019#comment-14169019
 ] 

Andrei Filip commented on SPARK-3876:
-

In a nutshell, the use case aims to parallelize many operations performed on 
the same input, and aggregate the outputs. This suggestion was given to me in 
this discussion: 
http://chat.stackoverflow.com/rooms/61251/discussion-between-smola-and-andrei 
(towards the end)

To be honest, the more fundamental question is whether Spark Streaming is 
actually appropriate for this sort of use case.

> Doing a RDD map/reduce within a DStream map fails with a high enough input 
> rate
> ---
>
> Key: SPARK-3876
> URL: https://issues.apache.org/jira/browse/SPARK-3876
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: Andrei Filip
>
> Having a custom receiver that generates random strings at custom rates: 
> JavaRandomSentenceReceiver
> A class that does work on a received string:
> class LengthGetter implements Serializable {
>     public int getStrLength(String s) {
>         return s.length();
>     }
> }
> The following code:
> List<LengthGetter> objList = Arrays.asList(new LengthGetter(),
>         new LengthGetter(), new LengthGetter());
>
> final JavaRDD<LengthGetter> objRdd = sc.parallelize(objList);
>
> JavaInputDStream<String> sentences =
>         jssc.receiverStream(new JavaRandomSentenceReceiver(frequency));
>
> sentences.map(new Function<String, Integer>() {
>     @Override
>     public Integer call(final String input) throws Exception {
>         Integer res = objRdd.map(new Function<LengthGetter, Integer>() {
>             @Override
>             public Integer call(LengthGetter lg) throws Exception {
>                 return lg.getStrLength(input);
>             }
>         }).reduce(new Function2<Integer, Integer, Integer>() {
>             @Override
>             public Integer call(Integer left, Integer right) throws Exception {
>                 return left + right;
>             }
>         });
>         return res;
>     }
> }).print();
> fails for high enough frequencies with the following stack trace:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3.0:0 failed 1 times, most recent failure: Exception 
> failure in TID 3 on host localhost: java.lang.NullPointerException
> org.apache.spark.rdd.RDD.map(RDD.scala:270)
> org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:72)
> org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:29)
> Other information that might be useful: my current batch duration is set to 
> 1 sec, and the frequencies for JavaRandomSentenceReceiver at which the 
> application fails are as low as 2 Hz (1 Hz, for example, works).
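For readers hitting the same NullPointerException: RDD transformations cannot 
be nested inside another transformation. The function passed to sentences.map() 
is shipped to executors, where the driver-side objRdd reference (and its 
SparkContext) is unavailable, hence the NPE inside RDD.map. A minimal Scala 
sketch of the usual workaround (assuming the goal is just to apply each 
LengthGetter to the input and sum the results, and that sentences is a 
DStream[String]):

{code}
// Hedged sketch, not the reporter's code: keep the per-record work local
// instead of launching a nested RDD job for every streamed element.
val getters = Seq(new LengthGetter, new LengthGetter, new LengthGetter)

sentences.map { input =>
  getters.map(_.getStrLength(input)).sum
}.print()
{code}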






[jira] [Updated] (SPARK-3921) WorkerWatcher in Standalone mode fails to come up due to invalid workerUrl

2014-10-12 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-3921:
--
Description: 
As of [this 
commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153],
 standalone mode appears to have lost its WorkerWatcher, because of the swapped 
workerUrl and appId parameters. We still put workerUrl before appId when we 
start standalone executors, and the Executor misinterprets the appId as the 
workerUrl and fails to create the WorkerWatcher.

Note that this does not seem to crash Standalone executor mode, despite the 
WorkerWatcher failing in its constructor.

  was:As of [this 
commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153],
 standalone mode appears to be broken, because of the swapped workerUrl and 
appId parameters. We still put workerUrl before appId when we start standalone 
executors, and the Executor misinterprets the appId as the workerUrl and fails 
to create the WorkerWatcher.

Summary: WorkerWatcher in Standalone mode fails to come up due to 
invalid workerUrl  (was: Executors in Standalone mode fail to come up due to 
invalid workerUrl)

> WorkerWatcher in Standalone mode fails to come up due to invalid workerUrl
> -
>
> Key: SPARK-3921
> URL: https://issues.apache.org/jira/browse/SPARK-3921
> Project: Spark
>  Issue Type: Bug
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>Priority: Critical
>
> As of [this 
> commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153],
>  standalone mode appears to have lost its WorkerWatcher, because of the 
> swapped workerUrl and appId parameters. We still put workerUrl before appId 
> when we start standalone executors, and the Executor misinterprets the appId 
> as the workerUrl and fails to create the WorkerWatcher.
> Note that this does not seem to crash Standalone executor mode, despite 
> the WorkerWatcher failing in its constructor.
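To make the failure mode concrete, here is a hedged, self-contained Scala 
illustration (argument names and ordering assumed for illustration only, not 
Spark's exact code): the parser expects appId before workerUrl, while the 
launcher still supplies workerUrl first, so URL validation on the misplaced 
argument fails.

{code}
object SwappedArgsSketch {
  // Hypothetical parser: expects (appId, workerUrl), the post-commit order.
  def parse(args: Array[String]): (String, String) = (args(0), args(1))

  def main(args: Array[String]): Unit = {
    // Hypothetical launcher: still passes workerUrl before appId.
    val (appId, workerUrl) = parse(Array("spark://Worker@host:7078", "app-20141012"))
    // workerUrl now holds the appId, so a WorkerWatcher-style URL check fails.
    require(workerUrl.startsWith("spark://"), s"Invalid workerUrl: $workerUrl")
  }
}
{code}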






[jira] [Resolved] (SPARK-3899) wrong links in streaming doc

2014-10-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3899.
---
   Resolution: Fixed
Fix Version/s: 1.1.1
   1.2.0

Issue resolved by pull request 2749
[https://github.com/apache/spark/pull/2749]

> wrong links in streaming doc
> 
>
> Key: SPARK-3899
> URL: https://issues.apache.org/jira/browse/SPARK-3899
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0, 1.1.1
>
>







[jira] [Commented] (SPARK-3854) Scala style: require spaces before `{`

2014-10-12 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169003#comment-14169003
 ] 

Prashant Sharma commented on SPARK-3854:


For this you can use an undocumented feature of scalastyle: 
https://github.com/scalastyle/scalastyle/commit/232d2661f68cdcd97193db366b6dc56fa844ad23.
 It was released in version 0.5, so feel free to upgrade the scalastyle version. 

> Scala style: require spaces before `{`
> --
>
> Key: SPARK-3854
> URL: https://issues.apache.org/jira/browse/SPARK-3854
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Josh Rosen
>
> We should require spaces before opening curly braces.  This isn't in the 
> style guide, but it probably should be:
> {code}
> // Correct:
> if (true) {
>   println("Wow!")
> }
> // Incorrect:
> if (true){
>println("Wow!")
> }
> {code}
> See https://github.com/apache/spark/pull/1658#discussion-diff-18611791 for an 
> example "in the wild."
> {{git grep "){"}} shows only a few occurrences of this style.






[jira] [Commented] (SPARK-3334) Spark causes mesos-master memory leak

2014-10-12 Thread Iven Hsu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168999#comment-14168999
 ] 

Iven Hsu commented on SPARK-3334:
-

With Spark 1.1.0, {{akkaFrameSize}} is the same as in other backends, read 
from configuration. But its minimum value is 32000 and it can't be set to 0, 
so it will still cause mesos-master to leak memory.

Has anyone looked into this?

> Spark causes mesos-master memory leak
> -
>
> Key: SPARK-3334
> URL: https://issues.apache.org/jira/browse/SPARK-3334
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.2
> Environment: Mesos 0.16.0/0.19.0
> CentOS 6.4
>Reporter: Iven Hsu
>
> The {{akkaFrameSize}} is set to {{Long.MaxValue}} in MesosBackend to work 
> around SPARK-1112; this causes every serialized task result to be sent 
> inside a Mesos TaskStatus.
> mesos-master stores TaskStatus in memory, so when running Spark its memory 
> grows very fast and it gets OOM-killed.
> See MESOS-1746 for more.
> I've tried setting {{akkaFrameSize}} to 0; mesos-master won't be killed, but 
> the driver blocks after success unless I use {{sc.stop()}} to quit it 
> manually. Not sure if it's related to SPARK-1112.
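For context, a hedged sketch of the configuration-based limit being discussed 
(key and default as documented for Spark 1.x; not the Mesos backend's actual 
code). Once the frame size comes from configuration instead of being forced to 
{{Long.MaxValue}}, oversized task results can take a path other than the Mesos 
TaskStatus that mesos-master keeps in memory:

{code}
// Hedged sketch: spark.akka.frameSize is specified in MB.
val conf = new org.apache.spark.SparkConf()
val maxFrameSizeBytes = conf.getInt("spark.akka.frameSize", 10) * 1024 * 1024
{code}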






[jira] [Comment Edited] (SPARK-3838) Python code example for Word2Vec in user guide

2014-10-12 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168989#comment-14168989
 ] 

Anant Daksh Asthana edited comment on SPARK-3838 at 10/13/14 6:22 AM:
--

Thanks [~mengxr], I will follow the instructions. I did also mention that the 
coding guides are centered around Java/Scala.


was (Author: slcclimber):
Thanks [~mengxr], I will follow the instructions. I did also mention that the 
coding guides are centered around Java/Scala. It would be nice to create one 
for PySpark which closely follows PEP-8.

> Python code example for Word2Vec in user guide
> --
>
> Key: SPARK-3838
> URL: https://issues.apache.org/jira/browse/SPARK-3838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Anant Daksh Asthana
>Priority: Trivial
>







[jira] [Comment Edited] (SPARK-3838) Python code example for Word2Vec in user guide

2014-10-12 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168989#comment-14168989
 ] 

Anant Daksh Asthana edited comment on SPARK-3838 at 10/13/14 6:22 AM:
--

Thanks [~mengxr], I will follow the instructions. I did also mention that the 
coding guides are centered around Java/Scala. It would be nice to create one 
for PySpark which closely follows PEP-8.


was (Author: slcclimber):
Thanks [~mengxr] I will follow the instructions.

> Python code example for Word2Vec in user guide
> --
>
> Key: SPARK-3838
> URL: https://issues.apache.org/jira/browse/SPARK-3838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Anant Daksh Asthana
>Priority: Trivial
>







[jira] [Commented] (SPARK-3838) Python code example for Word2Vec in user guide

2014-10-12 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168989#comment-14168989
 ] 

Anant Daksh Asthana commented on SPARK-3838:


Thanks [~mengxr] I will follow the instructions.

> Python code example for Word2Vec in user guide
> --
>
> Key: SPARK-3838
> URL: https://issues.apache.org/jira/browse/SPARK-3838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Anant Daksh Asthana
>Priority: Trivial
>







[jira] [Created] (SPARK-3921) Executors in Standalone mode fail to come up due to invalid workerUrl

2014-10-12 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3921:
-

 Summary: Executors in Standalone mode fail to come up due to 
invalid workerUrl
 Key: SPARK-3921
 URL: https://issues.apache.org/jira/browse/SPARK-3921
 Project: Spark
  Issue Type: Bug
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Critical


As of [this 
commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153],
 standalone mode appears to be broken, because of the swapped workerUrl and 
appId parameters. We still put workerUrl before appId when we start standalone 
executors, and the Executor misinterprets the appId as the workerUrl and fails 
to create the WorkerWatcher.






[jira] [Updated] (SPARK-3905) The keys for sorting the columns of Executor page ,Stage page Storage page are incorrect

2014-10-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3905:
--
Fix Version/s: 1.2.0
   1.1.1

> The keys for sorting the columns of Executor page ,Stage page Storage page  
> are incorrect
> -
>
> Key: SPARK-3905
> URL: https://issues.apache.org/jira/browse/SPARK-3905
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.1, 1.2.0
>
>







[jira] [Resolved] (SPARK-3905) The keys for sorting the columns of Executor page ,Stage page Storage page are incorrect

2014-10-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3905.
---
Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/2763

> The keys for sorting the columns of Executor page ,Stage page Storage page  
> are incorrect
> -
>
> Key: SPARK-3905
> URL: https://issues.apache.org/jira/browse/SPARK-3905
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.1, 1.2.0
>
>







[jira] [Updated] (SPARK-3905) The keys for sorting the columns of Executor page ,Stage page Storage page are incorrect

2014-10-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3905:
--
Affects Version/s: 1.0.2
   1.1.0

> The keys for sorting the columns of Executor page ,Stage page Storage page  
> are incorrect
> -
>
> Key: SPARK-3905
> URL: https://issues.apache.org/jira/browse/SPARK-3905
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
> Fix For: 1.1.1, 1.2.0
>
>







[jira] [Commented] (SPARK-3849) Automate remaining Spark Code Style Guide rules

2014-10-12 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168961#comment-14168961
 ] 

Matei Zaharia commented on SPARK-3849:
--

Just to repeat what I said on the mailing list -- I'm against adding such 
changes if they require a large sweep through all the files in the 
project. That slows down development for everyone who has a waiting patch and 
makes backporting patches to old versions of Spark much harder. This is simply 
not worth the slight improvement in code style (the code style is actually 
pretty consistent as is). If you find a way to make it apply only to new code 
when Jenkins tests it, that would be fine.

> Automate remaining Spark Code Style Guide rules
> ---
>
> Key: SPARK-3849
> URL: https://issues.apache.org/jira/browse/SPARK-3849
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Nicholas Chammas
>
> Style problems continue to take up a large amount of review time, mostly 
> because there are many [Spark Code Style 
> Guide|https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide]
>  rules that have not been automated.
> This issue tracks the remaining rules that have not been automated.






[jira] [Commented] (SPARK-3838) Python code example for Word2Vec in user guide

2014-10-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168960#comment-14168960
 ] 

Xiangrui Meng commented on SPARK-3838:
--

[~slcclimber] Thanks! Please follow instructions at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark and 
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide .

> Python code example for Word2Vec in user guide
> --
>
> Key: SPARK-3838
> URL: https://issues.apache.org/jira/browse/SPARK-3838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Anant Daksh Asthana
>Priority: Trivial
>







[jira] [Updated] (SPARK-3838) Python code example for Word2Vec in user guide

2014-10-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3838:
-
Assignee: Anant Daksh Asthana  (was: Liquan Pei)

> Python code example for Word2Vec in user guide
> --
>
> Key: SPARK-3838
> URL: https://issues.apache.org/jira/browse/SPARK-3838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Anant Daksh Asthana
>Priority: Trivial
>







[jira] [Commented] (SPARK-3869) ./bin/spark-class miss Java version with _JAVA_OPTIONS set

2014-10-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168951#comment-14168951
 ] 

Patrick Wendell commented on SPARK-3869:


Hey [~cocoatomo] would you mind updating your JIRA account with a properly 
formatted name (e.g. "First Last")? It is difficult for us to write release 
notes when nicknames are used. If you'd prefer to contribute anonymously, 
it's fine to just make up a name.

> ./bin/spark-class miss Java version with _JAVA_OPTIONS set
> --
>
> Key: SPARK-3869
> URL: https://issues.apache.org/jira/browse/SPARK-3869
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.2.0
> Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20
>Reporter: cocoatomo
>
> When the _JAVA_OPTIONS environment variable is set, the command "java -version" 
> outputs a message like "Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8".
> ./bin/spark-class reads the Java version from the first line of the "java 
> -version" output, so it gets the Java version wrong when _JAVA_OPTIONS is set.
> commit: a85f24accd3266e0f97ee04d03c22b593d99c062
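A hedged sketch of a more robust parse (spark-class itself is a bash script; 
Scala is used here only to illustrate the logic): skip the "Picked up 
_JAVA_OPTIONS" banner and match the line that actually carries the version 
string.

{code}
// Hypothetical illustration: select the version line instead of line 1.
val output = Seq(
  "Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8",
  "java version \"1.8.0_20\""
).mkString("\n")

val version = output.split("\n")
  .collectFirst { case l if l.contains("java version") => l.split('"')(1) }
  .getOrElse(sys.error("could not determine java version"))
// version == "1.8.0_20"
{code}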






[jira] [Resolved] (SPARK-3121) Wrong implementation of implicit bytesWritableConverter

2014-10-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3121.
---
   Resolution: Fixed
Fix Version/s: 1.0.3
   1.1.1
   1.2.0

Issue resolved by pull request 2712
[https://github.com/apache/spark/pull/2712]

> Wrong implementation of implicit bytesWritableConverter
> ---
>
> Key: SPARK-3121
> URL: https://issues.apache.org/jira/browse/SPARK-3121
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Jakub Dubovsky
>Priority: Critical
> Fix For: 1.2.0, 1.1.1, 1.0.3
>
>
> val path = ... // path to a seq file with BytesWritable as the type of both 
> key and value
> val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
> file.take(1)(0)._1
> This prints incorrect content of the byte array: the actual content starts 
> out correct, but some "random" bytes and zeros are appended. BytesWritable 
> has two methods:
> getBytes() - returns the whole internal array, which is often longer than the 
> actual value stored; it usually contains the remains of previous, longer values
> copyBytes() - returns just the beginning of the internal array, as determined 
> by the internal length property
> It looks like the implicit conversion between BytesWritable and Array[Byte] 
> uses getBytes instead of the correct copyBytes.
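A minimal hedged sketch of the safe conversion (java.util.Arrays.copyOfRange 
is used since copyBytes() is absent from older Hadoop versions):

{code}
import org.apache.hadoop.io.BytesWritable

// getBytes may return a backing array padded past getLength; copy only
// the valid prefix instead of exposing the whole internal buffer.
def toByteArray(bw: BytesWritable): Array[Byte] =
  java.util.Arrays.copyOfRange(bw.getBytes, 0, bw.getLength)
{code}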






[jira] [Commented] (SPARK-3897) Scala style: format example code

2014-10-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168949#comment-14168949
 ] 

Patrick Wendell commented on SPARK-3897:


Hey [~sjk] - can you modify your JIRA account to have a properly formatted name 
(e.g. "First Last")? Right now the full name is just "sjk" - this makes it 
difficult for us when we are writing release notes and have to go look up 
people's identities.

> Scala style: format example code
> 
>
> Key: SPARK-3897
> URL: https://issues.apache.org/jira/browse/SPARK-3897
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: sjk
>
> https://github.com/apache/spark/pull/2754






[jira] [Comment Edited] (SPARK-3897) Scala style: format example code

2014-10-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168949#comment-14168949
 ] 

Patrick Wendell edited comment on SPARK-3897 at 10/13/14 5:04 AM:
--

Hey [~shijinkui] - can you modify your JIRA account to have a properly 
formatted name (e.g. "First Last")? Right now the full name is just "sjk" - 
this makes it difficult for us when we are writing release notes and have to go 
look up people's identities.


was (Author: pwendell):
Hey [~shijinkui]] - can you modify your JIRA account to have a properly 
formatted name (e.g. "First Last")? Right now the full name is just "sjk" - 
this makes it difficult for us when we are writing release notes and have to go 
look up people's identities.

> Scala style: format example code
> 
>
> Key: SPARK-3897
> URL: https://issues.apache.org/jira/browse/SPARK-3897
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: sjk
>
> https://github.com/apache/spark/pull/2754






[jira] [Comment Edited] (SPARK-3897) Scala style: format example code

2014-10-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168949#comment-14168949
 ] 

Patrick Wendell edited comment on SPARK-3897 at 10/13/14 5:04 AM:
--

Hey [~shijinkui]] - can you modify your JIRA account to have a properly 
formatted name (e.g. "First Last")? Right now the full name is just "sjk" - 
this makes it difficult for us when we are writing release notes and have to go 
look up people's identities.


was (Author: pwendell):
Hey [~sjk] - can you modify your JIRA account to have a properly formatted name 
(e.g. "First Last")? Right now the full name is just "sjk" - this makes it 
difficult for us when we are writing release notes and have to go look up 
people's identities.

> Scala style: format example code
> 
>
> Key: SPARK-3897
> URL: https://issues.apache.org/jira/browse/SPARK-3897
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: sjk
>
> https://github.com/apache/spark/pull/2754






[jira] [Commented] (SPARK-3431) Parallelize execution of tests

2014-10-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168948#comment-14168948
 ] 

Patrick Wendell commented on SPARK-3431:


If we can get the Maven build times down to be similar to or less than SBT's, 
I'd prefer to use it to run the tests. So looking at parallel test execution 
in Maven would be great.

> Parallelize execution of tests
> --
>
> Key: SPARK-3431
> URL: https://issues.apache.org/jira/browse/SPARK-3431
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>
> Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common 
> strategy to cut test time down is to parallelize the execution of the tests. 
> Doing that may in turn require some prerequisite changes to be made to how 
> certain tests run.






[jira] [Commented] (SPARK-3915) backport 'spark.localExecution.enabled' to 1.0

2014-10-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168947#comment-14168947
 ] 

Patrick Wendell commented on SPARK-3915:


This option won't be relevant to 1.0, I think - we only introduced it in 
SPARK-3029, which was fixed in Spark 1.1. In 1.0 we will allow local execution 
by default.

/cc [~adav] who added this feature.
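For reference, a hedged sketch of the flag introduced by SPARK-3029 in 1.1 
(key name per that issue):

{code}
// In 1.1+, driver-local execution of small jobs is opt-in:
val conf = new org.apache.spark.SparkConf()
  .set("spark.localExecution.enabled", "true")
{code}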

> backport 'spark.localExecution.enabled' to 1.0
> --
>
> Key: SPARK-3915
> URL: https://issues.apache.org/jira/browse/SPARK-3915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Davies Liu
>
> discussion: 
> http://apache-spark-user-list.1001560.n3.nabble.com/where-are-my-python-lambda-functions-run-in-yarn-client-mode-td16059.html
> cc [~pwendell]






[jira] [Updated] (SPARK-3121) Wrong implementation of implicit bytesWritableConverter

2014-10-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3121:
--
Priority: Critical  (was: Minor)

> Wrong implementation of implicit bytesWritableConverter
> ---
>
> Key: SPARK-3121
> URL: https://issues.apache.org/jira/browse/SPARK-3121
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Jakub Dubovsky
>Priority: Critical
>
> val path = ... // path to a seq file with BytesWritable as the type of both 
> key and value
> val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
> file.take(1)(0)._1
> This prints incorrect content of the byte array: the actual content starts 
> out correct, but some "random" bytes and zeros are appended. BytesWritable 
> has two methods:
> getBytes() - returns the whole internal array, which is often longer than the 
> actual value stored; it usually contains the remains of previous, longer values
> copyBytes() - returns just the beginning of the internal array, as determined 
> by the internal length property
> It looks like the implicit conversion between BytesWritable and Array[Byte] 
> uses getBytes instead of the correct copyBytes.






[jira] [Updated] (SPARK-3121) Wrong implementation of implicit bytesWritableConverter

2014-10-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3121:
--
Component/s: Spark Core

> Wrong implementation of implicit bytesWritableConverter
> ---
>
> Key: SPARK-3121
> URL: https://issues.apache.org/jira/browse/SPARK-3121
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Jakub Dubovsky
>Priority: Critical
>
> val path = ... // path to a seq file with BytesWritable as the type of both 
> key and value
> val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
> file.take(1)(0)._1
> This prints incorrect content of the byte array: the actual content starts 
> out correct, but some "random" bytes and zeros are appended. BytesWritable 
> has two methods:
> getBytes() - returns the whole internal array, which is often longer than the 
> actual value stored; it usually contains the remains of previous, longer values
> copyBytes() - returns just the beginning of the internal array, as determined 
> by the internal length property
> It looks like the implicit conversion between BytesWritable and Array[Byte] 
> uses getBytes instead of the correct copyBytes.






[jira] [Updated] (SPARK-3121) Wrong implementation of implicit bytesWritableConverter

2014-10-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3121:
--
Affects Version/s: 1.2.0
   1.1.0

> Wrong implementation of implicit bytesWritableConverter
> ---
>
> Key: SPARK-3121
> URL: https://issues.apache.org/jira/browse/SPARK-3121
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Jakub Dubovsky
>Priority: Minor
>
> val path = ... // path to a seq file with BytesWritable as the type of both 
> key and value
> val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
> file.take(1)(0)._1
> This prints incorrect content of the byte array: the actual content starts 
> out correct, but some "random" bytes and zeros are appended. BytesWritable 
> has two methods:
> getBytes() - returns the whole internal array, which is often longer than the 
> actual value stored; it usually contains the remains of previous, longer values
> copyBytes() - returns just the beginning of the internal array, as determined 
> by the internal length property
> It looks like the implicit conversion between BytesWritable and Array[Byte] 
> uses getBytes instead of the correct copyBytes.






[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-10-12 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168931#comment-14168931
 ] 

Josh Rosen commented on SPARK-922:
--

[~nchammas]: It would be great to include Python 2.7 in the next AMI; I think 
our current AMI shell script has it, though.

[~aedwip]:

{quote}
I have not figured out how to use pssh with yum. yum prompts you y/n before 
downloading
{quote}

Try {{yum install -y}}.

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.0
>Reporter: Josh Rosen
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.






[jira] [Created] (SPARK-3920) Add option to support aggregation using treeAggregate in decision tree

2014-10-12 Thread Qiping Li (JIRA)
Qiping Li created SPARK-3920:


 Summary: Add option to support aggregation using treeAggregate in 
decision tree
 Key: SPARK-3920
 URL: https://issues.apache.org/jira/browse/SPARK-3920
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Qiping Li
 Fix For: 1.2.0


In [SPARK-3366|https://issues.apache.org/jira/browse/SPARK-3366], we used 
distributed aggregation to aggregate node stats, which can save computation and 
communication time when the shuffle size is very large. But experiments have 
shown that if the shuffle size is not large enough (e.g., shallow trees), this 
can cause some performance loss (greater than 20% in some cases). We should 
support both options for aggregation so that users can choose a proper one 
based on their needs.
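A hedged Scala sketch of such an option (helper name invented; in 1.1/1.2 
treeAggregate lives in MLlib's RDDFunctions and moved to the core RDD API 
later):

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.RDDFunctions._

// Hypothetical helper letting the caller pick flat vs. tree aggregation.
def aggregateStats[T: ClassTag, U: ClassTag](rdd: RDD[T], zero: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U, useTree: Boolean): U =
  if (useTree) rdd.treeAggregate(zero)(seqOp, combOp, depth = 2)
  else rdd.aggregate(zero)(seqOp, combOp)
{code}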






[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-10-12 Thread Daniel Erenrich (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168842#comment-14168842
 ] 

Daniel Erenrich commented on SPARK-3080:


I ran into a similar bug. My data scale is smaller (only a few million entries 
on 5 r3.xlarges), but subsampling makes the bug go away, as does lowering the 
number of iterations. My dataset is public, so if needed I could provide a 
test case.

I'll probably look into this when I have some free time.

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Burak Yavuz
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances






[jira] [Resolved] (SPARK-3716) Change partitionStrategy to utilize PartitionStrategy.fromString(_) to match edgeStorageLevel and vertexStorageLevel syntax in Analytics.scala

2014-10-12 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-3716.
---
   Resolution: Fixed
Fix Version/s: 1.1.1
   1.2.0

Issue resolved by pull request 2569
[https://github.com/apache/spark/pull/2569]

> Change partitionStrategy to utilize PartitionStrategy.fromString(_) to match 
> edgeStorageLevel and vertexStorageLevel syntax in Analytics.scala
> --
>
> Key: SPARK-3716
> URL: https://issues.apache.org/jira/browse/SPARK-3716
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.1.0
>Reporter: Benjamin Piering
>Priority: Trivial
> Fix For: 1.2.0, 1.1.1
>
>
> Currently Analytics.scala has its own function which is a copy of the 
> PartitionStrategy object's fromString() method. This can be removed, and 
> PartitionStrategy.fromString(_) can be called in the .map() used to create 
> the partitionStrategy val. This matches the syntax used in the 
> edgeStorageLevel and vertexStorageLevel declarations.
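A hedged sketch of the suggested change (variable and option names assumed to 
mirror the edgeStorageLevel/vertexStorageLevel handling):

{code}
import scala.collection.mutable
import org.apache.spark.graphx.PartitionStrategy

// Map the raw option straight through fromString; no local copy needed.
val options = mutable.Map("partStrategy" -> "EdgePartition2D")
val partitionStrategy: Option[PartitionStrategy] =
  options.remove("partStrategy").map(PartitionStrategy.fromString(_))
{code}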






[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-10-12 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168788#comment-14168788
 ] 

Andrew Davidson commented on SPARK-922:
---

also forgot to mention there are a couple of steps at 
http://nbviewer.ipython.org/gist/JoshRosen/6856670
that are important in the upgrade process

#
# restart spark
#
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh


> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.0
>Reporter: Josh Rosen
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.






[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-10-12 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168787#comment-14168787
 ] 

Andrew Davidson commented on SPARK-922:
---

Wow, upgrading matplotlib was a bear. The following worked for me; the trick 
was getting the correct version of the source code. The recipe below is not 
100% correct: I have not figured out how to use pssh with yum, since yum 
prompts you y/n before downloading.

# Install the Python 2.7 dependencies on the master, then on every slave via pssh.
pip2.7 install six
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install six

pip2.7 install python-dateutil
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install python-dateutil

pip2.7 install pyparsing
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install pyparsing

# Build matplotlib from source against Python 2.7.
yum install yum-utils

wget https://github.com/matplotlib/matplotlib/archive/master.tar.gz
tar -zxvf master.tar.gz
cd matplotlib-master/
yum install freetype-devel
yum install libpng-devel
python2.7 setup.py build
python2.7 setup.py install

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.0
>Reporter: Josh Rosen
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.






[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

2014-10-12 Thread cocoatomo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168786#comment-14168786
 ] 

cocoatomo commented on SPARK-3910:
--

Thank you for the comment.

I am trying it at $SPARK_HOME. (Executing the "./bin/run-tests" command shows this.)
In addition, it is strange that the command
{noformat}
./bin/pyspark python/pyspark/mllib/classification.py
{noformat}
fails with a "numpy ImportError".
So my environment has some trouble (sys.path is suspicious), and at least there 
are some differences between the environments where PySpark runs.

I set up my environment using virtualenvwrapper with Python 2.6.8 (the default 
python executable on Mac OS X 10.9.5).
The ImportError mentioned in this issue occurred in this environment.
For comparison, I tried testing in another environment whose Python version is 
2.7.8, and got the same error.

Is there some difference between our environments?

> ./python/pyspark/mllib/classification.py doctests fails with module name 
> pollution
> --
>
> Key: SPARK-3910
> URL: https://issues.apache.org/jira/browse/SPARK-3910
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
> Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
> argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
> pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
> unittest2==0.5.1, wsgiref==0.1.2
>Reporter: cocoatomo
>  Labels: pyspark, testing
>
> In the ./python/run-tests script, we run the doctests in 
> ./pyspark/mllib/classification.py.
> The output is as follows:
> {noformat}
> $ ./python/run-tests
> ...
> Running test: pyspark/mllib/classification.py
> Traceback (most recent call last):
>   File "pyspark/mllib/classification.py", line 20, in 
> import numpy
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py",
>  line 170, in 
> from . import add_newdocs
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py",
>  line 13, in 
> from numpy.lib import add_newdoc
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py",
>  line 8, in 
> from .type_check import *
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py",
>  line 11, in 
> import numpy.core.numeric as _nx
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py",
>  line 46, in 
> from numpy.testing import Tester
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py",
>  line 13, in 
> from .utils import *
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py",
>  line 15, in 
> from tempfile import mkdtemp
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py",
>  line 34, in 
> from random import Random as _Random
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", 
> line 24, in 
> from pyspark.rdd import RDD
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 
> 51, in 
> from pyspark.context import SparkContext
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 
> 22, in 
> from tempfile import NamedTemporaryFile
> ImportError: cannot import name NamedTemporaryFile
> 0.07 real 0.04 user 0.02 sys
> Had test failures; see logs.
> {noformat}
> The problem is a cyclic import of the tempfile module.
> The cause is that the pyspark.mllib.random module lives in the same directory 
> as the pyspark.mllib.classification module.
> The classification module imports the numpy module, and numpy in turn imports 
> the tempfile module internally.
> Now the first entry of sys.path is the directory "./python/pyspark/mllib" 
> (where the executed file "classification.py" exists), so tempfile imports 
> pyspark.mllib.random (not the standard library "random" module).
> Finally, the import chain reaches tempfile again, and a cyclic import is formed.
> Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
> → (cyclic import!!)
> Furthermore, stat is a standard library module, and the pyspark.mllib.stat 
> module exists. This may also be troublesome.
> commit: 0e8203f4fb721158fb27897680da476174d24c4b
> A fundamental solution is to avoid using module names used by standard 
> libraries (currently "random" and "stat").
> A difficulty of this solution is to rename pyspark.mllib.random and 
> pyspark.mllib.stat.

[jira] [Resolved] (SPARK-3887) ConnectionManager should log remote exception when reporting remote errors

2014-10-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3887.

   Resolution: Fixed
Fix Version/s: 1.2.0

> ConnectionManager should log remote exception when reporting remote errors
> --
>
> Key: SPARK-3887
> URL: https://issues.apache.org/jira/browse/SPARK-3887
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.2.0
>
>
> When reporting that a remote error occurred, the ConnectionManager should 
> also log the stacktrace of the remote exception.  This can be accomplished by 
> sending the remote exception's stacktrace as the payload in the "negative ACK 
> / error message" that's sent by the error-handling code.
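A hedged sketch of rendering the remote stack trace so it can travel as the 
payload of that message (helper name invented; not ConnectionManager's actual 
code):

{code}
import java.io.{PrintWriter, StringWriter}
import java.nio.charset.StandardCharsets

// Serialize the remote exception's stack trace for the error-ack payload.
def stackTracePayload(e: Throwable): Array[Byte] = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw, true))
  sw.toString.getBytes(StandardCharsets.UTF_8)
}
{code}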






[jira] [Comment Edited] (SPARK-3445) Deprecate and later remove YARN alpha support

2014-10-12 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168753#comment-14168753
 ] 

Andrew Or edited comment on SPARK-3445 at 10/12/14 7:17 PM:


Yes, it looks like there are a few API differences across the different alpha 
branches (0.23.* and 2.0.*). I have opened a PR to fix this: 
https://github.com/apache/spark/pull/2776. Thanks for reporting.


was (Author: andrewor14):
Yes I have opened a PR to fix this: https://github.com/apache/spark/pull/2776. 
Thanks for reporting.

> Deprecate and later remove YARN alpha support
> -
>
> Key: SPARK-3445
> URL: https://issues.apache.org/jira/browse/SPARK-3445
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Patrick Wendell
>
> This will depend a bit on both user demand and the commitment level of 
> maintainers, but I'd like to propose the following timeline for yarn-alpha 
> support.
> Spark 1.2: Deprecate YARN-alpha
> Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable)
> Since YARN-alpha is clearly identified as an alpha API, it seems reasonable 
> to drop support for it in a minor release. However, it does depend a bit on 
> whether anyone uses this outside of Yahoo!, and of that I'm not sure. In the 
> past this API has been used and maintained by Yahoo, but they'll be migrating 
> soon to the stable APIs.






[jira] [Commented] (SPARK-3445) Deprecate and later remove YARN alpha support

2014-10-12 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168754#comment-14168754
 ] 

Andrew Or commented on SPARK-3445:
--

Looks like there are a few API differences across the different alpha branches 
(0.23.* and 2.0.*)

> Deprecate and later remove YARN alpha support
> -
>
> Key: SPARK-3445
> URL: https://issues.apache.org/jira/browse/SPARK-3445
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Patrick Wendell
>
> This will depend a bit on both user demand and the commitment level of 
> maintainers, but I'd like to propose the following timeline for yarn-alpha 
> support.
> Spark 1.2: Deprecate YARN-alpha
> Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable)
> Since YARN-alpha is clearly identified as an alpha API, it seems reasonable 
> to drop support for it in a minor release. However, it does depend a bit on 
> whether anyone uses this outside of Yahoo!, and of that I'm not sure. In the 
> past this API has been used and maintained by Yahoo, but they'll be migrating 
> soon to the stable APIs.






[jira] [Issue Comment Deleted] (SPARK-3445) Deprecate and later remove YARN alpha support

2014-10-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3445:
-
Comment: was deleted

(was: Looks like there are a few API differences across the different alpha 
branches (0.23.* and 2.0.*))

> Deprecate and later remove YARN alpha support
> -
>
> Key: SPARK-3445
> URL: https://issues.apache.org/jira/browse/SPARK-3445
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Patrick Wendell
>
> This will depend a bit on both user demand and the commitment level of 
> maintainers, but I'd like to propose the following timeline for yarn-alpha 
> support.
> Spark 1.2: Deprecate YARN-alpha
> Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable)
> Since YARN-alpha is clearly identified as an alpha API, it seems reasonable 
> to drop support for it in a minor release. However, it does depend a bit on 
> whether anyone uses this outside of Yahoo!, and of that I'm not sure. In the 
> past this API has been used and maintained by Yahoo, but they'll be migrating 
> soon to the stable APIs.






[jira] [Comment Edited] (SPARK-3445) Deprecate and later remove YARN alpha support

2014-10-12 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168754#comment-14168754
 ] 

Andrew Or edited comment on SPARK-3445 at 10/12/14 7:17 PM:


Looks like there are a few API differences across the different alpha branches 
(0.23.* and 2.0.*)


was (Author: andrewor14):
Looks like there are a few API differences across the different alpha branches 
(0.23.* and 2.0.*)

> Deprecate and later remove YARN alpha support
> -
>
> Key: SPARK-3445
> URL: https://issues.apache.org/jira/browse/SPARK-3445
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Patrick Wendell
>
> This will depend a bit on both user demand and the commitment level of 
> maintainers, but I'd like to propose the following timeline for yarn-alpha 
> support.
> Spark 1.2: Deprecate YARN-alpha
> Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable)
> Since YARN-alpha is clearly identified as an alpha API, it seems reasonable 
> to drop support for it in a minor release. However, it does depend a bit on 
> whether anyone uses this outside of Yahoo!, and of that I'm not sure. In the 
> past this API has been used and maintained by Yahoo, but they'll be migrating 
> soon to the stable APIs.






[jira] [Commented] (SPARK-3445) Deprecate and later remove YARN alpha support

2014-10-12 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168753#comment-14168753
 ] 

Andrew Or commented on SPARK-3445:
--

Yes I have opened a PR to fix this: https://github.com/apache/spark/pull/2776. 
Thanks for reporting.

> Deprecate and later remove YARN alpha support
> -
>
> Key: SPARK-3445
> URL: https://issues.apache.org/jira/browse/SPARK-3445
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Patrick Wendell
>
> This will depend a bit on both user demand and the commitment level of 
> maintainers, but I'd like to propose the following timeline for yarn-alpha 
> support.
> Spark 1.2: Deprecate YARN-alpha
> Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable)
> Since YARN-alpha is clearly identified as an alpha API, it seems reasonable 
> to drop support for it in a minor release. However, it does depend a bit on 
> whether anyone uses this outside of Yahoo!, and of that I'm not sure. In the 
> past this API has been used and maintained by Yahoo, but they'll be migrating 
> soon to the stable APIs.






[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-10-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168727#comment-14168727
 ] 

Patrick Wendell commented on SPARK-2593:


Yeah, for Spark Streaming the API visibility is not an issue because we are 
explicitly exposing Akka as an API (and it's an add-on connector). I still 
don't quite understand whether Akka's model is that different applications are 
supposed to share actor systems. We rely heavily on customization of the Akka 
configuration, and if a user passes in an ActorSystem it will break 
assumptions inside of Spark - so I think allowing users to pass in an 
ActorSystem is going to be difficult. Exposing our ActorSystem in Spark 
Streaming seems more reasonable, though, since the configurations are 
immutable at that point. Other things, like providing better naming, make a 
lot of sense.
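
For concreteness, a minimal sketch of the two-actor-system situation under 
discussion; the constructor overload that accepts an ActorSystem is 
hypothetical and does not exist in Spark:

{code}
// Hedged sketch -- the ActorSystem-accepting overload below is hypothetical.
import akka.actor.ActorSystem
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SharedActorSystemSketch {
  def main(args: Array[String]): Unit = {
    // The application's own actor system, created at load time.
    val appSystem = ActorSystem("my-akka-app")

    val conf = new SparkConf().setAppName("shared-system").setMaster("local[2]")

    // Today: StreamingContext creates Spark's own, separately configured
    // ActorSystem internally, so two actor systems run on the same node.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Requested (hypothetical signature): reuse the existing system instead.
    // val ssc = new StreamingContext(conf, Seconds(1), appSystem)

    ssc.stop()
    appSystem.shutdown()
  }
}
{code}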

> Add ability to pass an existing Akka ActorSystem into Spark
> ---
>
> Key: SPARK-2593
> URL: https://issues.apache.org/jira/browse/SPARK-2593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Helena Edelson
>
> As a developer, I want to pass an existing ActorSystem into StreamingContext 
> at load time so that I do not have two actor systems running on a node in an 
> Akka application.
> This would mean putting Spark's actor system on its own named dispatchers, as 
> well as exposing the currently private creation of its own actor system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2075) Anonymous classes are missing from Spark distribution

2014-10-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2075:
---
Summary: Anonymous classes are missing from Spark distribution  (was: 
Hadoop1 distribution of 1.0.0 does not contain classes expected by the Maven 
1.0.0 artifact)

> Anonymous classes are missing from Spark distribution
> -
>
> Key: SPARK-2075
> URL: https://issues.apache.org/jira/browse/SPARK-2075
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core
>Affects Versions: 1.0.0
>Reporter: Paul R. Brown
>Priority: Critical
> Fix For: 1.0.1
>
>
> Running a job built against the Maven dep for 1.0.0 and the hadoop1 
> distribution produces:
> {code}
> java.lang.ClassNotFoundException:
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1
> {code}
> Here's what's in the Maven dep as of 1.0.0:
> {code}
> jar tvf 
> ~/.m2/repository/org/apache/spark/spark-core_2.10/1.0.0/spark-core_2.10-1.0.0.jar
>  | grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 13:57:58 PDT 2014 
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class
> {code}
> And here's what's in the hadoop1 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop1.0.4.jar| grep 'rdd/RDD' | grep 'saveAs'
> {code}
> I.e., it's not there.  It is in the hadoop2 distribution:
> {code}
> jar tvf spark-assembly-1.0.0-hadoop2.2.0.jar| grep 'rdd/RDD' | grep 'saveAs'
>   1519 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$1.class
>   1560 Mon May 26 07:29:54 PDT 2014 
> org/apache/spark/rdd/RDD$$anonfun$saveAsTextFile$2.class
> {code}
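
For what it's worth, a minimal sketch of the kind of job that trips over the 
missing class (a hypothetical reproduction, not taken from the original 
report):

{code}
// Hedged sketch: a job compiled against the 1.0.0 Maven artifact that calls
// saveAsTextFile references RDD$$anonfun$saveAsTextFile$N.class at runtime;
// if the assembly on the cluster lacks those classes, the job fails with the
// ClassNotFoundException shown above.
import org.apache.spark.{SparkConf, SparkContext}

object SaveAsTextFileRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repro").setMaster("local[2]"))
    sc.parallelize(1 to 10).saveAsTextFile("/tmp/saveAsTextFile-repro")
    sc.stop()
  }
}
{code}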



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3919) HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure

2014-10-12 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168664#comment-14168664
 ] 

Cheng Lian commented on SPARK-3919:
---

[~pwendell] Hive Metastore schema verification requires the version string to 
be exactly the same (except for the {{-SNAPSHOT}} suffix).

> HiveThriftServer2 fails to start because of Hive 0.12 metastore schema 
> verification failure
> ---
>
> Key: SPARK-3919
> URL: https://issues.apache.org/jira/browse/SPARK-3919
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
>
> When using MySQL backed Metastore with {{hive.metastore.schema.verification}} 
> set to {{true}}, HiveThriftServer2 fails to start:
> {code}
> 14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2
> org.apache.hive.service.ServiceException: Failed to Start HiveServer2
>   at 
> org.apache.hive.service.CompositeService.start(CompositeService.java:80)
>   at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.hive.service.ServiceException: Unable to connect to 
> MetaStore!
>   at org.apache.hive.service.cli.CLIService.start(CLIService.java:85)
>   at 
> org.apache.hive.service.CompositeService.start(CompositeService.java:70)
>   ... 10 more
> Caused by: MetaException(message:Hive Schema version 0.12.0-protobuf-2.5 does 
> not match metastore's schema version 0.12.0 Metastore is not upgraded or 
> corrupt)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5651)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:5622)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
>   at com.sun.proxy.$Proxy11.verifySchema(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:403)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:286)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:54)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:121)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:104)
>   at org.apache.hive.service.cli.CLIService.start(CLIService.java:82)
>   ... 11 more
> {code}
> It seems that recent Akka/Protobuf dependency changes are related to this.
> A valid workaround is to set {{hive.metastore.schema.verification}} to 
> {{false}}.
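
A minimal sketch of applying the workaround from application code (for 
HiveThriftServer2 itself, the property would normally go into hive-site.xml 
instead):

{code}
// Hedged sketch of the workaround; object and app names are illustrative.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SchemaVerificationWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("schema-verification-workaround").setMaster("local[2]"))
    val hive = new HiveContext(sc)
    // Disable Metastore schema verification so the "0.12.0-protobuf-2.5"
    // vs. "0.12.0" version-string mismatch no longer aborts startup.
    hive.setConf("hive.metastore.schema.verification", "false")
    sc.stop()
  }
}
{code}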



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3919) HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure

2014-10-12 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3919:
--
Description: 
When using MySQL backed Metastore with {{hive.metastore.schema.verification}} 
set to {{true}}, HiveThriftServer2 fails to start:
{code}
14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2
org.apache.hive.service.ServiceException: Failed to Start HiveServer2
at 
org.apache.hive.service.CompositeService.start(CompositeService.java:80)
at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hive.service.ServiceException: Unable to connect to 
MetaStore!
at org.apache.hive.service.cli.CLIService.start(CLIService.java:85)
at 
org.apache.hive.service.CompositeService.start(CompositeService.java:70)
... 10 more
Caused by: MetaException(message:Hive Schema version 0.12.0-protobuf-2.5 does 
not match metastore's schema version 0.12.0 Metastore is not upgraded or 
corrupt)
at 
org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5651)
at 
org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:5622)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
at com.sun.proxy.$Proxy11.verifySchema(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:403)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:286)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:54)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:121)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:104)
at org.apache.hive.service.cli.CLIService.start(CLIService.java:82)
... 11 more
{code}
It seems that recent Akka/Protobuf dependency changes are related to this.

A valid workaround is to set {{hive.metastore.schema.verification}} to 
{{false}}.

  was:
When using MySQL backed Metastore with {{hive.metastore.schema.verification}} 
set to {{true}}, HiveThriftServer2 fails to start:
{code}
14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2
org.apache.hive.service.ServiceException: Failed to Start HiveServer2
at 
org.apache.hive.service.CompositeService.start(CompositeService.java:80)
at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hive.service.ServiceException: Unable to connect to 
MetaStore!
at org.apache.hive.service.cli.CLIService.start(CLIService.java:85)

[jira] [Created] (SPARK-3919) HiveThriftServer2 fails to start because of Hive 0.12 metastore schema verification failure

2014-10-12 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3919:
-

 Summary: HiveThriftServer2 fails to start because of Hive 0.12 
metastore schema verification failure
 Key: SPARK-3919
 URL: https://issues.apache.org/jira/browse/SPARK-3919
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Cheng Lian


When using MySQL backed Metastore with {{hive.metastore.schema.verification}} 
set to {{true}}, HiveThriftServer2 fails to start:
{code}
14/10/12 17:05:01 ERROR HiveThriftServer2: Error starting HiveThriftServer2
org.apache.hive.service.ServiceException: Failed to Start HiveServer2
at 
org.apache.hive.service.CompositeService.start(CompositeService.java:80)
at org.apache.hive.service.server.HiveServer2.start(HiveServer2.java:73)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:84)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:335)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hive.service.ServiceException: Unable to connect to 
MetaStore!
at org.apache.hive.service.cli.CLIService.start(CLIService.java:85)
at 
org.apache.hive.service.CompositeService.start(CompositeService.java:70)
... 10 more
Caused by: MetaException(message:Hive Schema version 0.12.0-protobuf-2.5 does 
not match metastore's schema version 0.12.0 Metastore is not upgraded or 
corrupt)
at 
org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5651)
at 
org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:5622)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingRawStore.invoke(RetryingRawStore.java:124)
at com.sun.proxy.$Proxy11.verifySchema(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:403)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:441)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:326)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:286)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:54)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4060)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:121)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:104)
at org.apache.hive.service.cli.CLIService.start(CLIService.java:82)
... 11 more
{code}
It seems that recent Akka/Protobuf dependency changes are related to this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3918) Forget Unpersist in RandomForest.scala(train Method)

2014-10-12 Thread junlong (JIRA)
junlong created SPARK-3918:
--

 Summary: Forget Unpersist in RandomForest.scala(train Method)
 Key: SPARK-3918
 URL: https://issues.apache.org/jira/browse/SPARK-3918
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
 Environment: All
Reporter: junlong
 Fix For: 1.1.0


   In version 1.1.0, the train method in DecisionTree.scala persists treeInput 
in memory but never unpersists it, which causes heavy disk usage.
   In the current GitHub version (1.2.0, presumably), the train method in 
RandomForest.scala likewise persists baggedInput without ever unpersisting it.
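
A minimal sketch of the persist/unpersist pattern the report asks for (names 
here are illustrative, not the actual MLlib internals):

{code}
// Hedged sketch: cache a derived training RDD, reuse it across passes, then
// release it instead of leaving its blocks pinned to memory/disk.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unpersist-sketch").setMaster("local[2]"))

    // Stand-in for treeInput / baggedInput: cached because training makes
    // many passes over it.
    val baggedInput = sc.parallelize(1 to 1000000)
      .map(x => (x % 10, x))
      .persist(StorageLevel.MEMORY_AND_DISK)

    (1 to 5).foreach(_ => baggedInput.count())  // repeated passes hit the cache

    baggedInput.unpersist()  // the step the report says is missing
    sc.stop()
  }
}
{code}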



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3917) Compress data before network transfer

2014-10-12 Thread junlong (JIRA)
junlong created SPARK-3917:
--

 Summary: Compress data before network transfer
 Key: SPARK-3917
 URL: https://issues.apache.org/jira/browse/SPARK-3917
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: All
Reporter: junlong
Priority: Critical
 Fix For: 1.1.0


When training a Gradient Boosted Decision Tree on large, sparse data, heavy 
network traffic pulls down the CPU utilization ratio, and compressing the data 
sent over the network reduced its volume by about 90%.
    So compressing data before transfer may provide a higher speedup in Spark, 
and the user could configure whether or not to compress.
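
For context, Spark already exposes compression knobs for several network 
paths; a hedged sketch of that existing configuration follows (whether these 
settings cover the traffic described here is exactly the open question):

{code}
// Hedged sketch: existing Spark compression settings (circa 1.1); the app
// name is illustrative.
import org.apache.spark.SparkConf

object CompressionConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("compression-config-sketch")
      .set("spark.shuffle.compress", "true")   // compress shuffle outputs sent over the network
      .set("spark.broadcast.compress", "true") // compress broadcast variables
      .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
    println(conf.toDebugString)
  }
}
{code}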





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-10-12 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168590#comment-14168590
 ] 

sam commented on SPARK-1473:


[~torito1984] Thank you for the response, and apologies for my delay in 
responding.

Yes, the problems of trying to estimate probabilities when independence 
assumptions are not made do make it necessary to consider some features 
independent.  My question is *how* we should do this. Is there any literature 
that has attempted to **formalize the way we introduce independence** in 
*information theoretic* terms?  Moreover, I see this problem, and feature 
selection in general, as tightly coupled with the way probability estimation 
is performed.

Suppose in the simplest case we wish to decide whether features F_1 and F_2 are 
dependent (we could consider arbitrary conjunctions too). Then the Information 
Theorist would want to consider the Mutual Information, i.e. the KL between the 
joint and product of marginals:

KL( p(F_1, F_2) || p(F_1) * p(F_2) )

Then use a threshold or rank on feature pairs to determine whether to consider 
them dependent. 
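
Written out (the standard definition, added here only for reference), the 
quantity above is the mutual information between the two features:

{code}
I(F_1; F_2) = \mathrm{KL}\big( p(F_1, F_2) \,\|\, p(F_1)\, p(F_2) \big)
            = \sum_{f_1, f_2} p(f_1, f_2) \log \frac{p(f_1, f_2)}{p(f_1)\, p(f_2)}
{code}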

Now this is where we are tightly coupled with the means by which we estimate 
the probabilities p(F_1, F_2), p(F_1) and p(F_2).  We could use Maximum 
Likelihood with Laplace Smoothing, MAP / Regularization, etc., or the much 
lesser-known Carnap's Continuum of Inductive Methods.  Which method we choose, 
along with the usual arbitrary choice of some constant (e.g. alpha in 
Laplace/Additive Smoothing), will determine p(F_1, F_2), p(F_1) and p(F_2), and 
therefore determine whether or not F_1 & F_2 are to be considered dependent.

The current practice in Machine Learning has been to choose a method of 
estimation based on cross-validation results rather than some deep 
philosophical justification.  The work of Prof. Jeff Paris and his colleagues 
is the only work I've seen that attempts to use Information Theoretic 
principles to estimate probabilities.  Unfortunately, that work is a little 
incomplete with regard to practical application.

To summarize: although I like the paper, especially its principled approach 
(vs. the "just test and see" approach commonly seen in Data Science), how 
independence is to be assumed (to solve the exponential sparsity problem) is 
left arbitrary, as is the choice of probability estimation, and therefore the 
approach is not fully principled nor fully foundational.

Please do not interpret this comment as a rejection or attack on the paper; 
rather, I consider it a little incomplete and was hoping someone had found a 
line of research more successful than my own to fill in the gaps.

> Feature selection for high dimensional datasets
> ---
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ignacio Zendejas
>Assignee: Alexander Ulanov
>Priority: Minor
>  Labels: features
>
> For classification tasks involving large feature spaces in the order of tens 
> of thousands or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank and filter features that are irrelevant thereby 
> reducing the feature space by at least one or two orders of magnitude without 
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A feature evaluation interface which is flexible needs to be designed and at 
> least two methods should be implemented with Information Gain being a 
> priority as it has been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below) which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
> * Forman, George. "An extensive empirical study of feature selection metrics 
> for text classification." The Journal of machine learning research 3 (2003): 
> 1289-1305.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3445) Deprecate and later remove YARN alpha support

2014-10-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168577#comment-14168577
 ] 

Sean Owen commented on SPARK-3445:
--

YARN alpha is not deprecated or removed yet; this JIRA is not resolved. Even if 
it were deprecated, it should still compile and work. This is an error 
introduced by a recent change. [~andrewor14] [~andrewor] Can you have a look? 
This line looks like it was introduced in 
https://github.com/apache/spark/commit/c4022dd52b4827323ff956632dc7623f546da937 
/ SPARK-3477

> Deprecate and later remove YARN alpha support
> -
>
> Key: SPARK-3445
> URL: https://issues.apache.org/jira/browse/SPARK-3445
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Patrick Wendell
>
> This will depend a bit on both user demand and the commitment level of 
> maintainers, but I'd like to propose the following timeline for yarn-alpha 
> support.
> Spark 1.2: Deprecate YARN-alpha
> Spark 1.3: Remove YARN-alpha (i.e. require YARN-stable)
> Since YARN-alpha is clearly identified as an alpha API, it seems reasonable 
> to drop support for it in a minor release. However, it does depend a bit 
> whether anyone uses this outside of Yahoo!, and that I'm not sure of. In the 
> past this API has been used and maintained by Yahoo, but they'll be migrating 
> soon to the stable API's.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org