[jira] [Updated] (SPARK-2469) Lower shuffle compression buffer memory usage

2014-07-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2469:
---

Summary: Lower shuffle compression buffer memory usage  (was: Lower shuffle 
compression memory usage)

> Lower shuffle compression buffer memory usage
> -
>
> Key: SPARK-2469
> URL: https://issues.apache.org/jira/browse/SPARK-2469
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>






[jira] [Created] (SPARK-2469) Lower shuffle compression memory usage

2014-07-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2469:
--

 Summary: Lower shuffle compression memory usage
 Key: SPARK-2469
 URL: https://issues.apache.org/jira/browse/SPARK-2469
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin








[jira] [Commented] (SPARK-2467) Revert SparkBuild to publish-local to both .m2 and .ivy2.

2014-07-13 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060381#comment-14060381
 ] 

Takuya Ueshin commented on SPARK-2467:
--

PRed: https://github.com/apache/spark/pull/1398

> Revert SparkBuild to publish-local to both .m2 and .ivy2.
> -
>
> Key: SPARK-2467
> URL: https://issues.apache.org/jira/browse/SPARK-2467
> Project: Spark
>  Issue Type: Bug
>Reporter: Takuya Ueshin
>






[jira] [Created] (SPARK-2468) zero-copy shuffle network communication

2014-07-13 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2468:
--

 Summary: zero-copy shuffle network communication
 Key: SPARK-2468
 URL: https://issues.apache.org/jira/browse/SPARK-2468
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical


Right now shuffle send goes through the block manager. This is inefficient 
because it requires loading a block from disk into a kernel buffer, then into a 
user-space buffer, and then back into a kernel send buffer before it reaches the 
NIC. It makes multiple copies of the data and context switches between kernel 
and user space. It also creates unnecessary buffers in the JVM that increase GC pressure.

Instead, we should use FileChannel.transferTo, which handles this in the kernel 
space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/

One potential solution is to use Netty NIO.
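A minimal sketch of the transferTo idea in Scala, outside of any Spark wiring; the file path, host and port here are made up for illustration:

{code}
import java.io.{File, FileInputStream}
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

// Hypothetical shuffle file and destination, for illustration only.
val file = new File("/tmp/shuffle_0_0_0.data")
val fileChannel = new FileInputStream(file).getChannel
val socket = SocketChannel.open(new InetSocketAddress("remote-host", 9999))

// transferTo asks the kernel to move bytes from the file straight to the
// socket, skipping the intermediate user-space (JVM) buffer.
var position = 0L
val count = file.length()
while (position < count) {
  position += fileChannel.transferTo(position, count - position, socket)
}

fileChannel.close()
socket.close()
{code}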








[jira] [Updated] (SPARK-2468) zero-copy shuffle network communication

2014-07-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2468:
---

Component/s: Shuffle

> zero-copy shuffle network communication
> ---
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user-space buffer, and then back into a kernel send buffer before it reaches 
> the NIC. It makes multiple copies of the data and context switches between 
> kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty NIO.





[jira] [Created] (SPARK-2467) Revert SparkBuild to publish-local to both .m2 and .ivy2.

2014-07-13 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2467:


 Summary: Revert SparkBuild to publish-local to both .m2 and .ivy2.
 Key: SPARK-2467
 URL: https://issues.apache.org/jira/browse/SPARK-2467
 Project: Spark
  Issue Type: Bug
Reporter: Takuya Ueshin








[jira] [Commented] (SPARK-2382) build error:

2014-07-13 Thread Mukul Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060374#comment-14060374
 ] 

Mukul Jain commented on SPARK-2382:
---

BTW, I do believe that this is not a Spark issue, but I wanted to see if the
Spark documentation could be improved in some way to help work through this issue,
or better yet avoid it altogether.

Mukul





> build error: 
> -
>
> Key: SPARK-2382
> URL: https://issues.apache.org/jira/browse/SPARK-2382
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 1.0.0
> Environment: Ubuntu 12.0.4 precise. 
> spark@ubuntu-cdh5-spark:~/spark-1.0.0$ mvn -version
> Apache Maven 3.0.4
> Maven home: /usr/share/maven
> Java version: 1.6.0_31, vendor: Sun Microsystems Inc.
> Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "3.11.0-15-generic", arch: "amd64", family: "unix"
>Reporter: Mukul Jain
>  Labels: newbie
>
> Unable to build: Maven can't download a dependency. I checked my http_proxy and 
> https_proxy settings and they are working fine; other HTTP and HTTPS dependencies 
> were downloaded fine. The build process always gets stuck at this repository, and 
> manually downloading also fails with an exception. 
> [INFO] 
> 
> [INFO] Building Spark Project External MQTT 1.0.0
> [INFO] 
> 
> Downloading: 
> https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
> Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
> executeWithRetry
> INFO: I/O exception (java.net.ConnectException) caught when processing 
> request: Connection timed out
> Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
> executeWithRetry
> INFO: Retrying request





[jira] [Created] (SPARK-2466) Got two different block manager registrations

2014-07-13 Thread Alex Gaudio (JIRA)
Alex Gaudio created SPARK-2466:
--

 Summary: Got two different block manager registrations
 Key: SPARK-2466
 URL: https://issues.apache.org/jira/browse/SPARK-2466
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Mesos
Affects Versions: 1.0.0
 Environment: Mesos 0.19.0
Spark 1.0.0
(Ubuntu 14.04 LTS)
Reporter: Alex Gaudio


On PySpark and SparkR (haven't tried with Scala Spark) running on our Mesos 
cluster, we get the following error, which causes spark to fail.

```
ERROR BlockManagerMasterActor: Got two different block manager registrations on 
20140627-192758-654448812-5050-31629-42
```

We believe this is because tasks between two different stages may share the 
same task id if they run within the same second.  As a temporary workaround, we 
are adding a second of space between executions of lazily evaluated spark code. 
 This appears to solve the problem.

We don't see this issue running spark in local mode.
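A minimal sketch of that workaround in Scala (sc is an existing SparkContext; the jobs and the one-second delay are placeholders):

{code}
// Space out separately triggered jobs so that tasks from different stages
// are not launched within the same second.
val counts = sc.parallelize(1 to 1000).map(x => (x % 10, 1)).reduceByKey(_ + _)

counts.count()       // first job
Thread.sleep(1000L)  // temporary workaround: one second of spacing
counts.collect()     // second job
{code}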





[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2014-07-13 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060300#comment-14060300
 ] 

Xiangrui Meng commented on SPARK-2465:
--

[~sowen] The ALS implementation shuffles data for each iteration. I tested ALS 
on the 100x Amazon Reviews dataset. Each iteration shuffles about 200GB of data 
(see the screenshot attached). If we switch to Long, ALS will definitely slow down. 
On the other hand, having a few hash collisions may not be a serious problem. 
That is essentially random dimensionality reduction, and it also densifies the 
data, which helps ALS. We can estimate how many users/products we can handle if 
we allow a 0.1% collision rate (should be a couple million) and discuss the 
trade-offs further.
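A back-of-envelope sketch of that estimate (my own arithmetic, not taken from the attached test):

{code}
// Expected fraction of IDs involved in a hash collision when n distinct IDs
// are hashed uniformly into 2^bits values: roughly n / 2^bits.
def collisionFraction(n: Double, bits: Int): Double = n / math.pow(2.0, bits)

collisionFraction(4e6, 32)  // ~0.1% for a few million IDs with 32-bit hashes
collisionFraction(4e6, 64)  // effectively zero with 64-bit hashes
{code}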

> Use long as user / item ID for ALS
> --
>
> Key: SPARK-2465
> URL: https://issues.apache.org/jira/browse/SPARK-2465
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Priority: Minor
> Attachments: Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user 
> and product IDs in the ALS implementation.
> The main reason for this is that identifiers are not generally numeric at all, and 
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
> means collisions are likely after hundreds of thousands of users and items, 
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int 
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per 
> Rating.
> Thoughts? I will post a PR so as to show what the change would be.





[jira] [Updated] (SPARK-2465) Use long as user / item ID for ALS

2014-07-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2465:
-

Attachment: Screen Shot 2014-07-13 at 8.49.40 PM.png

> Use long as user / item ID for ALS
> --
>
> Key: SPARK-2465
> URL: https://issues.apache.org/jira/browse/SPARK-2465
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Priority: Minor
> Attachments: Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user 
> and product IDs in the ALS implementation.
> The main reason for this is that identifiers are not generally numeric at all, and 
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
> means collisions are likely after hundreds of thousands of users and items, 
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int 
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per 
> Rating.
> Thoughts? I will post a PR so as to show what the change would be.





[jira] [Commented] (SPARK-953) Latent Dirichlet Association (LDA model)

2014-07-13 Thread Masaki Rikitoku (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060291#comment-14060291
 ] 

Masaki Rikitoku commented on SPARK-953:
---

Latent Dirichlet Allocation?

> Latent Dirichlet Association (LDA model)
> 
>
> Key: SPARK-953
> URL: https://issues.apache.org/jira/browse/SPARK-953
> Project: Spark
>  Issue Type: Story
>  Components: Examples
>Affects Versions: 0.7.3
>Reporter: caizhua
>Priority: Critical
>
> This code is for learning the LDA model. However, with an input of 2.5 M 
> documents per machine and a dictionary with 1 words, running on EC2 
> m2.4xlarge instances with 68 G of memory per machine, it is really, really 
> slow: for five iterations, the times are 8145, 24725, 51688, 58674, and 56850 
> seconds. The time spent shuffling is the slow part. LDA.tbl is the simulated 
> data set for the program, and it is quite fast.





[jira] [Resolved] (SPARK-2363) Clean MLlib's sample data files

2014-07-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2363.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1394
[https://github.com/apache/spark/pull/1394]

> Clean MLlib's sample data files
> ---
>
> Key: SPARK-2363
> URL: https://issues.apache.org/jira/browse/SPARK-2363
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
> Fix For: 1.1.0
>
>
> MLlib has sample data under several folders:
> 1) data/mllib
> 2) data/
> 3) mllib/data/*
> Per previous discussion with [~matei], we want to put them under `data/mllib` 
> and clean outdated files.





[jira] [Updated] (SPARK-2363) Clean MLlib's sample data files

2014-07-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2363:
-

Assignee: Sean Owen

> Clean MLlib's sample data files
> ---
>
> Key: SPARK-2363
> URL: https://issues.apache.org/jira/browse/SPARK-2363
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> MLlib has sample data under several folders:
> 1) data/mllib
> 2) data/
> 3) mllib/data/*
> Per previous discussion with [~matei], we want to put them under `data/mllib` 
> and clean outdated files.





[jira] [Commented] (SPARK-2354) BitSet Range Expanded when creating new one

2014-07-13 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060279#comment-14060279
 ] 

Yijie Shen commented on SPARK-2354:
---

For the methods currently available in BitSet, they have the same effect.

But what if I want to implement a `complement` or `xnor` method? Since the 
iterator's hasNext method only checks that nextSetBit returns an index >= 0, 
iterating over the complemented bitset would return indexes out of range, 
between numBits and the capacity.

> BitSet Range Expanded when creating new one
> ---
>
> Key: SPARK-2354
> URL: https://issues.apache.org/jira/browse/SPARK-2354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Yijie Shen
>Priority: Minor
>
> BitSet has a constructor parameter named "numBits: Int" that indicates the 
> number of bits inside.
> There is also a function called "capacity" which represents the number of long 
> words used to hold the bits.
> When creating a new BitSet, for example in '|', I thought the newly created one 
> shouldn't be sized by the longer word length; instead, it should use the 
> longer set's number of bits:
> {code}def |(other: BitSet): BitSet = {
> val newBS = new BitSet(math.max(numBits, other.numBits)) 
> // I know by now the numBits isn't a field
> {code}
> Is there some other reason to expand the BitSet range that I don't know about?





[jira] [Commented] (SPARK-2278) groupBy & groupByKey should support custom comparator

2014-07-13 Thread Hans Uhlig (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060267#comment-14060267
 ] 

Hans Uhlig commented on SPARK-2278:
---

So I just checked with the current 1.0.0 API, and JavaPairRDD implements the 
following (there was no sortBy that I could find):

JavaPairRDD<K, V> JavaPairRDD.sortByKey()
JavaPairRDD<K, V> JavaPairRDD.sortByKey(Comparator<K> comp)
JavaPairRDD<K, V> JavaPairRDD.sortByKey(boolean ascending)
JavaPairRDD<K, V> JavaPairRDD.sortByKey(Comparator<K> comp, boolean ascending)
JavaPairRDD<K, V> JavaPairRDD.sortByKey(Comparator<K> comp, boolean ascending, int 
numPartitions)
JavaPairRDD<K, Iterable<T>> JavaRDD.groupBy(Function<T, K> f)
JavaPairRDD<K, Iterable<T>> JavaRDD.groupBy(Function<T, K> f, int numPartitions)
JavaPairRDD<K, Iterable<V>> JavaPairRDD.groupByKey()
JavaPairRDD<K, Iterable<V>> JavaPairRDD.groupByKey(Partitioner partitioner)
JavaPairRDD<K, Iterable<V>> JavaPairRDD.groupByKey(int numPartitions)

The base, non-implied-parameter functions should provide the following 
interfaces for optimum control and flexibility:

JavaRDD<T> JavaRDD.sortBy(Comparator<T> comp, boolean ascending, Partitioner 
partitioner, int numPartitions)

JavaPairRDD<K, V> JavaPairRDD.sortByKey(Comparator<K> comp, boolean ascending, 
Partitioner partitioner, int numPartitions)

JavaPairRDD<K, Iterable<T>> JavaRDD.groupBy(Function<T, K> func, Comparator<K> comp, 
boolean ascending, Partitioner partitioner, int numPartitions)

JavaPairRDD<K, Iterable<V>> JavaPairRDD.groupByKey(Function func, Comparator<K> comp, 
boolean ascending, Partitioner partitioner, int numPartitions)

GroupByKey's function reference should look something like "Iterable<V> 
Function(K key, Iterable<V> values)".

Unless there is a different function to do that particular job that I am 
missing. The lack of descriptions of what the inputs and outputs of the 
function references should do makes that a bit difficult to discern sometimes.



> groupBy & groupByKey should support custom comparator
> -
>
> Key: SPARK-2278
> URL: https://issues.apache.org/jira/browse/SPARK-2278
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Affects Versions: 1.0.0
>Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key 
> equality function in groupBy/groupByKey similar to sortByKey. 





[jira] [Commented] (SPARK-1945) Add full Java examples in MLlib docs

2014-07-13 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060244#comment-14060244
 ] 

Michael Yannakopoulos commented on SPARK-1945:
--

I am willing to provide a java example for decision trees as well as to enhance 
the java example provided in the naive-bayes section.
What is more, I would like to ask you if there is an equivalent class for 
scala/spark RowMatrix in the equivalent python api.
This is because I would like to provide examples in the dimensionality 
reduction section of mllib documentation using python.

Thanks,
Michael

> Add full Java examples in MLlib docs
> 
>
> Key: SPARK-1945
> URL: https://issues.apache.org/jira/browse/SPARK-1945
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Matei Zaharia
>  Labels: Starter
> Fix For: 1.0.0
>
>
> Right now some of the Java tabs only say the following:
> "All of MLlib’s methods use Java-friendly types, so you can import and call 
> them there the same way you do in Scala. The only caveat is that the methods 
> take Scala RDD objects, while the Spark Java API uses a separate JavaRDD 
> class. You can convert a Java RDD to a Scala one by calling .rdd() on your 
> JavaRDD object."
> Would be nice to translate the Scala code into Java instead.
> Also, a few pages (most notably the Matrix one) don't have Java examples at 
> all.





[jira] [Commented] (SPARK-1945) Add full Java examples in MLlib docs

2014-07-13 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060243#comment-14060243
 ] 

Michael Yannakopoulos commented on SPARK-1945:
--

Hello guys,

I have provided Java examples for the following documentation files:
mllib-clustering.md
mllib-collaborative-filtering.md
mllib-dimensionality-reduction.md
mllib-linear-methods.md
mllib-optimization.md

My pull request is: [https://github.com/apache/spark/pull/1311]
Enjoy and do not hesitate to contact me for any remark/correction.

Thanks,
Michael

> Add full Java examples in MLlib docs
> 
>
> Key: SPARK-1945
> URL: https://issues.apache.org/jira/browse/SPARK-1945
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Matei Zaharia
>  Labels: Starter
> Fix For: 1.0.0
>
>
> Right now some of the Java tabs only say the following:
> "All of MLlib’s methods use Java-friendly types, so you can import and call 
> them there the same way you do in Scala. The only caveat is that the methods 
> take Scala RDD objects, while the Spark Java API uses a separate JavaRDD 
> class. You can convert a Java RDD to a Scala one by calling .rdd() on your 
> JavaRDD object."
> Would be nice to translate the Scala code into Java instead.
> Also, a few pages (most notably the Matrix one) don't have Java examples at 
> all.





[jira] [Commented] (SPARK-2158) FileAppenderSuite is not cleaning up after itself

2014-07-13 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060201#comment-14060201
 ] 

Mark Hamstra commented on SPARK-2158:
-

This is fixed at 4cb33a83e0 from https://github.com/apache/spark/pull/1100


> FileAppenderSuite is not cleaning up after itself
> -
>
> Key: SPARK-2158
> URL: https://issues.apache.org/jira/browse/SPARK-2158
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Trivial
> Fix For: 1.1.0
>
>
> FileAppenderSuite is leaving behind the file core/stdout





[jira] [Resolved] (SPARK-2158) FileAppenderSuite is not cleaning up after itself

2014-07-13 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra resolved SPARK-2158.
-

Resolution: Fixed

> FileAppenderSuite is not cleaning up after itself
> -
>
> Key: SPARK-2158
> URL: https://issues.apache.org/jira/browse/SPARK-2158
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Trivial
> Fix For: 1.1.0
>
>
> FileAppenderSuite is leaving behind the file core/stdout





[jira] [Commented] (SPARK-2363) Clean MLlib's sample data files

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060177#comment-14060177
 ] 

Sean Owen commented on SPARK-2363:
--

I made myself useful with a PR for this one -- yes good cleanup: 
https://github.com/apache/spark/pull/1394

> Clean MLlib's sample data files
> ---
>
> Key: SPARK-2363
> URL: https://issues.apache.org/jira/browse/SPARK-2363
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
>
> MLlib has sample data under several folders:
> 1) data/mllib
> 2) data/
> 3) mllib/data/*
> Per previous discussion with [~matei], we want to put them under `data/mllib` 
> and clean outdated files.





[jira] [Resolved] (SPARK-1949) Servlet 2.5 vs 3.0 conflict in SBT build

2014-07-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1949.
--

Resolution: Won't Fix

Obsoleted by SBT build changes.

> Servlet 2.5 vs 3.0 conflict in SBT build
> 
>
> Key: SPARK-1949
> URL: https://issues.apache.org/jira/browse/SPARK-1949
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Sean Owen
>Priority: Minor
>
> [~kayousterhout] mentioned that:
> {quote}
> I had some trouble compiling an application (Shark) against Spark 1.0,
> where Shark had a runtime exception (at the bottom of this message) because
> it couldn't find the javax.servlet classes.  SBT seemed to have trouble
> downloading the servlet APIs that are dependencies of Jetty (used by the
> Spark web UI), so I had to manually add them to the application's build
> file:
> libraryDependencies += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"
> Not exactly sure why this happens but thought it might be useful in case
> others run into the same problem.
> {quote}
> This is a symptom of Servlet API conflict which we battled in the Maven 
> build. The resolution is to nix Servlet 2.5 and odd old Jetty / Netty 3.x 
> dependencies. It looks like the Hive part of the assembly in the SBT build 
> doesn't exclude all these entirely.
> I'll open a suggested PR to band-aid the SBT build.
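For reference, a hedged sketch of the kind of band-aid described above for an application's SBT build; the dependency and the excluded organization are illustrative, not taken from Spark's actual build files:

{code}
// Exclude the old Jetty/Servlet 2.5 artifacts pulled in transitively.
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.0.0" excludeAll(
  ExclusionRule(organization = "org.mortbay.jetty")
)
{code}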





[jira] [Commented] (SPARK-2158) FileAppenderSuite is not cleaning up after itself

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060161#comment-14060161
 ] 

Sean Owen commented on SPARK-2158:
--

I tried to clean this up a while ago, but I think that predates your comment. 
However, I don't see this file after running tests, nor do I see it being created. 
Is it maybe due to an unusual termination in the test?

> FileAppenderSuite is not cleaning up after itself
> -
>
> Key: SPARK-2158
> URL: https://issues.apache.org/jira/browse/SPARK-2158
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Mark Hamstra
>Assignee: Mark Hamstra
>Priority: Trivial
> Fix For: 1.1.0
>
>
> FileAppenderSuite is leaving behind the file core/stdout





[jira] [Commented] (SPARK-2278) groupBy & groupByKey should support custom comparator

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060158#comment-14060158
 ] 

Sean Owen commented on SPARK-2278:
--

Isn't this exactly what the first argument to groupBy and sortBy does? You 
define grouping and sorting on a transformation of the key instead of the key. 
It's not precisely what you mean but has the same effect. Maybe more 
importantly it matches Scala's collections API.
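For example, a sketch in Scala (normalize is a stand-in for whatever notion of key equality the caller needs, and sc is an existing SparkContext):

{code}
// Group case-insensitively by grouping on a transformation of the key,
// instead of supplying a custom comparator.
val pairs = sc.parallelize(Seq("Apple" -> 1, "apple" -> 2, "Banana" -> 3))

def normalize(k: String): String = k.toLowerCase

val grouped = pairs.groupBy { case (k, _) => normalize(k) }
// => ("apple", [("Apple", 1), ("apple", 2)]), ("banana", [("Banana", 3)])
{code}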

> groupBy & groupByKey should support custom comparator
> -
>
> Key: SPARK-2278
> URL: https://issues.apache.org/jira/browse/SPARK-2278
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Affects Versions: 1.0.0
>Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key 
> equality function in groupBy/groupByKey similar to sortByKey. 





[jira] [Commented] (SPARK-2354) BitSet Range Expanded when creating new one

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060156#comment-14060156
 ] 

Sean Owen commented on SPARK-2354:
--

These end up with the same effect.

Let's say A is created with numBits=50 and B is created with numBits=70. A will 
have a capacity of 64 and B will have a capacity of 128, since they internally 
allocate 1 and 2 longs of storage, respectively.

A|B needs to accommodate at least 70 bits, yes. Whether it is created with 
numBits=70 (your suggestion) or numBits=128 (the current code), you end up with 
a capacity of 128.

Nothing is being expanded needlessly; the result is the same.
I think the current code is fine.
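A small sketch of the rounding involved, assuming the 64-bit-word layout described above:

{code}
// Capacity is rounded up to a whole number of 64-bit words, so numBits = 70
// and numBits = 128 both end up with a capacity of 128 bits (2 longs).
def capacity(numBits: Int): Int = {
  val numWords = (numBits + 63) >> 6  // words needed to hold numBits
  numWords << 6                       // capacity in bits
}

capacity(50)   // 64
capacity(70)   // 128
capacity(128)  // 128
{code}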

> BitSet Range Expanded when creating new one
> ---
>
> Key: SPARK-2354
> URL: https://issues.apache.org/jira/browse/SPARK-2354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Yijie Shen
>Priority: Minor
>
> BitSet has a constructor parameter named "numBits: Int" that indicates the 
> number of bits inside.
> There is also a function called "capacity" which represents the number of long 
> words used to hold the bits.
> When creating a new BitSet, for example in '|', I thought the newly created one 
> shouldn't be sized by the longer word length; instead, it should use the 
> longer set's number of bits:
> {code}def |(other: BitSet): BitSet = {
> val newBS = new BitSet(math.max(numBits, other.numBits)) 
> // I know by now the numBits isn't a field
> {code}
> Is there some other reason to expand the BitSet range that I don't know about?





[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060152#comment-14060152
 ] 

Sean Owen commented on SPARK-2356:
--

This isn't specific to Spark: 
http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path

And if you look at when this code is called in SparkContext, it's from the 
hadoopRDD() method. You will certainly end up using Hadoop code if your code 
accesses Hadoop functionality, so I think it is behaving as expected.
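A commonly cited workaround for local runs on Windows (an assumption on my part, not something verified in this thread) is to point Hadoop at a directory containing bin\winutils.exe before the SparkContext is created:

{code}
// Hypothetical local path; the directory must contain bin\winutils.exe.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val sc = new org.apache.spark.SparkContext("local[2]", "winutils-workaround")
{code}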

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Kostiantyn Kudriavtsev
>
> I'm trying to run some transformations on Spark. It works fine on a cluster 
> (YARN, Linux machines). However, when I try to run it on a local machine 
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files 
> from the local filesystem):
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> This happens because the Hadoop config is initialised each time a Spark 
> context is created, regardless of whether Hadoop is required or not.
> I propose adding a special flag to indicate whether the Hadoop config is required 
> (or starting this configuration manually).





[jira] [Commented] (SPARK-2414) Remove jquery

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060151#comment-14060151
 ] 

Sean Owen commented on SPARK-2414:
--

Note that jQuery is MIT licensed. It's fine to include its source but the Spark 
LICENSE file needs to reference it and its license if it's kept. Take a look 
for the section in that file, and see 
http://www.apache.org/dev/licensing-howto.html

Or of course removing it moots the point.

> Remove jquery
> -
>
> Key: SPARK-2414
> URL: https://issues.apache.org/jira/browse/SPARK-2414
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Minor
>
> SPARK-2384 introduces jquery for tooltip display. We can probably just create 
> a very simple javascript for tooltip instead of pulling in jquery. 
> https://github.com/apache/spark/pull/1314





[jira] [Commented] (SPARK-2442) Add a Hadoop Writable serializer

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060150#comment-14060150
 ] 

Sean Owen commented on SPARK-2442:
--

I think this duplicates https://issues.apache.org/jira/browse/SPARK-2421

> Add a Hadoop Writable serializer
> 
>
> Key: SPARK-2442
> URL: https://issues.apache.org/jira/browse/SPARK-2442
> Project: Spark
>  Issue Type: Bug
>Reporter: Hari Shreedharan
>
> Using data read from hadoop files in shuffles can cause exceptions with the 
> following stacktrace:
> {code}
> java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:679)
> {code}
> This, though, seems to go away if the Kryo serializer is used. I am wondering if 
> adding a Hadoop-Writables-friendly serializer makes sense, as it is likely to 
> perform better than Kryo without registration; since Writables don't 
> implement Serializable, the serialization might not be the most efficient.
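For reference, a minimal sketch of switching to Kryo, which the description above reports avoids the exception; the registrator class named in the comment is hypothetical:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("writable-shuffle")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional, for better performance with known Writable types:
  // .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")
{code}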





[jira] [Commented] (SPARK-524) spark integration issue with Cloudera hadoop

2014-07-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060145#comment-14060145
 ] 

Nicholas Chammas commented on SPARK-524:


+1 for cleanup of issues that likely have no further action.

> spark integration issue with Cloudera hadoop
> 
>
> Key: SPARK-524
> URL: https://issues.apache.org/jira/browse/SPARK-524
> Project: Spark
>  Issue Type: Bug
>Reporter: openreserach
>
> Hi, 
> 1. I am using a single EC2 instance with pre-built Mesos (ami-0fcb7966) (same 
> issue if I build Mesos from source code in a local VM).
> 2. Followed the instructions on 
> https://github.com/mesos/spark/wiki/Running-spark-on-mesos with some tweaks.
> 3. I installed Cloudera cdhu5 via yum (not using the pre-built Hadoop due to lack of 
> documentation).
> 4. ./spark-shell.sh
> import spark._
> val sc = new SparkContext("localhost:5050","passwd")
> val ec2 = sc.textFile("hdfs://localhost:8020/tmp/passwd")
> IF I keep val HADOOP_VERSION = "0.20.205.0" in project/SparkBuild.scala
> at val file = sc.textFile("hdfs://localhost:8020/tmp/passwd")
> I am getting error
> Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. 
> (client = 61, server = 63)
> IF I set val HADOOP_VERSION = "0.20.2-cdh3u5" or val HADOOP_VERSION = 
> "0.20.2-cdh3u3" 
> I am getting error at  ec2.count()
> ERROR spark.SimpleJob: Task 0:0 failed more than 4 times; aborting job
> like the one reported at 
> http://mail-archives.apache.org/mod_mbox/incubator-mesos-dev/201108.mbox/%3cbd25ae7a-c9dc-4020-ad40-41c66dcaa...@eecs.berkeley.edu%3E
> Please let me know if you cannot replicate this error, and give more 
> instructions on how Spark integrates with Cloudera Hadoop. 
> Thanks
> -QH





[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060127#comment-14060127
 ] 

Sean Owen commented on SPARK-2465:
--

https://github.com/apache/spark/pull/1393

> Use long as user / item ID for ALS
> --
>
> Key: SPARK-2465
> URL: https://issues.apache.org/jira/browse/SPARK-2465
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Priority: Minor
>
> I'd like to float this for consideration: use longs instead of ints for user 
> and product IDs in the ALS implementation.
> The main reason for this is that identifiers are not generally numeric at all, and 
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
> means collisions are likely after hundreds of thousands of users and items, 
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int 
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per 
> Rating.
> Thoughts? I will post a PR so as to show what the change would be.





[jira] [Created] (SPARK-2465) Use long as user / item ID for ALS

2014-07-13 Thread Sean Owen (JIRA)
Sean Owen created SPARK-2465:


 Summary: Use long as user / item ID for ALS
 Key: SPARK-2465
 URL: https://issues.apache.org/jira/browse/SPARK-2465
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.1
Reporter: Sean Owen
Priority: Minor


I'd like to float this for consideration: use longs instead of ints for user 
and product IDs in the ALS implementation.

The main reason for this is that identifiers are not generally numeric at all, and 
will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
means collisions are likely after hundreds of thousands of users and items, 
which is not unrealistic. Hashing to 64 bits pushes this back to billions.

It would also mean numeric IDs that happen to be larger than the largest int 
can be used directly as identifiers.

On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating.

Thoughts? I will post a PR so as to show what the change would be.
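To make the proposal concrete, a sketch of the type it would affect (the current definition is paraphrased from MLlib's recommendation package):

{code}
// Today (Spark 1.0): case class Rating(user: Int, product: Int, rating: Double)
// With the change proposed here:
case class Rating(user: Long, product: Long, rating: Double)
{code}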





[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-13 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060113#comment-14060113
 ] 

Mridul Muralidharan commented on SPARK-2398:



As discussed in the PR, I am attempting to list the various factors which 
contribute to overhead.
Note, this is not exhaustive (yet) - please add more to this JIRA - so that 
when we are reasonably sure, we can model the expected overhead based on these 
factors.

These factors are typically off-heap - since anything within the heap is budgeted 
for by Xmx and enforced by the VM - and so should ideally (though not always in 
practice; see GC overheads) not exceed the Xmx value.

1) 256 KB per socket accepted via ConnectionManager for inter-worker comm 
(setReceiveBufferSize)
Typically, there will be (numExecutor - 1) number of sockets open.

2) 128 KB per socket for writing output to dfs. For reads, this does not seem 
to be configured - and should be 8k per socket iirc.
Typically 1 per executor at a given point in time ?

3) 256k for each akka socket for send/receive buffer.
One per worker ? (to talk to master) - so 512kb ? Any other use of akka ?

4) If I am not wrong, netty might allocate multiple "spark.akka.frameSize" 
sized direct buffer. There might be a few of these allocated and pooled/reused.
I did not go in detail into netty code though. If someone else with more 
knowhow can clarify, that would be great !
Default size of 10mb for spark.akka.frameSize

5) The default size of the assembled spark jar is about 12x mb (and changing) - 
though not all classes get loaded, the overhead would be some function of this.
The actual footprint would be higher than the on-disk size.
IIRC this is outside of the heap - [~sowen], any comments on this ? I have not 
looked into these in like 10 years now !

6) Per thread (Xss) overhead of 1mb (for 64bit vm).
Last I recall, we have about 220 odd threads - not sure if this was at the 
master or on the workers.
Of course, this depends on the various thread pools we use (IO, computation, 
etc.), the akka and netty config, etc.

7) Disk read overhead.
Thanks to [~pwendell]'s fix, at least for small files the overhead is not too 
high - since we do not mmap files but read them directly.
But for anything larger than 8kb (default), we use memory mapped buffers.
The actual overhead depends on the number of files opened for read via 
DiskStore - and the entire file contents get mmap'ed into virt mem.
Note that there is some non-virt-mem overhead also at native level for these 
buffers.

The actual number of files opened should be carefully tracked to understand its 
effect on Spark overhead, since this aspect has been changing a lot of late.
Impact is on shuffle,  disk persisted rdd, among others.
The actual value would be application dependent (how large the data is !)


8) The overhead introduced by VM not being able to reclaim memory completely 
(the cost of moving data vs amount of space reclaimed).
Ideally, this should be low - but would be dependent on the heap space, 
collector used, among other things.
I am not very knowledgeable about recent advances in GC collectors, so I 
hesitate to put a number on this.



I am sure this is not an exhaustive list, please do add to this.
In our case specifically, and [~tgraves] could add more, the number of 
containers can be high (300+ is easily possible), memory per container is 
modest (8gig usually).
To add details of observed overhead patterns (from the PR discussion) - 
a) I have had inhouse GBDT impl run without customizing overhead (so default of 
384 mb) with 12gb container and 22 nodes on reasonably large dataset.
b) I have had to customize overhead to 1.7gb for collaborative filtering with 
8gb container and 300 nodes (on a fairly large dataset).
c) I have had to minimally customize overhead to do inhouse QR factorization of 
a 50k x 50k distributed dense matrix on 45 nodes at 12 gb each (this was 
incorrectly specified in the PR discussion).
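To make the list above concrete, a rough tally for a single executor; every constant is taken from, or guessed from, the items above, so treat it as illustrative only:

{code}
val numExecutors = 300

val connectionBuffers = (numExecutors - 1) * 256L * 1024  // (1) inter-worker sockets
val dfsWriteBuffers   = 128L * 1024                       // (2) one DFS write socket
val akkaSocketBuffers = 2 * 256L * 1024                   // (3) akka send/receive
val akkaFrameBuffers  = 2 * 10L * 1024 * 1024             // (4) frame-sized direct buffers (guess)
val threadStacks      = 220L * 1024 * 1024                // (6) ~220 threads at 1 MB Xss each

val estimateMB = (connectionBuffers + dfsWriteBuffers + akkaSocketBuffers +
  akkaFrameBuffers + threadStacks) / (1024.0 * 1024.0)
// Items (5), (7) and (8) -- loaded classes, mmapped shuffle files, GC slack --
// are application dependent and left out.
{code}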

> Trouble running Spark 1.0 on Yarn 
> --
>
> Key: SPARK-2398
> URL: https://issues.apache.org/jira/browse/SPARK-2398
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Nishkam Ravi
>
> Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. 
> For example: SparkPageRank when run in standalone mode goes through without 
> any errors (tested for up to 30GB input dataset on a 6-node cluster).  Also 
> runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn 
> cluster mode) as the input data size is increased. Confirmed for 16GB input 
> dataset.
> The same workload runs fine with Spark 0.9 in both standalone and yarn 
> cluster mode (for up to 30 GB input dataset on a 6-node cluster).
> Commandline used:
> (/opt/cloudera/parcels/CDH/lib/spark

[jira] [Commented] (SPARK-524) spark integration issue with Cloudera hadoop

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060070#comment-14060070
 ] 

Sean Owen commented on SPARK-524:
-

Can I ask a meta-question? This JIRA is an example, but just one. I see 
hundreds of JIRAs that likely have no further action.

Some are likely obsoleted by time and subsequent changes, like this one -- CDH 
integration is much different now and presumably fixes this. Some are feature 
requests or changes that de facto don't have support and therefore won't be 
committed. These seem like they should be closed, for clarity. Bugs are riskier 
to close in case they identify a real issue that still exists.

Is there any momentum for, or anything I can do, to help clean up things like 
this just to start?

> spark integration issue with Cloudera hadoop
> 
>
> Key: SPARK-524
> URL: https://issues.apache.org/jira/browse/SPARK-524
> Project: Spark
>  Issue Type: Bug
>Reporter: openreserach
>
> Hi, 
> 1. I am using single EC2 instance with pre-built mesos (ami-0fcb7966) (Same 
> issue if I build mesos from source code in locall VM)
> 2. Follow instruction on 
> https://github.com/mesos/spark/wiki/Running-spark-on-mesos with some tweaks.
> 3. I install Cloudera cdhu5 by yum (not using pre-built hadoop due to lack of 
> document)
> 4. ./spartk-shell.sh
> import spark._
> val sc = new SparkContext("localhost:5050","passwd")
> val ec2 = sc.textFile("hdfs://localhost:8020/tmp/passwd")
> IF I keep val HADOOP_VERSION = "0.20.205.0" in project/SparkBuild.scala
> at val file = sc.textFile("hdfs://localhost:8020/tmp/passwd")
> I am getting error
> Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. 
> (client = 61, server = 63)
> IF I set val HADOOP_VERSION = "0.20.2-cdh3u5" or val HADOOP_VERSION = 
> "0.20.2-cdh3u3" 
> I am getting error at  ec2.count()
> ERROR spark.SimpleJob: Task 0:0 failed more than 4 times; aborting job
> like the one reported at 
> http://mail-archives.apache.org/mod_mbox/incubator-mesos-dev/201108.mbox/%3cbd25ae7a-c9dc-4020-ad40-41c66dcaa...@eecs.berkeley.edu%3E
> Please let me know if you cannot replicate this error, and give more 
> instruction on how Spark integrate with Cloudera Hadoop 
> Thanks
> -QH





[jira] [Commented] (SPARK-2382) build error:

2014-07-13 Thread Mukul Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060045#comment-14060045
 ] 

Mukul Jain commented on SPARK-2382:
---

Hi Sean,

Well, I thought it required more than a proxy-setting fix. If I have a chance 
I will try to reproduce it next week. If you want to close it in the 
meanwhile, that is fine.

I am not blocked anymore.

Thanks 

Sent from my iPhone



> build error: 
> -
>
> Key: SPARK-2382
> URL: https://issues.apache.org/jira/browse/SPARK-2382
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 1.0.0
> Environment: Ubuntu 12.0.4 precise. 
> spark@ubuntu-cdh5-spark:~/spark-1.0.0$ mvn -version
> Apache Maven 3.0.4
> Maven home: /usr/share/maven
> Java version: 1.6.0_31, vendor: Sun Microsystems Inc.
> Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "3.11.0-15-generic", arch: "amd64", family: "unix"
>Reporter: Mukul Jain
>  Labels: newbie
>
> Unable to build: Maven can't download a dependency. I checked my http_proxy and 
> https_proxy settings and they are working fine; other HTTP and HTTPS dependencies 
> were downloaded fine. The build process always gets stuck at this repository, and 
> manually downloading also fails with an exception. 
> [INFO] 
> 
> [INFO] Building Spark Project External MQTT 1.0.0
> [INFO] 
> 
> Downloading: 
> https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
> Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
> executeWithRetry
> INFO: I/O exception (java.net.ConnectException) caught when processing 
> request: Connection timed out
> Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
> executeWithRetry
> INFO: Retrying request





[jira] [Commented] (SPARK-2382) build error:

2014-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060036#comment-14060036
 ] 

Sean Owen commented on SPARK-2382:
--

It sounds like it was an issue with your proxy then, no? That is indeed common, 
but it is not related to Spark.

> build error: 
> -
>
> Key: SPARK-2382
> URL: https://issues.apache.org/jira/browse/SPARK-2382
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 1.0.0
> Environment: Ubuntu 12.0.4 precise. 
> spark@ubuntu-cdh5-spark:~/spark-1.0.0$ mvn -version
> Apache Maven 3.0.4
> Maven home: /usr/share/maven
> Java version: 1.6.0_31, vendor: Sun Microsystems Inc.
> Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "3.11.0-15-generic", arch: "amd64", family: "unix"
>Reporter: Mukul Jain
>  Labels: newbie
>
> Unable to build: Maven can't download a dependency. I checked my http_proxy and 
> https_proxy settings and they are working fine; other HTTP and HTTPS dependencies 
> were downloaded fine. The build process always gets stuck at this repository, and 
> manually downloading also fails with an exception. 
> [INFO] 
> 
> [INFO] Building Spark Project External MQTT 1.0.0
> [INFO] 
> 
> Downloading: 
> https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
> Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
> executeWithRetry
> INFO: I/O exception (java.net.ConnectException) caught when processing 
> request: Connection timed out
> Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
> executeWithRetry
> INFO: Retrying request


