[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227371#comment-14227371 ] Yu Ishikawa commented on SPARK-2429: Hi [~rnowling], Thank you for replying. {quote} I'm having trouble finding the function to cut a dendrogram – I see the tests but not the implementation. {quote} I'm very sorry. [Here|https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L390] is the implementation. I think the function is too specialized to the SciPy dendrogram format. {quote} If clusters at the same levels in the hierarchy do not overlap, you should be able to choose the closest cluster at each level until you find a leaf. I'm assuming that the children of a given cluster are contained within that cluster (spatially) – can you show this or find a reference for this? If so, then assignment should be faster for a larger number of clusters as Jun was saying above. {quote} Exactly. I agree with you. I implemented the function for assignment as a recursive function in O(log N) time. [https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L482] I compared the assignment performance of my implementation with that of KMeans; the elapsed time of this assignment implementation is slower than that of KMeans. The result is [here|https://issues.apache.org/jira/browse/SPARK-2429?focusedCommentId=14190166page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14190166]. The elapsed time for assignment (predict) is not long to begin with. For example, the assignment time of KMeans in O(N) is 0.011 \[sec\], and that of my implementation in O(log N) is 0.307 \[sec\]. I'm very sorry if I misunderstood your question. thanks, Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean, such as negative dot product or cosine, is necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
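For readers following the O(log N) assignment discussion above, here is a minimal, hypothetical sketch (not the HierarchicalClustering code linked above; the node structure and names are made up) of assigning a point by descending a cluster tree, choosing the closer child at each level until a leaf is reached:
{code}
// Hypothetical sketch only: descend the cluster tree, always following the
// closer child, so assignment needs O(log N) distance computations for N leaves.
import org.apache.spark.mllib.linalg.Vector

case class ClusterNode(center: Vector, children: Seq[ClusterNode]) {
  def isLeaf: Boolean = children.isEmpty
}

def squaredDistance(a: Vector, b: Vector): Double =
  a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum

@scala.annotation.tailrec
def assign(node: ClusterNode, point: Vector): ClusterNode =
  if (node.isLeaf) node
  else assign(node.children.minBy(c => squaredDistance(c.center, point)), point)
{code}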
[jira] [Commented] (SPARK-4628) Put all external projects behind a build flag
[ https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227372#comment-14227372 ] Sean Owen commented on SPARK-4628: -- Here are all the non-Central repos currently used: {code} <url>https://repository.apache.org/content/repositories/releases</url> <url>https://repository.jboss.org/nexus/content/repositories/releases</url> <url>https://repo.eclipse.org/content/repositories/paho-releases</url> <url>https://repository.cloudera.com/artifactory/cloudera-repos</url> <url>http://repository.mapr.com/maven</url> <url>https://repo.spring.io/libs-release</url> <url>https://oss.sonatype.org/content/repositories/orgspark-project-1085</url> <url>https://oss.sonatype.org/content/repositories/orgspark-project-1089/</url> <url>https://repository.apache.org/content/repositories/orgapachespark-1038/</url> {code} Last 3 are temporary. The vendor repos, well, separate question. Might be interesting to do the same exercise with anything else in the secondary repos, like see what breaks from a clean local repository if these don't exist. Put all external projects behind a build flag - Key: SPARK-4628 URL: https://issues.apache.org/jira/browse/SPARK-4628 Project: Spark Issue Type: Improvement Reporter: Patrick Wendell Priority: Blocker This is something we talked about doing for convenience, but I'm escalating this based on realizing today that some of our external projects depend on code that is not in maven central. I.e. if one of these dependencies is taken down (as happened recently with mqtt), all Spark builds will fail. The proposal here is simple, have a profile -Pexternal-projects that enables these. This can follow the exact pattern of -Pkinesis-asl which was disabled by default due to a license issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient
[ https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227376#comment-14227376 ] Lv, Qi commented on SPARK-4315: --- I'm interested in this issue, but I can't reproduce your problem. I constructed a very simple workload according to your description, like this: from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext(appName="test") sqlContext = SQLContext(sc) lines = sc.parallelize(range(100), 12) people = lines.map(lambda x: {"name": str(x % 1000), "age": x}) schemaPeople = sqlContext.inferSchema(people) schemaPeople.registerAsTable("people") grouped = schemaPeople.groupBy(lambda x: x.name) grouped.collect() I tested it over spark-1.1 (2f9b2bd) and spark-master (0fe54cff). It finished in 3-4 seconds on both Spark versions. After disabling _restore_object's cache (adding return _create_cls(dataType)(obj) as its first line), it becomes obviously slow (I waited minutes; no need to wait more). Could you please give me more detailed information? PySpark pickling of pyspark.sql.Row objects is extremely inefficient Key: SPARK-4315 URL: https://issues.apache.org/jira/browse/SPARK-4315 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: Ubuntu, Python 2.7, Spark 1.1.0 Reporter: Adam Davison Working with an RDD of pyspark.sql.Row objects, created by reading a file with SQLContext in a local PySpark context. Operations on the RDD, such as: data.groupBy(lambda x: x.field_name) are extremely slow (more than 10x slower than an equivalent Scala/Spark implementation). Obviously I expected it to be somewhat slower, but I did a bit of digging given the difference was so huge. Luckily it's fairly easy to add profiling to the Python workers. I see that the vast majority of time is spent in: spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object) It seems that this line attempts to accelerate pickling of Rows with the use of a cache. Some debugging reveals that this cache becomes quite big (100s of entries). Disabling the cache by adding: return _create_cls(dataType)(obj) as the first line of _restore_object made my query run 5x faster. Implying that the caching is not providing the desired acceleration... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227380#comment-14227380 ] Apache Spark commented on SPARK-4632: - User 'prabeesh' has created a pull request for this issue: https://github.com/apache/spark/pull/3495 Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4170) Closure problems when running Scala app that extends App
[ https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227385#comment-14227385 ] Sean Owen commented on SPARK-4170: -- Thanks [~boyork], I will propose a PR that resolves this with a bit of documentation somewhere. Closure problems when running Scala app that extends App -- Key: SPARK-4170 URL: https://issues.apache.org/jira/browse/SPARK-4170 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sean Owen Priority: Minor Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html): {code} object DemoBug extends App { val conf = new SparkConf() val sc = new SparkContext(conf) val rdd = sc.parallelize(List("A", "B", "C", "D")) val str1 = "A" val rslt1 = rdd.filter(x => { x != "A" }).count val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2) } {code} This produces the output: {code} DemoBug: rslt1 = 3 rslt2 = 0 {code} If instead there is a proper main(), it works as expected. This week I also noticed that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
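A minimal sketch of the workaround described above, moving the body into an explicit main() so that captured values are initialized before closures are serialized, might look like the following (hypothetical rewrite of the snippet, not the documentation change from the PR):
{code}
// Hypothetical fix: with an explicit main(), str1 is initialized before the
// closures capture it, so rslt2 equals rslt1 as expected.
import org.apache.spark.{SparkConf, SparkContext}

object DemoFixed {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"

    val rslt1 = rdd.filter(x => x != "A").count
    val rslt2 = rdd.filter(x => str1 != null && x != "A").count

    println("DemoFixed: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
    sc.stop()
  }
}
{code}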
[jira] [Created] (SPARK-4636) Cluster By Distribute By output different with Hive
Cheng Hao created SPARK-4636: Summary: Cluster By Distribute By output different with Hive Key: SPARK-4636 URL: https://issues.apache.org/jira/browse/SPARK-4636 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao This is a very interesting bug. Semantically, Cluster By / Distribute By will not cause a global ordering, as described in Hive wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy However, the partition keys are sorted in MapReduce after shuffle, so from the user point of view, the partition key itself is globally ordered, and it may look like: http://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
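To make the semantics concrete, here is a rough Scala sketch of the kind of query being discussed; the table and column names are placeholders, an existing SparkContext sc is assumed, and this is not taken from the associated PR:
{code}
// Illustrative only: CLUSTER BY key distributes rows by key and sorts within
// each partition, but it does not guarantee a total ordering of the output,
// even though in MapReduce the keys often look globally sorted after shuffle.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`
val clustered = hiveContext.sql("SELECT key, value FROM src CLUSTER BY key")
clustered.collect().foreach(println)
{code}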
[jira] [Comment Edited] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227402#comment-14227402 ] Prabeesh K edited comment on SPARK-4631 at 11/27/14 8:49 AM: - MQTT is known as protocol of IoT(Internet of Things). It is widely used in IoT area. was (Author: prabeeshk): MQTT is know as protocol of IoT(Internet of Things). It is widely used in IoT area. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4636) Cluster By Distribute By output different with Hive
[ https://issues.apache.org/jira/browse/SPARK-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227401#comment-14227401 ] Apache Spark commented on SPARK-4636: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3496 Cluster By Distribute By output different with Hive - Key: SPARK-4636 URL: https://issues.apache.org/jira/browse/SPARK-4636 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao This is a very interesting bug. Semantically, Cluster By / Distribute By will not cause a global ordering, as described in Hive wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy However, the partition keys are sorted in MapReduce after shuffle, so from the user point of view, the partition key itself is globally ordered, and it may look like: http://stackoverflow.com/questions/13715044/hive-cluster-by-vs-order-by-vs-sort-by -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227402#comment-14227402 ] Prabeesh K commented on SPARK-4631: --- MQTT is known as protocol of IoT(Internet of Things). It is widely used in IoT area. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
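Since the ask is a test that actually transfers data, here is a rough, hypothetical sketch of such a check. It assumes an MQTT broker is reachable at tcp://localhost:1883 (for example an embedded broker started in test setup); the topic name, timings, and assertion are made up, and this is not an actual Spark test suite.
{code}
// Hypothetical end-to-end check: publish messages with the Paho client and
// verify that MQTTUtils.createStream receives them.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persistence.MemoryPersistence

val brokerUrl = "tcp://localhost:1883"   // assumed local broker
val topic = "spark-mqtt-test"            // made-up topic name

val ssc = new StreamingContext("local[2]", "MQTTTest", Seconds(1))
val received = new java.util.concurrent.ConcurrentLinkedQueue[String]()

val stream = MQTTUtils.createStream(ssc, brokerUrl, topic)
stream.foreachRDD(rdd => rdd.collect().foreach(received.add))
ssc.start()

// Publish a few messages through the Paho blocking client.
val client = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
client.connect()
val mqttTopic = client.getTopic(topic)
(1 to 10).foreach(i => mqttTopic.publish(new MqttMessage(("msg-" + i).getBytes("UTF-8"))))
client.disconnect()

// A real suite would poll with a timeout instead of sleeping.
Thread.sleep(5000)
assert(!received.isEmpty, "no MQTT messages were received by the stream")
ssc.stop(stopSparkContext = true)
{code}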
[jira] [Created] (SPARK-4637) spark-1.1.0 does not compile any more
Olaf Flebbe created SPARK-4637: -- Summary: spark-1.1.0 does not compile any more Key: SPARK-4637 URL: https://issues.apache.org/jira/browse/SPARK-4637 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.1.0, 0.9.1 Reporter: Olaf Flebbe Priority: Critical Spark does not compile anymore since the dependency mqtt-client-0.4.0 has been removed from the eclipse repository. See yourself: https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/ and {code} spark-1.1.0$ grep -C2 mqtt-client ./external/mqtt/pom.xml <dependency> <groupId>org.eclipse.paho</groupId> <artifactId>mqtt-client</artifactId> <version>0.4.0</version> </dependency> {code} I did not find a different repository providing it. Since I accidentally removed my maven cache I cannot compile spark any more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4315) PySpark pickling of pyspark.sql.Row objects is extremely inefficient
[ https://issues.apache.org/jira/browse/SPARK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227427#comment-14227427 ] Adam Davison commented on SPARK-4315: - Sure, will try to say what I can. Unfortunately I don't think I can easily give you a sample of the data. If we can't figure it out I can try to produce a fake sample that still exhibits the problem. But first let me try to come up with a few possibly salient differences: 1. My data is very wide, about 80 columns. 2. The size of the resulting groups in the groupBy is very ragged, whereas yours here is very even. Probably exponentially distributed in my case or more extreme. I wonder if this is generating many different Row types somehow. 3. My Row objects are constructed via the parquet functions of SQLContext. In my debugging I noticed that the cache size was reaching hundreds of entries or more from printing the number of items in the dict. I'll also include part of the code I was using: conf = pyspark.SparkConf().setMaster("local[24]").setAppName("test") sc = pyspark.SparkContext(conf = conf) sqlc = pyspark.sql.SQLContext(sc) data = sqlc.parquetFile("/home/adam/parquet_test") def getnow(): return int(round(time.time() * 1000)) def applyfunc2(data): ... # some work which returns a list object print "CHECKPOINT 1: %i" % (getnow()) data.cache() junk = data.map(lambda x: 0).collect() # This part introduced to separate the timing of disk load and computation print "CHECKPOINT 2: %i" % (getnow()) grouped = data.groupBy(lambda x: x.unique_user_identifier) print "CHECKPOINT 3: %i" % (getnow()) calced = grouped.flatMap(applyfunc2) print "CHECKPOINT 4: %i" % (getnow()) counts = calced.collect() print "CHECKPOINT 5: %i" % (getnow()) PySpark pickling of pyspark.sql.Row objects is extremely inefficient Key: SPARK-4315 URL: https://issues.apache.org/jira/browse/SPARK-4315 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: Ubuntu, Python 2.7, Spark 1.1.0 Reporter: Adam Davison Working with an RDD of pyspark.sql.Row objects, created by reading a file with SQLContext in a local PySpark context. Operations on the RDD, such as: data.groupBy(lambda x: x.field_name) are extremely slow (more than 10x slower than an equivalent Scala/Spark implementation). Obviously I expected it to be somewhat slower, but I did a bit of digging given the difference was so huge. Luckily it's fairly easy to add profiling to the Python workers. I see that the vast majority of time is spent in: spark-1.1.0-bin-cdh4/python/pyspark/sql.py:757(_restore_object) It seems that this line attempts to accelerate pickling of Rows with the use of a cache. Some debugging reveals that this cache becomes quite big (100s of entries). Disabling the cache by adding: return _create_cls(dataType)(obj) as the first line of _restore_object made my query run 5x faster. Implying that the caching is not providing the desired acceleration... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4637) spark-1.1.0 does not compile any more
[ https://issues.apache.org/jira/browse/SPARK-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4637. -- Resolution: Duplicate spark-1.1.0 does not compile any more - Key: SPARK-4637 URL: https://issues.apache.org/jira/browse/SPARK-4637 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1, 1.1.0 Reporter: Olaf Flebbe Priority: Critical Spark does not compile anymore since the dependency mqtt-client-0.4.0 has been removed from the eclipse repository. See yourself: https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/ and {code} spark-1.1.0$ grep -C2 mqtt-client ./external/mqtt/pom.xml <dependency> <groupId>org.eclipse.paho</groupId> <artifactId>mqtt-client</artifactId> <version>0.4.0</version> </dependency> {code} I did not find a different repository providing it. Since I accidentally removed my maven cache I cannot compile spark any more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227432#comment-14227432 ] Olaf Flebbe commented on SPARK-4632: The patch uses a SNAPSHOT dependency which is a no-go for Release Builds Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4170) Closure problems when running Scala app that extends App
[ https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227431#comment-14227431 ] Apache Spark commented on SPARK-4170: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3497 Closure problems when running Scala app that extends App -- Key: SPARK-4170 URL: https://issues.apache.org/jira/browse/SPARK-4170 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sean Owen Priority: Minor Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html): {code} object DemoBug extends App { val conf = new SparkConf() val sc = new SparkContext(conf) val rdd = sc.parallelize(List("A", "B", "C", "D")) val str1 = "A" val rslt1 = rdd.filter(x => { x != "A" }).count val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2) } {code} This produces the output: {code} DemoBug: rslt1 = 3 rslt2 = 0 {code} If instead there is a proper main(), it works as expected. This week I also noticed that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227433#comment-14227433 ] Jinesh commented on SPARK-4631: --- MQTT is one of the most popular queuing protocols in IoT. I am from Amrita University. In our IoT project, we are using it as a connector between the sensor platform and the processing environment. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227432#comment-14227432 ] Olaf Flebbe edited comment on SPARK-4632 at 11/27/14 9:34 AM: -- At least one of the patches linked uses SNAPSHOT dependencies, which is a no-go for Release Builds was (Author: oflebbe): The patch uses a SNAPSHOT dependency which is a no-go for Release Builds Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4637) spark-1.1.0 does not compile any more
[ https://issues.apache.org/jira/browse/SPARK-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227441#comment-14227441 ] Benjamin Cabé commented on SPARK-4637: -- bq. I did not find a different repository providing it. Since I accidentally removed my maven cache I cannot compile spark any more. FWIW I downloaded 1.1.0 yesterday, and it built just fine, apparently getting mqtt 0.4.0 from spring.io repo. See http://repo.spring.io/webapp/search/artifact/?2q=mqtt and http://jcenter.bintray.com/org/eclipse/paho/mqtt-client/0.4.0/ spark-1.1.0 does not compile any more - Key: SPARK-4637 URL: https://issues.apache.org/jira/browse/SPARK-4637 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1, 1.1.0 Reporter: Olaf Flebbe Priority: Critical Spark does not compile anymore since the dependency mqtt-client-0.4.0 has been removed from the eclipse repository. See yourself: https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/ and {code} spark-1.1.0$ grep -C2 mqtt-client ./external/mqtt/pom.xml <dependency> <groupId>org.eclipse.paho</groupId> <artifactId>mqtt-client</artifactId> <version>0.4.0</version> </dependency> {code} I did not find a different repository providing it. Since I accidentally removed my maven cache I cannot compile spark any more. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227472#comment-14227472 ] Prabeesh K commented on SPARK-4631: --- They fixed the issue; now we have data in the [repo|https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/]. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227473#comment-14227473 ] Prabeesh K commented on SPARK-4632: --- They fixed the issue; now we have data in the [repo|https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/]. Now the older version works perfectly. Upgrade MQTT dependency to use latest mqtt-client - Key: SPARK-4632 URL: https://issues.apache.org/jira/browse/SPARK-4632 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2, 1.1.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is breaking Spark build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227506#comment-14227506 ] Prabeesh K commented on SPARK-4631: --- [~tdas] refer to [this link|http://dev.eclipse.org/mhonarc/lists/paho-dev/msg02291.html] for the discussion going on in the paho-dev forum. Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
madankumar s created SPARK-4638: --- Summary: Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries Key: SPARK-4638 URL: https://issues.apache.org/jira/browse/SPARK-4638 Project: Spark Issue Type: New Feature Components: MLlib Reporter: madankumar s SPARK MLlib Classification Module Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
[ https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] madankumar s updated SPARK-4638: Description: SPARK MLlib Classification Module: Add Kernel functionalities to SVM Classifier to find non linear patterns was: SPARK MLlib Classification Module Add Kernel functionalities to SVM Classifier to find non linear patterns Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries --- Key: SPARK-4638 URL: https://issues.apache.org/jira/browse/SPARK-4638 Project: Spark Issue Type: New Feature Components: MLlib Reporter: madankumar s Labels: Gaussian, Kernels, SVM SPARK MLlib Classification Module: Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
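As background for the request above, a minimal sketch of the Gaussian/RBF kernel itself is shown below. This is not an MLlib API (MLlib's current SVM implementation is linear); gamma is a hypothetical hyperparameter chosen by the user.
{code}
// Illustrative only: the RBF (Gaussian) kernel K(x, y) = exp(-gamma * ||x - y||^2).
import org.apache.spark.mllib.linalg.Vector

def rbfKernel(x: Vector, y: Vector, gamma: Double): Double = {
  val squaredDist = x.toArray.zip(y.toArray)
    .map { case (a, b) => (a - b) * (a - b) }
    .sum
  math.exp(-gamma * squaredDist)
}
{code}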
[jira] [Created] (SPARK-4639) Pass maxIterations in as a parameter in Analyzer
Jacky Li created SPARK-4639: --- Summary: Pass maxIterations in as a parameter in Analyzer Key: SPARK-4639 URL: https://issues.apache.org/jira/browse/SPARK-4639 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 fix a TODO in Analyzer: // TODO: pass this in as a parameter val fixedPoint = FixedPoint(100) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
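A rough sketch of the shape of this change (a hypothetical, self-contained illustration rather than the actual Catalyst source or the PR): make the fixed-point iteration limit a constructor parameter instead of the literal 100.
{code}
// Hypothetical sketch of the idea, not the Catalyst source.
case class FixedPoint(maxIterations: Int)

class Analyzer(maxIterations: Int = 100) {
  // Previously hard-coded as FixedPoint(100) with a TODO to parameterize it.
  val fixedPoint = FixedPoint(maxIterations)
}

// Callers can now tune the limit:
val analyzer = new Analyzer(maxIterations = 50)
{code}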
[jira] [Commented] (SPARK-4639) Pass maxIterations in as a parameter in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-4639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227744#comment-14227744 ] Apache Spark commented on SPARK-4639: - User 'jackylk' has created a pull request for this issue: https://github.com/apache/spark/pull/3499 Pass maxIterations in as a parameter in Analyzer Key: SPARK-4639 URL: https://issues.apache.org/jira/browse/SPARK-4639 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 fix a TODO in Analyzer: // TODO: pass this in as a parameter val fixedPoint = FixedPoint(100) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4640) FixedRangePartitioner for partitioning items with a known range
Kevin Mader created SPARK-4640: -- Summary: FixedRangePartitioner for partitioning items with a known range Key: SPARK-4640 URL: https://issues.apache.org/jira/browse/SPARK-4640 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kevin Mader For the large datasets I work with, it is common to have light-weight keys and very heavy values (integers and large double arrays for example). The key values are however known and unchanging. It would be nice if Spark had a built in partitioner which could take advantage of this. A FixedRangePartitioner[T](keys: Seq[T], partitions: Int) would be ideal. Furthermore this partitioner type could be extended to a PartitionerWithKnownKeys that had a getAllKeys function allowing for a list of keys to be obtained without querying through the entire RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
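A minimal sketch of what such a partitioner could look like (a hypothetical class, not an existing Spark API): it precomputes a key-to-partition map from the known key sequence, so getPartition becomes an O(1) lookup and the full key list stays available without scanning the RDD.
{code}
// Hypothetical FixedRangePartitioner: keys are known up front and mapped to
// roughly equal-sized contiguous ranges of partitions. Illustrative only.
import org.apache.spark.Partitioner

class FixedRangePartitioner[T](keys: Seq[T], partitions: Int) extends Partitioner {
  require(partitions > 0, "Number of partitions must be positive")
  require(keys.nonEmpty, "Key sequence must not be empty")

  private val keyToPartition: Map[Any, Int] =
    keys.zipWithIndex.map { case (k, i) =>
      (k: Any) -> (i * partitions / keys.size)
    }.toMap

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int =
    keyToPartition.getOrElse(key, 0)  // unknown keys fall back to partition 0

  // The PartitionerWithKnownKeys idea mentioned above: expose the key list
  // without querying the RDD.
  def getAllKeys: Seq[T] = keys
}
{code}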
[jira] [Commented] (SPARK-4640) FixedRangePartitioner for partitioning items with a known range
[ https://issues.apache.org/jira/browse/SPARK-4640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227754#comment-14227754 ] Kevin Mader commented on SPARK-4640: I have code for both, that I could merge in, if there is interest. FixedRangePartitioner for partitioning items with a known range --- Key: SPARK-4640 URL: https://issues.apache.org/jira/browse/SPARK-4640 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kevin Mader For the large datasets I work with, it is common to have light-weight keys and very heavy values (integers and large double arrays for example). The key values are however known and unchanging. It would be nice if Spark had a built in partitioner which could take advantage of this. A FixedRangePartitioner[T](keys: Seq[T], partitions: Int) would be ideal. Furthermore this partitioner type could be extended to a PartitionerWithKnownKeys that had a getAllKeys function allowing for a list of keys to be obtained without querying through the entire RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4170) Closure problems when running Scala app that extends App
[ https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-4170. --- Resolution: Fixed Assignee: Sean Owen Closure problems when running Scala app that extends App -- Key: SPARK-4170 URL: https://issues.apache.org/jira/browse/SPARK-4170 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html): {code} object DemoBug extends App { val conf = new SparkConf() val sc = new SparkContext(conf) val rdd = sc.parallelize(List("A", "B", "C", "D")) val str1 = "A" val rslt1 = rdd.filter(x => { x != "A" }).count val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2) } {code} This produces the output: {code} DemoBug: rslt1 = 3 rslt2 = 0 {code} If instead there is a proper main(), it works as expected. This week I also noticed that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227842#comment-14227842 ] Patrick Wendell commented on SPARK-4598: Having sorting with pagination seems very difficult to do correctly since we rely on javascript for sorting in the frontend. It would be helpful to understand the exact memory requirements of serving hundreds of thousands of tasks. Where is the memory from? Can we just optimize the use of memory? We need to store all of those tasks anyway in the driver. Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Priority: Critical In the HistoryServer stage page, clicking the task href in Description triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4634) Enable metrics for each application to be gathered in one node
[ https://issues.apache.org/jira/browse/SPARK-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227848#comment-14227848 ] Masayoshi TSUZUKI commented on SPARK-4634: -- Sorry, GraphiteSink has already got the option prefix and it works fine. Enable metrics for each application to be gathered in one node -- Key: SPARK-4634 URL: https://issues.apache.org/jira/browse/SPARK-4634 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Metrics output is now like this: {noformat} - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2.driver.jvm.somevalue - app_2.driver.jvm.somevalue - ... {noformat} In current Spark, application names come to the top level, but we should be able to gather the application names under some top level node. For example, think of using graphite. When we use graphite, the application names are listed as top level nodes. Graphite can also collect OS metrics, and OS metrics can be put under one node. But the current Spark metrics cannot. So, with the current Spark, the tree structure of metrics shown in graphite web UI is like this. {noformat} - os - os.node1.somevalue - os.node2.somevalue - ... - app_1 - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2 - ... - app_3 - ... {noformat} We should be able to add some top level name before the application name (the top level name may be the cluster name, for instance). If we make the name configurable by *.conf, it might also be convenient in case 2 different Spark clusters sink metrics to the same graphite server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4634) Enable metrics for each application to be gathered in one node
[ https://issues.apache.org/jira/browse/SPARK-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masayoshi TSUZUKI closed SPARK-4634. Resolution: Not a Problem GraphiteSink has already got the option prefix and it works fine. Enable metrics for each application to be gathered in one node -- Key: SPARK-4634 URL: https://issues.apache.org/jira/browse/SPARK-4634 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Metrics output is now like this: {noformat} - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2.driver.jvm.somevalue - app_2.driver.jvm.somevalue - ... {noformat} In current Spark, application names come to the top level, but we should be able to gather the application names under some top level node. For example, think of using graphite. When we use graphite, the application names are listed as top level nodes. Graphite can also collect OS metrics, and OS metrics can be put under one node. But the current Spark metrics cannot. So, with the current Spark, the tree structure of metrics shown in graphite web UI is like this. {noformat} - os - os.node1.somevalue - os.node2.somevalue - ... - app_1 - app_1.driver.jvm.somevalue - app_1.driver.jvm.somevalue - ... - app_2 - ... - app_3 - ... {noformat} We should be able to add some top level name before the application name (the top level name may be the cluster name, for instance). If we make the name configurable by *.conf, it might also be convenient in case 2 different Spark clusters sink metrics to the same graphite server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4626: --- Description: {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated was: 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl -
[jira] [Resolved] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4626. Resolution: Fixed Fix Version/s: 1.2.0 NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso Fix For: 1.2.0 {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
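For context on this class of failure: the exception comes from calling apply() on a mutable HashMap for an executor id that has already been removed after the executor was lost. A generic, hypothetical sketch of the defensive lookup pattern that avoids crashing the actor is below; this is not the actual SPARK-4626 patch, and the map and method names are made up.
{code}
// Hypothetical illustration: prefer get/match over apply() when the key may
// have been removed concurrently (for example by an executor-lost event).
import scala.collection.mutable

val freedCores = mutable.HashMap[String, Int]()

def onStatusUpdate(executorId: String, cores: Int): Unit = {
  freedCores.get(executorId) match {
    case Some(existing) =>
      freedCores(executorId) = existing + cores
    case None =>
      // Executor already removed; log and ignore instead of throwing.
      println(s"Ignoring status update for unknown executor $executorId")
  }
}
{code}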
[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4626: --- Description: {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: {code} 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated {code} was: {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso Fix For: 1.2.0 {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: {code} 26 Nov 2014 06:38:20,330 ERROR
[jira] [Created] (SPARK-4641) A FileNotFoundException happened in Hash Shuffle Manager
hzw created SPARK-4641: -- Summary: A FileNotFoundException happened in Hash Shuffle Manager Key: SPARK-4641 URL: https://issues.apache.org/jira/browse/SPARK-4641 Project: Spark Issue Type: Bug Components: Input/Output, Shuffle Environment: A WordCount Example with some special text input (normal words text) Reporter: hzw Using Hash Shuffle without consolidateFiles, it throws an exception like this: java.io.IOException: Error in reading org.apache.spark.network.FileSegmentManagedBuffer .. (actual file length 0) Caused by: java.io.FileNotFoundException: (No such file or directory) And using Hash Shuffle with consolidateFiles, it throws another exception: java.io.IOException: PARSING_ERROR(2) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4642) Documents about running-on-YARN needs update
Masayoshi TSUZUKI created SPARK-4642: Summary: Documents about running-on-YARN needs update Key: SPARK-4642 URL: https://issues.apache.org/jira/browse/SPARK-4642 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Priority: Minor Documents about running-on-YARN needs update There are some parameters missing from the running-on-YARN documentation page. We need to add descriptions for the following parameters: - spark.yarn.report.interval - spark.yarn.queue - spark.yarn.user.classpath.first - spark.yarn.scheduler.reporterThread.maxFailures And the description of the default for this parameter is not strictly accurate: - spark.yarn.submit.file.replication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4642) Documents about running-on-YARN needs update
[ https://issues.apache.org/jira/browse/SPARK-4642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228017#comment-14228017 ] Apache Spark commented on SPARK-4642: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/3500 Documents about running-on-YARN needs update Key: SPARK-4642 URL: https://issues.apache.org/jira/browse/SPARK-4642 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.0 Reporter: Masayoshi TSUZUKI Priority: Minor Documents about running-on-YARN needs update There are some parameters missing from the running-on-YARN documentation page. We need to add descriptions for the following parameters: - spark.yarn.report.interval - spark.yarn.queue - spark.yarn.user.classpath.first - spark.yarn.scheduler.reporterThread.maxFailures And the description of the default for this parameter is not strictly accurate: - spark.yarn.submit.file.replication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4613) Make JdbcRDD easier to use from Java
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-4613. -- Resolution: Fixed Fix Version/s: 1.2.0 Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Matei Zaharia Assignee: Cheng Lian Fix For: 1.2.0 We might eventually deprecate it, but for now it would be nice to expose a Java wrapper that allows users to create this using the java function interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
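For reference, the existing Scala constructor is function-heavy, which is what makes it awkward to call from Java. A minimal usage sketch is below; the JDBC URL, query, and bounds are placeholders, and an existing SparkContext sc is assumed.
{code}
// Sketch of current Scala-side JdbcRDD usage (placeholders, not real data).
// The Java wrapper proposed in this ticket exposes the same constructor
// through the Java function interfaces.
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,                                                        // existing SparkContext
  () => DriverManager.getConnection("jdbc:h2:mem:testdb"),   // connection factory
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",   // query with two ? bounds
  1, 100, 3,                                                 // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getInt(1), rs.getString(2)))        // row mapper
{code}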
[jira] [Updated] (SPARK-4613) Make JdbcRDD easier to use from Java
[ https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-4613: - Issue Type: Improvement (was: Bug) Make JdbcRDD easier to use from Java Key: SPARK-4613 URL: https://issues.apache.org/jira/browse/SPARK-4613 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Assignee: Cheng Lian Fix For: 1.2.0 We might eventually deprecate it, but for now it would be nice to expose a Java wrapper that allows users to create this using the java function interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228043#comment-14228043 ] meiyoula commented on SPARK-4598: - Yeah, optimizing memory use might resolve the problem once, but it is not an effective solution. Sorting happens before pagination, so that is not a problem. Using pagination in the HistoryServer Spark UI can lower the memory requirements, so why not do this? It would improve Spark cluster capabilities and benefit Spark users. Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Priority: Critical On the HistoryServer stage page, clicking the task link in Description triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
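A minimal sketch of the kind of server-side pagination being discussed, with hypothetical names; the idea is that the stage page would build HTML only for the requested slice of task rows rather than all of them at once.
{code}
// Hypothetical helper, not the actual Spark UI code: select one page of task rows.
case class TaskRow(taskId: Long, status: String, durationMs: Long)

def pageOf(tasks: Seq[TaskRow], page: Int, pageSize: Int = 100): Seq[TaskRow] = {
  require(page >= 1 && pageSize > 0, "page is 1-based and pageSize must be positive")
  // Only this slice is rendered, so memory stays bounded even with 100,000 tasks.
  tasks.slice((page - 1) * pageSize, page * pageSize)
}

// e.g. pageOf(allTasks, page = 3) keeps rows 201-300 only.
{code}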
[jira] [Updated] (SPARK-4533) SchemaRDD Api error: Can only subtract another SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Guo updated SPARK-4533: - Summary: SchemaRDD Api error: Can only subtract another SchemaRDD (was: Can only subtract another SchemaRDD) SchemaRDD Api error: Can only subtract another SchemaRDD Key: SPARK-4533 URL: https://issues.apache.org/jira/browse/SPARK-4533 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: JDK6/7 Reporter: Shawn Guo Priority: Minor There are two unexpected validations in the SchemaRDD APIs below. subtract(self, other, numPartitions=None) Can only subtract another SchemaRDD intersection(self, other) Can only intersect with another SchemaRDD "Can only subtract another SchemaRDD" is thrown when a SchemaRDD subtracts another type of RDD. Reproduce Steps: A = SchemaRDD B = SchemaRDD A_APX = A.keyBy(lambda line: None) B_APX = B.keyBy(lambda line: None) {color:red} CROSSED = A_APX.join(B_APX).map(lambda line: line[1]).filter(<filter condition>).map(lambda line: line[0]) {color} C = A.subtract(CROSSED) {color:red}#ERROR: Can only subtract another SchemaRDD{color} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4533) SchemaRDD API error: Can only subtract another SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-4533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Guo updated SPARK-4533: - Summary: SchemaRDD API error: Can only subtract another SchemaRDD (was: SchemaRDD Api error: Can only subtract another SchemaRDD) SchemaRDD API error: Can only subtract another SchemaRDD Key: SPARK-4533 URL: https://issues.apache.org/jira/browse/SPARK-4533 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: JDK6/7 Reporter: Shawn Guo Priority: Minor There are two unexpected validations in the SchemaRDD APIs below. subtract(self, other, numPartitions=None) Can only subtract another SchemaRDD intersection(self, other) Can only intersect with another SchemaRDD "Can only subtract another SchemaRDD" is thrown when a SchemaRDD subtracts another type of RDD. Reproduce Steps: A = SchemaRDD B = SchemaRDD A_APX = A.keyBy(lambda line: None) B_APX = B.keyBy(lambda line: None) {color:red} CROSSED = A_APX.join(B_APX).map(lambda line: line[1]).filter(<filter condition>).map(lambda line: line[0]) {color} C = A.subtract(CROSSED) {color:red}#ERROR: Can only subtract another SchemaRDD{color} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228105#comment-14228105 ] Apache Spark commented on SPARK-4626: - User 'roxchkplusony' has created a pull request for this issue: https://github.com/apache/spark/pull/3503 NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso Fix For: 1.2.0 {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: {code} 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228103#comment-14228103 ] Apache Spark commented on SPARK-4626: - User 'roxchkplusony' has created a pull request for this issue: https://github.com/apache/spark/pull/3502 NoSuchElementException in CoarseGrainedSchedulerBackend --- Key: SPARK-4626 URL: https://issues.apache.org/jira/browse/SPARK-4626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Victor Tso Assignee: Victor Tso Fix For: 1.2.0 {code} 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] OneForOneStrategy - key not found: 0 java.util.NoSuchElementException: key not found: 0 at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.mutable.HashMap.apply(HashMap.scala:64) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} This came on the heels of a lot of lost executors with error messages like: {code} 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
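The trace shows a mutable HashMap in the scheduler backend being indexed with apply for an executor that has already been removed. A hedged sketch of the usual defensive pattern for this class of bug follows; the names are hypothetical and this is not necessarily what the PRs above do.
{code}
import scala.collection.mutable

// Hypothetical stand-in for the executor bookkeeping map in CoarseGrainedSchedulerBackend.
case class ExecutorData(host: String, freeCores: Int)
val executorDataMap = mutable.HashMap[String, ExecutorData]()

def handleStatusUpdate(executorId: String, coresFreed: Int): Unit =
  executorDataMap.get(executorId) match {
    case Some(data) =>
      executorDataMap(executorId) = data.copy(freeCores = data.freeCores + coresFreed)
    case None =>
      // The executor was already removed (e.g. its Akka client disassociated); ignore the
      // stale message instead of letting HashMap.apply throw NoSuchElementException.
      println(s"Ignoring status update from unknown executor $executorId")
  }
{code}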
[jira] [Commented] (SPARK-4643) spark staging repository location outdated
[ https://issues.apache.org/jira/browse/SPARK-4643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228108#comment-14228108 ] Apache Spark commented on SPARK-4643: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3504 spark staging repository location outdated -- Key: SPARK-4643 URL: https://issues.apache.org/jira/browse/SPARK-4643 Project: Spark Issue Type: Improvement Components: Build Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4644) Implement skewed join
Shixiong Zhu created SPARK-4644: --- Summary: Implement skewed join Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Skewed data is not rare. For example, a book recommendation site may have several books that are liked by most of its users. Running ALS on such skewed data will raise an OutOfMemory error if some book has so many users that they cannot fit into memory. To solve this, we propose a skewed join implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
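A common way to implement a skewed join is to salt the hot keys: scatter the larger side's hot-key rows across N sub-keys and replicate the other side's hot-key rows N times so every sub-key still finds its match. The sketch below illustrates that technique with hypothetical names; the design document attached to this issue may take a different approach.
{code}
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// Hypothetical sketch of a salted join for one known hot key; not the proposed implementation.
def saltedJoin[V: ClassTag, W: ClassTag](
    left: RDD[(String, V)],
    right: RDD[(String, W)],
    hotKey: String,
    salts: Int): RDD[(String, (V, W))] = {
  // Large side: scatter hot-key rows across `salts` sub-keys so no single task holds them all.
  val saltedLeft = left.map { case (k, v) =>
    val salt = if (k == hotKey) Random.nextInt(salts) else 0
    ((k, salt), v)
  }
  // Small side: replicate hot-key rows once per salt so every sub-key can still be matched.
  val saltedRight = right.flatMap { case (k, w) =>
    val copies = if (k == hotKey) salts else 1
    (0 until copies).map(salt => ((k, salt), w))
  }
  saltedLeft.join(saltedRight).map { case ((k, _), pair) => (k, pair) }
}
{code}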
[jira] [Updated] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-4644: Attachment: Skewed Join Design Doc.pdf The design doc for the skewed join Implement skewed join - Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Attachments: Skewed Join Design Doc.pdf Skewed data is not rare. For example, a book recommendation site may have several books that are liked by most of its users. Running ALS on such skewed data will raise an OutOfMemory error if some book has so many users that they cannot fit into memory. To solve this, we propose a skewed join implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228113#comment-14228113 ] Apache Spark commented on SPARK-4644: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3505 Implement skewed join - Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Attachments: Skewed Join Design Doc.pdf Skewed data is not rare. For example, a book recommendation site may have several books that are liked by most of its users. Running ALS on such skewed data will raise an OutOfMemory error if some book has so many users that they cannot fit into memory. To solve this, we propose a skewed join implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4641) A FileNotFoundException happened in Hash Shuffle Manager
[ https://issues.apache.org/jira/browse/SPARK-4641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4641. -- Resolution: Duplicate A FileNotFoundException happened in Hash Shuffle Manager Key: SPARK-4641 URL: https://issues.apache.org/jira/browse/SPARK-4641 Project: Spark Issue Type: Bug Components: Input/Output, Shuffle Environment: A WordCount example with some special text input (normal words text) Reporter: hzw Using Hash Shuffle without consolidateFiles, it throws an exception like: java.io.IOException: Error in reading org.apache.spark.network.FileSegmentManagedBuffer .. (actual file length 0) Caused by: java.io.FileNotFoundException: (No such file or directory) And using Hash Shuffle with consolidateFiles, it throws a different exception: java.io.IOException: PARSING_ERROR(2) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
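For reference, the two configurations the report contrasts can be toggled with these settings; the values simply mirror the report and are not recommendations (hash was still the default shuffle manager in Spark 1.1).
{code}
import org.apache.spark.SparkConf

// Settings matching the two scenarios described in the report (illustrative only).
val hashWithoutConsolidation = new SparkConf()
  .set("spark.shuffle.manager", "hash")
  .set("spark.shuffle.consolidateFiles", "false") // scenario reporting FileNotFoundException

val hashWithConsolidation = new SparkConf()
  .set("spark.shuffle.manager", "hash")
  .set("spark.shuffle.consolidateFiles", "true")  // scenario reporting snappy PARSING_ERROR(2)
{code}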
[jira] [Created] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
Cheng Lian created SPARK-4645: - Summary: Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver Key: SPARK-4645 URL: https://issues.apache.org/jira/browse/SPARK-4645 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Priority: Blocker Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. So does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works well for normal JDBC clients like BeeLine, but throws exception when using Simba ODBC driver. Simba ODBC driver tries to execute two statement while connecting to Spark SQL HiveThriftServer2: - {{use `default`}} - {{set -v}} However, HiveThriftServer2 throws exception when executing them: {code} 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap space at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:84) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error running hive query: org.apache.hive.service.cli.HiveSQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
Java heap space at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:104) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
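A hedged sketch of reproducing the same two statements over JDBC against a locally running HiveThriftServer2; the host, port, empty credentials, and the use of the Hive JDBC driver (instead of the Simba ODBC driver from the report) are assumptions for illustration.
{code}
import java.sql.DriverManager

// Assumes HiveThriftServer2 is listening on localhost:10000 and hive-jdbc is on the classpath.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val stmt = conn.createStatement()
try {
  // The two statements the Simba ODBC driver issues while connecting.
  stmt.execute("use `default`")
  stmt.execute("set -v")
} finally {
  stmt.close()
  conn.close()
}
{code}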
[jira] [Created] (SPARK-4646) Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
Takeshi Yamamuro created SPARK-4646: --- Summary: Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark Key: SPARK-4646 URL: https://issues.apache.org/jira/browse/SPARK-4646 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor This patch simply replaces the native quick sort with Sorter (TimSort) in Spark. It yielded performance gains of ~8% in my quick experiments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
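For orientation, scala.util.Sorting.quickSort sorts an array in place with a quicksort, while the JDK's Arrays.sort for object arrays uses a TimSort; Spark's internal Sorter wraps a similar TimSort implementation. The snippet below is only a generic illustration of the two entry points, not the patch itself.
{code}
import java.util.Arrays
import scala.util.{Random, Sorting}

// Generic illustration: Scala's quicksort vs. the JDK's TimSort-based sort for object arrays.
val primitives: Array[Int] = Array.fill(1000000)(Random.nextInt())
val boxed: Array[java.lang.Integer] = primitives.map(Int.box)

Sorting.quickSort(primitives) // scala.util.Sorting: quicksort, in place
Arrays.sort(boxed)            // java.util.Arrays: TimSort for object arrays, in place
{code}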
[jira] [Commented] (SPARK-4646) Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
[ https://issues.apache.org/jira/browse/SPARK-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228129#comment-14228129 ] Apache Spark commented on SPARK-4646: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/3507 Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark -- Key: SPARK-4646 URL: https://issues.apache.org/jira/browse/SPARK-4646 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor This patch simply replaces the native quick sort with Sorter (TimSort) in Spark. It yielded performance gains of ~8% in my quick experiments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4645) Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver
[ https://issues.apache.org/jira/browse/SPARK-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228130#comment-14228130 ] Apache Spark commented on SPARK-4645: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3506 Asynchronous execution in HiveThriftServer2 with Hive 0.13.1 doesn't play well with Simba ODBC driver - Key: SPARK-4645 URL: https://issues.apache.org/jira/browse/SPARK-4645 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Priority: Blocker Hive 0.13.1 enables asynchronous execution for {{SQLOperation}} by default. So does Spark SQL HiveThriftServer2 when built with Hive 0.13.1. This works well for normal JDBC clients like BeeLine, but throws exception when using Simba ODBC driver. Simba ODBC driver tries to execute two statement while connecting to Spark SQL HiveThriftServer2: - {{use `default`}} - {{set -v}} However, HiveThriftServer2 throws exception when executing them: {code} 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error executing query: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Java heap space at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:309) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:84) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/11/28 15:18:37 ERROR SparkExecuteStatementOperation: Error running hive query: org.apache.hive.service.cli.HiveSQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
Java heap space at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$runInternal(Shim13.scala:104) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(Shim13.scala:224) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(Shim13.scala:234) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at