[jira] [Commented] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-11-26 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225886#comment-14225886
 ] 

Xiangrui Meng commented on SPARK-3995:
--

I think we can backport SPARK-4477, which removes numpy from sampling.

 [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
 -

 Key: SPARK-3995
 URL: https://issues.apache.org/jira/browse/SPARK-3995
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Jeremy Freeman
Assignee: Jeremy Freeman
Priority: Critical
 Fix For: 1.2.0


 There is a breaking bug in PySpark's sampling methods when run with NumPy 
 v1.9. This is the version of NumPy included with the current Anaconda 
 distribution (v2.1); this is a popular distribution, and is likely to affect 
 many users.
 Steps to reproduce are:
 {code:python}
 foo = sc.parallelize(range(1000),5)
 foo.takeSample(False, 10)
 {code}
 Returns:
 {code}
 PythonException: Traceback (most recent call last):
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, 
 line 79, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 196, in dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 127, in dump_stream
 for obj in iterator:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 185, in _batched
 for item in iterator:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 116, in func
  if self.getUniformSample(split) <= self._fraction:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 58, in getUniformSample
 self.initRandomGenerator(split)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 44, in initRandomGenerator
 self._random = numpy.random.RandomState(self._seed)
   File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
 (numpy/random/mtrand/mtrand.c:7397)
   File mtrand.pyx, line 646, in mtrand.RandomState.seed 
 (numpy/random/mtrand/mtrand.c:7697)
 ValueError: Seed must be between 0 and 4294967295
 {code}
 In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:
 {code:python}
 self._seed = seed if seed is not None else random.randint(0, sys.maxint)
 {code}
 In previous versions of NumPy a random seed larger than 2 ** 32 would 
 silently get truncated to 2 ** 32. This was fixed in a recent patch 
 (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
  But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, 
 which effectively breaks sampling operations in PySpark (unless the seed is 
 set manually).
 I am putting a PR together now (the fix is very simple!).
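 For reference, the bounding idea is just to keep the seed inside the 32-bit range 
 that {{numpy.random.RandomState}} accepts; a sketch of that idea in Scala (purely 
 illustrative, since the actual change belongs in {{rddsampler.py}}):
 {code}
 // Sketch only: mask an arbitrary 64-bit seed down to [0, 2^32 - 1],
 // the range accepted by numpy.random.RandomState.
 def boundSeed(rawSeed: Long): Long = rawSeed & 0xFFFFFFFFL

 boundSeed((1L << 53) + 1)  // => 1, always within the 32-bit range
 {code}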






[jira] [Created] (SPARK-4619) Double ms in ShuffleBlockFetcherIterator log

2014-11-26 Thread maji2014 (JIRA)
maji2014 created SPARK-4619:
---

 Summary: Double ms in ShuffleBlockFetcherIterator log
 Key: SPARK-4619
 URL: https://issues.apache.org/jira/browse/SPARK-4619
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.2
Reporter: maji2014
Priority: Minor


log as followings: 
ShuffleBlockFetcherIterator: Got local blocks in  8 ms ms

reason:
logInfo("Got local blocks in " + Utils.getUsedTimeMs(startTime) + " ms")
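Since Utils.getUsedTimeMs(startTime) already returns a string that ends in "ms" 
(with a leading space), the fix is presumably just to drop the extra literal; a 
minimal sketch:
{code}
// Utils.getUsedTimeMs already formats the elapsed time as e.g. " 8 ms",
// so appending another " ms" duplicates the unit.
logInfo("Got local blocks in" + Utils.getUsedTimeMs(startTime))
{code}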











[jira] [Commented] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225899#comment-14225899
 ] 

Sean Owen commented on SPARK-4584:
--

{{SecurityManager}} is something that loads of the JVM code consults, if it 
exists:

{code}
SecurityManager sm = SecurityManager.getSystemSecurityManager();
if (sm != null) {
  ...
}
{code}

Setting any {{SecurityManager}} is like turning on a whole lot of not-cheap 
permission checks throughout the JDK. I think setting one is pretty undesirable 
from a performance perspective. It also precludes the possibility of enabling a 
real SecurityManager for contexts that want to, although I find it unlikely that 
would ever really work with Spark.

How about just documenting and telling users "don't System.exit in your code", 
which is widely accepted as a no-no in Java/Scala anyway?
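For context, the kind of SecurityManager at issue here is presumably one installed 
only to intercept System.exit; a rough sketch of that pattern (not the actual 
Spark code):
{code}
// Hypothetical sketch: a SecurityManager installed solely to trap System.exit.
// Once any SecurityManager is set, every JDK permission check is routed through
// checkPermission, which is where the overhead described above comes from.
object ExitTrappingSecurityManager extends SecurityManager {
  override def checkExit(status: Int): Unit =
    throw new SecurityException(s"System.exit($status) intercepted")

  // Permit everything else; the JDK still pays the cost of calling in here.
  override def checkPermission(perm: java.security.Permission): Unit = {}
}

System.setSecurityManager(ExitTrappingSecurityManager)
{code}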

 2x Performance regression for Spark-on-YARN
 ---

 Key: SPARK-4584
 URL: https://issues.apache.org/jira/browse/SPARK-4584
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Nishkam Ravi
Assignee: Marcelo Vanzin
Priority: Blocker

 Significant performance regression observed for Spark-on-YARN (up to 2x) after 
 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 
  from Oct 7th. Problem can be reproduced with JavaWordCount against a large 
 enough input dataset in YARN cluster mode.






[jira] [Commented] (SPARK-4619) Double ms in ShuffleBlockFetcherIterator log

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225910#comment-14225910
 ] 

Apache Spark commented on SPARK-4619:
-

User 'maji2014' has created a pull request for this issue:
https://github.com/apache/spark/pull/3475

 Double ms in ShuffleBlockFetcherIterator log
 --

 Key: SPARK-4619
 URL: https://issues.apache.org/jira/browse/SPARK-4619
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.2
Reporter: maji2014
Priority: Minor

 log as followings: 
 ShuffleBlockFetcherIterator: Got local blocks in  8 ms ms
 reason:
 logInfo("Got local blocks in " + Utils.getUsedTimeMs(startTime) + " ms")






[jira] [Created] (SPARK-4620) Add unpersist in Graph/GraphImpl

2014-11-26 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-4620:
---

 Summary: Add unpersist in Graph/GraphImpl
 Key: SPARK-4620
 URL: https://issues.apache.org/jira/browse/SPARK-4620
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Takeshi Yamamuro
Priority: Trivial


Add an IF (interface) to uncache both vertices and edges of this graph.
This IF is useful when iterative graph operations build a new graph in each 
iteration, and the vertices and edges of previous iterations are no longer 
needed for following iterations.
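As a usage sketch of what such an unpersist would enable in an iterative job 
(the step function below is illustrative; today only unpersistVertices plus 
edges.unpersist covers both sides):
{code}
import org.apache.spark.graphx.Graph
import scala.reflect.ClassTag

// Illustrative sketch: drop the previous iteration's cached graph once the new
// one has been materialized. `step` stands for one iteration of the algorithm.
def iterate[VD: ClassTag, ED: ClassTag](
    initial: Graph[VD, ED],
    numIterations: Int)(step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var g = initial.cache()
  for (_ <- 0 until numIterations) {
    val next = step(g).cache()
    next.vertices.count()                  // materialize before uncaching the old graph
    g.unpersistVertices(blocking = false)  // existing API: vertices only
    g.edges.unpersist(blocking = false)    // the proposed unpersist would cover this too
    g = next
  }
  g
}
{code}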






[jira] [Commented] (SPARK-4620) Add unpersist in Graph/GraphImpl

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225924#comment-14225924
 ] 

Apache Spark commented on SPARK-4620:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/3476

 Add unpersist in Graph/GraphImpl
 

 Key: SPARK-4620
 URL: https://issues.apache.org/jira/browse/SPARK-4620
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Takeshi Yamamuro
Priority: Trivial

 Add an IF to uncache both vertices and edges of this graph.
 This IF is useful when iterative graph operations build a new graph in each 
 iteration, and the vertices and edges of previous iterations are no longer 
 needed for following iterations.






[jira] [Updated] (SPARK-4620) Add unpersist in Graph/GraphImpl

2014-11-26 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-4620:

Description: 
Add an IF to uncache both vertices and edges of Graph/GraphImpl.
This IF is useful when iterative graph operations build a new graph in each 
iteration, and the vertices and edges of previous iterations are no longer 
needed for following iterations.

  was:
Add an IF to uncache both vertices and edges of this graph.
This IF is useful when iterative graph operations build a new graph in each 
iteration, and the vertices and edges of previous iterations are no longer 
needed for following iterations.


 Add unpersist in Graph/GraphImpl
 

 Key: SPARK-4620
 URL: https://issues.apache.org/jira/browse/SPARK-4620
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Reporter: Takeshi Yamamuro
Priority: Trivial

 Add an IF to uncache both vertices and edges of Graph/GraphImpl.
 This IF is useful when iterative graph operations build a new graph in each 
 iteration, and the vertices and edges of previous iterations are no longer 
 needed for following iterations.






[jira] [Commented] (SPARK-4537) Add 'processing delay' and 'totalDelay' to the metrics reported by the Spark Streaming subsystem

2014-11-26 Thread Gerard Maas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225945#comment-14225945
 ] 

Gerard Maas commented on SPARK-4537:


I was waiting for internal clearance to put a PR out, but this is probably much 
faster. Thanks.

 Add 'processing delay' and 'totalDelay' to the metrics reported by the Spark 
 Streaming subsystem
 

 Key: SPARK-4537
 URL: https://issues.apache.org/jira/browse/SPARK-4537
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Gerard Maas
  Labels: metrics

 As the Spark Streaming tuning guide indicates, the key indicators of a 
 healthy streaming job are:
 - Processing Time
 - Total Delay
 The Spark UI page for the Streaming job [1] shows these two indicators but 
 the metrics source for Spark Streaming (StreamingSource.scala)  [2] does not.
 Adding these metrics will allow external monitoring systems that consume the 
 Spark metrics interface to track these two critical pieces of information on 
 a streaming job's performance.
 [1] 
 https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingPage.scala#L127
 [2] 
 https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingSource.scala
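 A sketch of what the addition could look like, using the Codahale metrics API that 
 the Spark metrics system is built on (the gauge name and the value supplier are 
 illustrative, not the final implementation):
 {code}
 import com.codahale.metrics.{Gauge, MetricRegistry}

 // Illustrative only: register a "total delay" gauge with the MetricRegistry.
 // `readTotalDelayMs` is a hypothetical supplier that would read the value from
 // the streaming listener backing the existing StreamingSource metrics.
 def registerTotalDelayGauge(registry: MetricRegistry, readTotalDelayMs: () => Long): Unit = {
   registry.register(
     MetricRegistry.name("streaming", "lastCompletedBatch_totalDelay"),
     new Gauge[Long] { override def getValue: Long = readTotalDelayMs() })
 }
 {code}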






[jira] [Resolved] (SPARK-4002) KafkaStreamSuite Kafka input stream case fails on OSX

2014-11-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4002.
--
Resolution: Cannot Reproduce

Yes, I haven't seen this in a while, so let's mark it Cannot Reproduce.

 KafkaStreamSuite Kafka input stream case fails on OSX
 ---

 Key: SPARK-4002
 URL: https://issues.apache.org/jira/browse/SPARK-4002
 Project: Spark
  Issue Type: Bug
  Components: Streaming
 Environment: Mac OSX 10.9.5.
Reporter: Ryan Williams
 Attachments: unit-tests.log


 [~sowen] mentioned this on spark-dev 
 [here|http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccamassdjs0fmsdc-k-4orgbhbfz2vvrmm0hfyifeeal-spft...@mail.gmail.com%3E]
  and I just reproduced it on {{master}} 
 ([7e63bb4|https://github.com/apache/spark/commit/7e63bb49c526c3f872619ae14e4b5273f4c535e9]).
 The relevant output I get when running {{./dev/run-tests}} is:
 {code}
 [info] KafkaStreamSuite:
 [info] - Kafka input stream *** FAILED ***
 [info]   3 did not equal 0 (KafkaStreamSuite.scala:135)
 [info] Test run started
 [info] Test 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream started
 [error] Test 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: 
 junit.framework.AssertionFailedError: expected:<3> but was:<0>
 [error] at junit.framework.Assert.fail(Assert.java:50)
 [error] at junit.framework.Assert.failNotEquals(Assert.java:287)
 [error] at junit.framework.Assert.assertEquals(Assert.java:67)
 [error] at junit.framework.Assert.assertEquals(Assert.java:199)
 [error] at junit.framework.Assert.assertEquals(Assert.java:205)
 [error] at 
 org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream(JavaKafkaStreamSuite.java:129)
 [error] ...
 [info] Test run finished: 1 failed, 0 ignored, 1 total, 14.451s
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128M; 
 support was removed in 8.0
 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1g; 
 support was removed in 8.0
 [info] ScalaTest
 [info] Run completed in 11 minutes, 39 seconds.
 [info] Total number of tests run: 1
 [info] Suites: completed 1, aborted 0
 [info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
 [info] *** 1 TEST FAILED ***
 [error] Failed: Total 2, Failed 2, Errors 0, Passed 0
 [error] Failed tests:
 [error]   org.apache.spark.streaming.kafka.JavaKafkaStreamSuite
 [error]   org.apache.spark.streaming.kafka.KafkaStreamSuite
 {code}
 The simplest command I know of that reproduces this test failure is:
 {code}
 mvn test -Dsuites='*KafkaStreamSuite'
 {code}
 Often I have to {{mvn clean}} before or as part of running that command, 
 otherwise I get other spurious compile errors or crashes, but that is another 
 story.
 Seems like this test should be {{@Ignore}}'d, or some note about this made in 
 the {{README.md}}.






[jira] [Created] (SPARK-4621) when sort- based shuffle, Cache recently used shuffle index can reduce indexFile's io

2014-11-26 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-4621:
---

 Summary: when sort- based shuffle, Cache recently used shuffle 
index can reduce indexFile's io
 Key: SPARK-4621
 URL: https://issues.apache.org/jira/browse/SPARK-4621
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Lianhui Wang
Priority: Critical


In IndexShuffleBlockManager, we can use an LRU cache to store recently finished 
shuffle indices, which can reduce index file I/O.
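A minimal sketch of the LRU idea (the cache key and size are illustrative; the 
real change would live inside IndexShuffleBlockManager):
{code}
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Minimal LRU cache via LinkedHashMap's access order + removeEldestEntry.
// The key could be (shuffleId, mapId) and the value the parsed index offsets,
// so hot index files are not re-read from disk for every reducer request.
class LruCache[K, V](maxEntries: Int)
  extends JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
  override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > maxEntries
}

// e.g. new LruCache[(Int, Int), Array[Long]](maxEntries = 1000)
{code}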






[jira] [Updated] (SPARK-4621) when sort- based shuffle, Cache recently finished shuffle index can reduce indexFile's io

2014-11-26 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-4621:

Summary: when sort- based shuffle, Cache recently finished shuffle index 
can reduce indexFile's io  (was: when sort- based shuffle, Cache recently used 
shuffle index can reduce indexFile's io)

 when sort- based shuffle, Cache recently finished shuffle index can reduce 
 indexFile's io
 -

 Key: SPARK-4621
 URL: https://issues.apache.org/jira/browse/SPARK-4621
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Lianhui Wang
Priority: Critical

 In IndexShuffleBlockManager, we can use an LRU cache to store recently finished 
 shuffle indices, which can reduce index file I/O.






[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-11-26 Thread zzc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225960#comment-14225960
 ] 

zzc commented on SPARK-2468:


Ah, Aaron Davidson, with patch #3465 I can successfully run a previously failed 
application, and my configuration is the same as before. It's great.

 Netty-based block server / client module
 

 Key: SPARK-2468
 URL: https://issues.apache.org/jira/browse/SPARK-2468
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Fix For: 1.2.0


 Right now shuffle send goes through the block manager. This is inefficient 
 because it requires loading a block from disk into a kernel buffer, then into 
 a user space buffer, and then back to a kernel send buffer before it reaches 
 the NIC. It does multiple copies of the data and context switching between 
 kernel/user space. It also creates unnecessary buffers in the JVM, which 
 increases GC pressure.
 Instead, we should use FileChannel.transferTo, which handles this in the 
 kernel space with zero-copy. See 
 http://www.ibm.com/developerworks/library/j-zerocopy/
 One potential solution is to use Netty.  Spark already has a Netty based 
 network module implemented (org.apache.spark.network.netty). However, it 
 lacks some functionality and is turned off by default. 
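 For reference, the zero-copy path described above boils down to 
 {{FileChannel.transferTo}}; a minimal sketch:
 {code}
 import java.io.FileInputStream
 import java.nio.channels.WritableByteChannel

 // Minimal zero-copy sketch: hand the file's bytes to the target channel in the
 // kernel, without copying them through user-space buffers in the JVM.
 def sendFileZeroCopy(path: String, target: WritableByteChannel): Long = {
   val channel = new FileInputStream(path).getChannel
   try {
     val size = channel.size()
     var position = 0L
     while (position < size) {
       position += channel.transferTo(position, size - position, target)
     }
     position
   } finally {
     channel.close()
   }
 }
 {code}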






[jira] [Commented] (SPARK-4528) [SQL] add comment support for Spark SQL CLI

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225989#comment-14225989
 ] 

Apache Spark commented on SPARK-4528:
-

User 'tsingfu' has created a pull request for this issue:
https://github.com/apache/spark/pull/3477

 [SQL] add comment support for Spark SQL CLI 
 

 Key: SPARK-4528
 URL: https://issues.apache.org/jira/browse/SPARK-4528
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Fuqing Yang
Priority: Minor
  Labels: features

 While using spark-sql, I found it does not support comments when writing SQL. 
 It returns an error if we add some comments, 
 for example:
 spark-sql 
   show tables; --list tables in current db;
 14/11/21 11:30:37 INFO parse.ParseDriver: Parsing command: show tables
 14/11/21 11:30:37 INFO parse.ParseDriver: Parse Completed
 14/11/21 11:30:37 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch MultiInstanceRelations
 14/11/21 11:30:37 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch CaseInsensitiveAttributeReferences
 ..
 14/11/21 11:30:38 INFO HiveMetaStore.audit: ugi=hadoop
 ip=unknown-ip-addr  cmd=get_tables: db=default pat=.*   
 14/11/21 11:30:38 INFO ql.Driver: /PERFLOG method=task.DDL.Stage-0 
 start=1416540638191 end=1416540638202 duration=11
 14/11/21 11:30:38 INFO ql.Driver: /PERFLOG method=runTasks 
 start=1416540638191 end=1416540638202 duration=11
 14/11/21 11:30:38 INFO ql.Driver: /PERFLOG method=Driver.execute 
 start=1416540638190 end=1416540638203 duration=13
 OK
 14/11/21 11:30:38 INFO ql.Driver: OK
 14/11/21 11:30:38 INFO ql.Driver: PERFLOG method=releaseLocks
 14/11/21 11:30:38 INFO ql.Driver: /PERFLOG method=releaseLocks 
 start=1416540638203 end=1416540638203 duration=0
 14/11/21 11:30:38 INFO ql.Driver: /PERFLOG method=Driver.run 
 start=1416540637998 end=1416540638203 duration=205
 14/11/21 11:30:38 INFO mapred.FileInputFormat: Total input paths to process : 
 1
 14/11/21 11:30:38 INFO ql.Driver: PERFLOG method=releaseLocks
 14/11/21 11:30:38 INFO ql.Driver: /PERFLOG method=releaseLocks 
 start=1416540638207 end=1416540638208 duration=1
 14/11/21 11:30:38 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch MultiInstanceRelations
 14/11/21 11:30:38 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch CaseInsensitiveAttributeReferences
 14/11/21 11:30:38 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch Check Analysis
 14/11/21 11:30:38 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for 
 batch Add exchange
 14/11/21 11:30:38 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for 
 batch Prepare Expressions
 dummy
 records
 tab_sogou10
 Time taken: 0.412 seconds
 14/11/21 11:30:38 INFO CliDriver: Time taken: 0.412 seconds
 14/11/21 11:30:38 INFO parse.ParseDriver: Parsing command:  --list tables in 
 current db
 NoViableAltException(-1@[])
   at 
 org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:902)
 Comment support is widely used in projects, so it is necessary to add this 
 feature.
 This implementation can be achieved in the source 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.scala.
 We may need to support three comment styles:
 From a ‘#’ character to the end of the line.
 From a ‘-- ’ sequence to the end of the line
 From a /* sequence to the following */ sequence, as in the C programming 
 language.
 This syntax allows a comment to extend over multiple lines because the 
 beginning and closing sequences need not be on the same line.
 What do you think?
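 A rough sketch of the kind of pre-processing this implies before a statement is 
 handed to the driver (deliberately simplified: it does not respect comment markers 
 that appear inside string literals):
 {code}
 // Simplified sketch of stripping the three comment styles. A real implementation
 // in SparkSQLCLIDriver would also have to respect quoted string literals.
 def stripComments(sql: String): String = {
   val noBlockComments = sql.replaceAll("(?s)/\\*.*?\\*/", " ")   // /* ... */
   noBlockComments.split("\n")
     .map(_.replaceAll("--.*$", "").replaceAll("#.*$", ""))       // -- and # line comments
     .mkString("\n")
     .trim
 }

 // stripComments("show tables; --list tables in current db;")  // => "show tables;"
 {code}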






[jira] [Commented] (SPARK-4613) Make JdbcRDD easier to use from Java

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226007#comment-14226007
 ] 

Apache Spark commented on SPARK-4613:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/3478

 Make JdbcRDD easier to use from Java
 

 Key: SPARK-4613
 URL: https://issues.apache.org/jira/browse/SPARK-4613
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Cheng Lian

 We might eventually deprecate it, but for now it would be nice to expose a 
 Java wrapper that allows users to create this using the java function 
 interface.






[jira] [Updated] (SPARK-3546) InputStream of ManagedBuffer is not closed and causes running out of file descriptor

2014-11-26 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3546:
--
Target Version/s: 1.2.0  (was: 1.1.1, 1.2.0)

 InputStream of ManagedBuffer is not closed and causes running out of file 
 descriptor
 

 Key: SPARK-3546
 URL: https://issues.apache.org/jira/browse/SPARK-3546
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Critical
 Fix For: 1.2.0


 If an application has lots of shuffle blocks, a resource leak (running out of 
 file descriptors) occurs.
 The following text shows the file descriptors of an Executor which has lots of 
 blocks:
 {code}
 ・・・
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9980 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/13/shuffle_0_340_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9981 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/37/shuffle_0_355_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9982 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/10/shuffle_0_370_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9983 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0c/shuffle_0_385_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9984 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0e/shuffle_0_390_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9985 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/1d/shuffle_0_405_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9986 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/36/shuffle_0_420_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9987 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/1b/shuffle_0_425_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9988 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0b/shuffle_0_430_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9989 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0d/shuffle_0_450_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:11 2014 999 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/28/shuffle_1_630_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9990 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/29/shuffle_0_465_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9991 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/14/shuffle_0_495_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9992 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/3c/shuffle_0_525_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9993 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2a/shuffle_0_530_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9994 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/05/shuffle_0_535_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9995 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/15/shuffle_0_540_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9996 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2c/shuffle_0_550_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9997 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/13/shuffle_0_560_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9998 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/12/shuffle_0_570_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:27 2014  - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2f/shuffle_0_580_0.data
 {code}
 This is caused by not closing InputStream generated by ManagedBuffer.
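 The usual fix pattern is simply to close the stream in a finally block once it has 
 been consumed; a sketch (the helper and its call sites are illustrative):
 {code}
 import java.io.InputStream

 // Sketch of the fix pattern: whoever obtains a stream (e.g. from
 // ManagedBuffer.createInputStream) must close it, even if consumption throws.
 def withStream[T](openStream: () => InputStream)(consume: InputStream => T): T = {
   val in = openStream()
   try {
     consume(in)
   } finally {
     in.close()  // releases the underlying file descriptor
   }
 }
 {code}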





[jira] [Commented] (SPARK-3546) InputStream of ManagedBuffer is not closed and causes running out of file descriptor

2014-11-26 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226051#comment-14226051
 ] 

Kousuke Saruta commented on SPARK-3546:
---

[~andrewor14] I don't think we need this for branch-1.1 because ManagedBuffer 
is not implemented in the branch.
I'll remove branch-1.1 from the target version.

 InputStream of ManagedBuffer is not closed and causes running out of file 
 descriptor
 

 Key: SPARK-3546
 URL: https://issues.apache.org/jira/browse/SPARK-3546
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Critical
 Fix For: 1.2.0


 If an application has lots of shuffle blocks, a resource leak (running out of 
 file descriptors) occurs.
 The following text shows the file descriptors of an Executor which has lots of 
 blocks:
 {code}
 ・・・
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9980 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/13/shuffle_0_340_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9981 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/37/shuffle_0_355_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9982 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/10/shuffle_0_370_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9983 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0c/shuffle_0_385_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9984 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0e/shuffle_0_390_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9985 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/1d/shuffle_0_405_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9986 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/36/shuffle_0_420_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9987 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/1b/shuffle_0_425_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9988 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0b/shuffle_0_430_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9989 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/0d/shuffle_0_450_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:11 2014 999 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/28/shuffle_1_630_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9990 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/29/shuffle_0_465_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9991 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/14/shuffle_0_495_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9992 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/3c/shuffle_0_525_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9993 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2a/shuffle_0_530_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9994 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/05/shuffle_0_535_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9995 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/15/shuffle_0_540_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9996 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2c/shuffle_0_550_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9997 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/13/shuffle_0_560_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:28 2014 9998 - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/12/shuffle_0_570_0.data
 lr-x-- 1 yarn yarn 64  9月 16 20:27 2014  - 
 /hadoop/yarn/local/usercache/kou/appcache/application_1410858801629_0012/spark-local-20140916200509-7444/2f/shuffle_0_580_0.data
 {code}
 This is caused by not closing InputStream generated by ManagedBuffer.

[jira] [Commented] (SPARK-4615) Cannot disconnect from spark-shell

2014-11-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226097#comment-14226097
 ] 

Sean Owen commented on SPARK-4615:
--

I can't reproduce this, although I'm using Java 7, Ubuntu, and HEAD. Same on OS 
X, Java 8.
Maybe you can kill -QUIT the Java process to see what threads are waiting on 
what?
Is any of your code starting a thread or adding a shutdown hook?

 Cannot disconnect from spark-shell
 --

 Key: SPARK-4615
 URL: https://issues.apache.org/jira/browse/SPARK-4615
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
 Environment: Ubuntu 12/14
Reporter: Thomas Omans
Priority: Minor

 When running spark-shell using the `v1.2.0-snapshot1` tag, using the 
 instructions at: http://spark.apache.org/docs/latest/building-with-maven.html
 When attempting to disconnect from a spark shell (C-c), the terminal locks and 
 does not interrupt the process.
 The spark-shell was built with:
 ./make-distribution.sh --tgz -Pyarn -Phive -Phadoop-2.3 
 -Dhadoop.version=2.3.0-cdh5.1.0 -DskipTests
 Using oracle jdk6 & maven 3.2






[jira] [Created] (SPARK-4622) Add some error information if using spark-sql in yarn-cluster mode

2014-11-26 Thread hzw (JIRA)
hzw created SPARK-4622:
--

 Summary: Add some error information if using spark-sql in 
yarn-cluster mode
 Key: SPARK-4622
 URL: https://issues.apache.org/jira/browse/SPARK-4622
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: hzw


If using spark-sql in yarn-cluster mode, print an error message just as the 
spark shell does in yarn-cluster mode.






[jira] [Created] (SPARK-4623) Add some error information if using spark-sql in yarn-cluster mode

2014-11-26 Thread hzw (JIRA)
hzw created SPARK-4623:
--

 Summary: Add some error information if using spark-sql in 
yarn-cluster mode
 Key: SPARK-4623
 URL: https://issues.apache.org/jira/browse/SPARK-4623
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: hzw


If using spark-sql in yarn-cluster mode, print an error message just as the 
spark shell does in yarn-cluster mode.






[jira] [Created] (SPARK-4624) Errors when reading/writing to S3 large object files

2014-11-26 Thread Kriton Tsintaris (JIRA)
Kriton Tsintaris created SPARK-4624:
---

 Summary: Errors when reading/writing to S3 large object files
 Key: SPARK-4624
 URL: https://issues.apache.org/jira/browse/SPARK-4624
 Project: Spark
  Issue Type: Bug
  Components: EC2, Input/Output, Mesos
Affects Versions: 1.1.0
 Environment: manually set up Mesos cluster in EC2 made of 30 c3.4xlarge 
Nodes
Reporter: Kriton Tsintaris
Priority: Critical


My cluster is not configured to use hdfs. Instead the local disk of each node 
is used.

I've got a number of huge RDD object files (each made of ~600 part files each 
of ~60 GB). They are updated extremely rarely.

An example of the model of the data stored in these RDDs is the following: 
(Long, Array[Long]). 

When I load them into my cluster, using val page_users = 
sc.objectFile[(Long,Array[Long])]("s3n://mybucket/path/myrdd.obj.rdd") or 
equivalent, sometimes data is missing (as if 1 or 2 of the part files were not 
successfully loaded).
What is more frustrating is that I get no errors indicating this has happened! 
Sometimes reading from S3 times out or hits errors, but eventually the 
auto-retries do succeed.

Furthermore, if I attempt to write an RDD back into S3 using 
myrdd.saveAsObjectFile("s3n://..."), the operation will again terminate before 
it has completed, without any warning or indication of error.
More specifically, the object file parts will be left under a _temporary folder 
and only a few of them will have been moved to the correct path in S3. This only 
happens when I am writing huge object files. If my object file is just a few GB, 
everything is fine.








[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution

2014-11-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226151#comment-14226151
 ] 

Sean Owen commented on SPARK-2192:
--

Oops, on further inspection I see that the file is not Movielens data, but 
merely in the same format. The comments do say this in MovieLensALS.scala. I'll 
cook up a PR to add the example data to the distro.

 Examples Data Not in Binary Distribution
 

 Key: SPARK-2192
 URL: https://issues.apache.org/jira/browse/SPARK-2192
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Pat McDonough

 The data used by examples is not packaged up with the binary distribution. 
 The data subdirectory of spark should make its way into the distribution 
 somewhere so the examples can use it.






[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226175#comment-14226175
 ] 

Apache Spark commented on SPARK-2192:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3480

 Examples Data Not in Binary Distribution
 

 Key: SPARK-2192
 URL: https://issues.apache.org/jira/browse/SPARK-2192
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Pat McDonough

 The data used by examples is not packaged up with the binary distribution. 
 The data subdirectory of spark should make its way into the distribution 
 somewhere so the examples can use it.






[jira] [Updated] (SPARK-4621) when sort- based shuffle, Cache recently finished shuffle index can reduce indexFile's io

2014-11-26 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-4621:

Priority: Minor  (was: Critical)

 when sort- based shuffle, Cache recently finished shuffle index can reduce 
 indexFile's io
 -

 Key: SPARK-4621
 URL: https://issues.apache.org/jira/browse/SPARK-4621
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Lianhui Wang
Priority: Minor

 In IndexShuffleBlockManager, we can use an LRU cache to store recently finished 
 shuffle indices, which can reduce index file I/O.






[jira] [Commented] (SPARK-4196) Streaming + checkpointing + saveAsNewAPIHadoopFiles = NotSerializableException for Hadoop Configuration

2014-11-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226192#comment-14226192
 ] 

Sean Owen commented on SPARK-4196:
--

Looks good. At least, this commit got me past this particular issue. Thanks!

 Streaming + checkpointing + saveAsNewAPIHadoopFiles = 
 NotSerializableException for Hadoop Configuration
 ---

 Key: SPARK-4196
 URL: https://issues.apache.org/jira/browse/SPARK-4196
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.1.0
Reporter: Sean Owen
Assignee: Tathagata Das
 Fix For: 1.2.0, 1.1.2


 I am reasonably sure there is some issue here in Streaming and that I'm not 
 missing something basic, but not 100%. I went ahead and posted it as a JIRA 
 to track, since it's come up a few times before without resolution, and right 
 now I can't get checkpointing to work at all.
 When Spark Streaming checkpointing is enabled, I see a 
 NotSerializableException thrown for a Hadoop Configuration object, and it 
 seems like it is not one from my user code.
 Before I post my particular instance see 
 http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408135046777-12202.p...@n3.nabble.com%3E
  for another occurrence.
 I was also on customer site last week debugging an identical issue with 
 checkpointing in a Scala-based program and they also could not enable 
 checkpointing without hitting exactly this error.
 The essence of my code is:
 {code}
 final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
 JavaStreamingContextFactory streamingContextFactory = new
 JavaStreamingContextFactory() {
   @Override
   public JavaStreamingContext create() {
 return new JavaStreamingContext(sparkContext, new
 Duration(batchDurationMS));
   }
 };
   streamingContext = JavaStreamingContext.getOrCreate(
   checkpointDirString, sparkContext.hadoopConfiguration(),
 streamingContextFactory, false);
   streamingContext.checkpoint(checkpointDirString);
 {code}
 It yields:
 {code}
 2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66
 org.apache.hadoop.conf.Configuration
 - field (class 
 org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9,
 name: conf$2, type: class org.apache.hadoop.conf.Configuration)
 - object (class
 org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9,
 function2)
 - field (class org.apache.spark.streaming.dstream.ForEachDStream,
 name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc,
 type: interface scala.Function2)
 - object (class org.apache.spark.streaming.dstream.ForEachDStream,
 org.apache.spark.streaming.dstream.ForEachDStream@cb8016a)
 ...
 {code}
 This looks like it's due to PairRDDFunctions, as this saveFunc seems
 to be  org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9
 :
 {code}
 def saveAsNewAPIHadoopFiles(
 prefix: String,
 suffix: String,
 keyClass: Class[_],
 valueClass: Class[_],
 outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
 conf: Configuration = new Configuration
   ) {
   val saveFunc = (rdd: RDD[(K, V)], time: Time) => {
 val file = rddToFileName(prefix, suffix, time)
 rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass,
 outputFormatClass, conf)
   }
   self.foreachRDD(saveFunc)
 }
 {code}
 Is that not a problem? But then I don't know how it would ever work in Spark. 
 Then again, I don't see why this is an issue only when checkpointing 
 is enabled.
 Long-shot, but I wonder if it is related to closure issues like 
 https://issues.apache.org/jira/browse/SPARK-1866
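 One workaround pattern sometimes used on the user side (not necessarily the right 
 fix inside the streaming code itself) is to carry the Configuration in a 
 serializable holder and unwrap it inside the closure; a sketch:
 {code}
 import org.apache.hadoop.conf.Configuration
 import org.apache.spark.SerializableWritable

 // Workaround sketch: Configuration is not java.io.Serializable, but it is a
 // Hadoop Writable, so Spark's SerializableWritable can carry it through closure
 // serialization; call .value inside the closure to get the Configuration back.
 val confHolder = new SerializableWritable(new Configuration())
 // inside a closure shipped to executors:  val conf = confHolder.value
 {code}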






[jira] [Created] (SPARK-4625) Support Sort By in both DSL & SimpleSQLParser

2014-11-26 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-4625:


 Summary: Support Sort By in both DSL & SimpleSQLParser
 Key: SPARK-4625
 URL: https://issues.apache.org/jira/browse/SPARK-4625
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao









[jira] [Commented] (SPARK-4625) Support Sort By in both DSL & SimpleSQLParser

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226280#comment-14226280
 ] 

Apache Spark commented on SPARK-4625:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/3481

 Support Sort By in both DSL & SimpleSQLParser
 ---

 Key: SPARK-4625
 URL: https://issues.apache.org/jira/browse/SPARK-4625
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao








[jira] [Resolved] (SPARK-4612) Configuration object gets created for every task even if no new file/jar is added

2014-11-26 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4612.
--
  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.2.0  (was: 1.3.0)

 Configuration object gets created for every task even if no new file/jar is 
 added
 --

 Key: SPARK-4612
 URL: https://issues.apache.org/jira/browse/SPARK-4612
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical
 Fix For: 1.2.0


 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L337
  creates a configuration object for every task that is launched, even if 
 there is no new dependent file/JAR to update. This is a heavy-weight creation 
 that should be avoided if there is no new file/JAR to update. 
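 The shape of the fix is presumably to create the Hadoop configuration only when 
 there is actually something new to fetch, roughly like this (names and signature 
 are illustrative, not the actual Executor code):
 {code}
 // Illustrative sketch only: skip the heavy Configuration creation entirely
 // when the task brings no new files or JARs.
 def updateDependencies(newFiles: Map[String, Long], newJars: Map[String, Long]): Unit = {
   if (newFiles.nonEmpty || newJars.nonEmpty) {
     val hadoopConf = new org.apache.hadoop.conf.Configuration()  // created only when needed
     // ... fetch the new files/JARs using hadoopConf ...
   }
 }
 {code}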






[jira] [Updated] (SPARK-4046) Incorrect Java example on site

2014-11-26 Thread Brennon York (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brennon York updated SPARK-4046:

Attachment: SPARK-4046.diff

Pulled down the site at the [Apache 
repo|https://svn.apache.org/repos/asf/spark/site], made the change, and 
attached the .diff file to resolve the issue.

 Incorrect Java example on site
 --

 Key: SPARK-4046
 URL: https://issues.apache.org/jira/browse/SPARK-4046
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Java API
Affects Versions: 1.1.0
 Environment: Web
Reporter: Ian Babrou
Priority: Minor
 Attachments: SPARK-4046.diff


 https://spark.apache.org/examples.html
 The word count example for Java here is incorrect. It should use mapToPair 
 instead of map. The correct example is here:
 https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java






[jira] [Resolved] (SPARK-4604) Make MatrixFactorizationModel constructor public

2014-11-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4604.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3473
[https://github.com/apache/spark/pull/3473]

 Make MatrixFactorizationModel constructor public
 

 Key: SPARK-4604
 URL: https://issues.apache.org/jira/browse/SPARK-4604
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.2.0


 Make MatrixFactorizationModel public and add note about partitioning.






[jira] [Commented] (SPARK-4614) Slight API changes in Matrix and Matrices

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226396#comment-14226396
 ] 

Apache Spark commented on SPARK-4614:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/3482

 Slight API changes in Matrix and Matrices
 -

 Key: SPARK-4614
 URL: https://issues.apache.org/jira/browse/SPARK-4614
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 Before we have a full picture of the operators we want to add, it might be 
 safer to hide `Matrix.transposeMultiply` in 1.2.0. Another update we want to 
 make is to `Matrix.randn` and `Matrix.rand`, both of which should take a 
 Random implementation. Otherwise, it is very likely to produce inconsistent 
 RDDs.
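 The change to the random factory methods might look roughly like this (a sketch of 
 the idea only, returning the column-major values rather than a Matrix to keep it 
 self-contained):
 {code}
 import java.util.Random

 // Sketch: let the caller supply the RNG so repeated evaluations (e.g. when an
 // RDD partition is recomputed) can be made reproducible by reseeding.
 def rand(numRows: Int, numCols: Int, rng: Random): Array[Double] =
   Array.fill(numRows * numCols)(rng.nextDouble())

 def randn(numRows: Int, numCols: Int, rng: Random): Array[Double] =
   Array.fill(numRows * numCols)(rng.nextGaussian())
 {code}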






[jira] [Created] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend

2014-11-26 Thread Victor Tso (JIRA)
Victor Tso created SPARK-4626:
-

 Summary: NoSuchElementException in CoarseGrainedSchedulerBackend
 Key: SPARK-4626
 URL: https://issues.apache.org/jira/browse/SPARK-4626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Victor Tso
Priority: Blocker


26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] 
OneForOneStrategy - key not found: 0
java.util.NoSuchElementException: key not found: 0
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

This came on the heels of a lot of lost executors with error messages like:
26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] 
TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated






[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend

2014-11-26 Thread Victor Tso (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226466#comment-14226466
 ] 

Victor Tso commented on SPARK-4626:
---

Apologies if Blocker is the wrong level. We're seeing this in production so it 
has us jittery.

 NoSuchElementException in CoarseGrainedSchedulerBackend
 ---

 Key: SPARK-4626
 URL: https://issues.apache.org/jira/browse/SPARK-4626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Victor Tso
Priority: Blocker

 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] 
 OneForOneStrategy - key not found: 0
 java.util.NoSuchElementException: key not found: 0
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This came on the heels of a lot of lost executors with error messages like:
 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] 
 TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated






[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend

2014-11-26 Thread Victor Tso (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Tso updated SPARK-4626:
--
Affects Version/s: (was: 1.1.0)
   1.0.2

 NoSuchElementException in CoarseGrainedSchedulerBackend
 ---

 Key: SPARK-4626
 URL: https://issues.apache.org/jira/browse/SPARK-4626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Victor Tso
Priority: Blocker

 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] 
 OneForOneStrategy - key not found: 0
 java.util.NoSuchElementException: key not found: 0
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This came on the heels of a lot of lost executors with error messages like:
 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] 
 TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend

2014-11-26 Thread Victor Tso (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Tso updated SPARK-4626:
--
Priority: Major  (was: Blocker)

 NoSuchElementException in CoarseGrainedSchedulerBackend
 ---

 Key: SPARK-4626
 URL: https://issues.apache.org/jira/browse/SPARK-4626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Victor Tso

 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] 
 OneForOneStrategy - key not found: 0
 java.util.NoSuchElementException: key not found: 0
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This came on the heels of a lot of lost executors with error messages like:
 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] 
 TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend

2014-11-26 Thread Victor Tso (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Tso updated SPARK-4626:
--
Comment: was deleted

(was: Apologies if Blocker is the wrong level. We're seeing this in production 
so it has us jittery.)

 NoSuchElementException in CoarseGrainedSchedulerBackend
 ---

 Key: SPARK-4626
 URL: https://issues.apache.org/jira/browse/SPARK-4626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Victor Tso

 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] 
 OneForOneStrategy - key not found: 0
 java.util.NoSuchElementException: key not found: 0
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This came on the heels of a lot of lost executors with error messages like:
 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] 
 TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend

2014-11-26 Thread Victor Tso (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226480#comment-14226480
 ] 

Victor Tso commented on SPARK-4626:
---

It looks like a race where the KillTask message comes in shortly after a 
RemoveExecutor message. If the executor has already been removed, KillTask 
should be a no-op:
  executorActor.get(executorId).foreach(_ ! KillTask(taskId, executorId, 
interruptThread))
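
To make the proposed no-op behavior concrete, here is a minimal, self-contained 
sketch of that defensive-lookup pattern in plain Scala. The object and the 
String/println stand-ins for the actor reference and the message send are 
illustrative only, not the real CoarseGrainedSchedulerBackend code:

{code:scala}
object KillTaskSketch {
  import scala.collection.mutable

  // Illustrative stand-in for the driver's executorId -> actor-ref map.
  private val executorActor = mutable.HashMap[String, String]()

  def handleKillTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit = {
    // HashMap.apply on a missing key throws NoSuchElementException;
    // get(...).foreach makes a KillTask for an already-removed executor a no-op.
    executorActor.get(executorId).foreach { actorRef =>
      println(s"KillTask($taskId, $executorId, $interruptThread) -> $actorRef")
    }
  }

  def main(args: Array[String]): Unit = {
    executorActor += ("exec-1" -> "actor-1")
    handleKillTask(1L, "exec-1", interruptThread = false)  // delivered
    handleKillTask(2L, "exec-31", interruptThread = false) // silently ignored
  }
}
{code}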

 NoSuchElementException in CoarseGrainedSchedulerBackend
 ---

 Key: SPARK-4626
 URL: https://issues.apache.org/jira/browse/SPARK-4626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Victor Tso

 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] 
 OneForOneStrategy - key not found: 0
 java.util.NoSuchElementException: key not found: 0
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This came on the heels of a lot of lost executors with error messages like:
 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] 
 TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226506#comment-14226506
 ] 

Marcelo Vanzin commented on SPARK-4584:
---

[~sowen] that was going to be my suggestion. For 1.2, I'll just remove the 
security manager and declare "use System.exit() at your own risk" - the 
behavior should then be the same as in 1.1. Post 1.2, we could add some new 
exception (e.g. {{SparkAppException}}) that users can throw if they want the 
runtime to exit with a specific error code. But even that I don't think is 
strictly necessary - just a nice-to-have.

BTW, we've tried to extend the security manager so that all operations except 
for {{checkExit}} are no-ops, but even that doesn't help.
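
For readers following along, here is a minimal sketch of the kind of security 
manager being described: every permission check is a no-op and only 
{{checkExit}} is intercepted. The exception class, object names, and wiring are 
illustrative, not Spark's actual implementation, and on very recent JDKs 
installing a security manager may need extra JVM flags:

{code:scala}
import java.security.Permission

// Thrown instead of letting the JVM exit; an illustrative class, not a real Spark one.
class IllustrativeExitException(val status: Int)
  extends SecurityException(s"System.exit($status) intercepted")

// java.lang.SecurityManager, not Spark's org.apache.spark.SecurityManager.
class ExitOnlySecurityManager extends java.lang.SecurityManager {
  // Permission checks become no-ops so ordinary operations are never rejected...
  override def checkPermission(perm: Permission): Unit = ()
  override def checkPermission(perm: Permission, context: AnyRef): Unit = ()

  // ...but System.exit()/Runtime.exit() are turned into an exception.
  override def checkExit(status: Int): Unit = throw new IllustrativeExitException(status)
}

object ExitTrapDemo {
  def main(args: Array[String]): Unit = {
    System.setSecurityManager(new ExitOnlySecurityManager)
    try System.exit(42)
    catch { case e: IllustrativeExitException => println(s"trapped exit code ${e.status}") }
    finally System.setSecurityManager(null)
  }
}
{code}

The comment above suggests that even this minimal form was not enough to avoid 
the regression, so the sketch is only meant to pin down what "all operations 
except checkExit are no-ops" looks like.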

 2x Performance regression for Spark-on-YARN
 ---

 Key: SPARK-4584
 URL: https://issues.apache.org/jira/browse/SPARK-4584
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Nishkam Ravi
Assignee: Marcelo Vanzin
Priority: Blocker

 Significant performance regression observed for Spark-on-YARN (up to 2x) after 
 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 
  from Oct 7th. Problem can be reproduced with JavaWordCount against a large 
 enough input dataset in YARN cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend

2014-11-26 Thread Victor Tso (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226513#comment-14226513
 ] 

Victor Tso commented on SPARK-4626:
---

https://github.com/apache/spark/pull/3483

 NoSuchElementException in CoarseGrainedSchedulerBackend
 ---

 Key: SPARK-4626
 URL: https://issues.apache.org/jira/browse/SPARK-4626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Victor Tso

 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] 
 OneForOneStrategy - key not found: 0
 java.util.NoSuchElementException: key not found: 0
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This came on the heels of a lot of lost executors with error messages like:
 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] 
 TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226514#comment-14226514
 ] 

Apache Spark commented on SPARK-4626:
-

User 'roxchkplusony' has created a pull request for this issue:
https://github.com/apache/spark/pull/3483

 NoSuchElementException in CoarseGrainedSchedulerBackend
 ---

 Key: SPARK-4626
 URL: https://issues.apache.org/jira/browse/SPARK-4626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Victor Tso

 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] 
 OneForOneStrategy - key not found: 0
 java.util.NoSuchElementException: key not found: 0
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This came on the heels of a lot of lost executors with error messages like:
 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] 
 TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib

2014-11-26 Thread Jacky Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226562#comment-14226562
 ] 

Jacky Li commented on SPARK-4001:
-

Thanks for your suggestion, Daniel. 
Here is the current status.
1. Currently I have implemented Apriori and FP-Growth by referring to YAFIM 
(http://pasa-bigdata.nju.edu.cn/people/ronggu/pub/YAFIM_ParLearning.pdf) and 
PFP (http://dl.acm.org/citation.cfm?id=1454027). 
For Apriori there are currently two versions implemented, one using a broadcast 
variable and another using a cartesian join of two RDDs. I am testing them 
on the mushroom and webdoc open datasets (http://fimi.ua.ac.be/data/) to check 
their performance before deciding which one to contribute to MLlib.
I have updated the code in the PR (https://github.com/apache/spark/pull/2847); 
you are welcome to check it and try it in your use case.
2. For the input part, the Apriori algorithm currently takes 
{{RDD\[Array\[String\]\]}} as the input dataset, without a basket_id 
or user_id. I am not sure whether it can easily fit into your use case. Can you 
give more detail on how you want to use it in collaborative filtering contexts? 
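
As a small illustration of that input shape, here is a sketch of turning 
(basket_id, item) pairs into the {{RDD\[Array\[String\]\]}} described above; 
the basket id is only used for grouping and then dropped. Item names, the local 
master setting, and the object name are made up for the example:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions on older Spark versions
import org.apache.spark.rdd.RDD

object TransactionInputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("apriori-input").setMaster("local[2]"))

    // Raw (basketId, item) events, e.g. parsed from a transaction log (made-up data).
    val events: RDD[(Long, String)] = sc.parallelize(Seq(
      (1L, "milk"), (1L, "bread"), (2L, "milk"), (2L, "beer"), (2L, "bread")))

    // Collapse each basket into one Array[String]; the basket id is only used
    // for grouping and then dropped, matching the input format described above.
    val transactions: RDD[Array[String]] = events.groupByKey().values.map(_.toArray)

    transactions.collect().foreach(t => println(t.mkString(",")))
    sc.stop()
  }
}
{code}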


 Add Apriori algorithm to Spark MLlib
 

 Key: SPARK-4001
 URL: https://issues.apache.org/jira/browse/SPARK-4001
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Jacky Li
Assignee: Jacky Li

 Apriori is the classic algorithm for frequent itemset mining in a 
 transactional data set.  It will be useful if the Apriori algorithm is added to 
 MLlib in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4584) 2x Performance regression for Spark-on-YARN

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226637#comment-14226637
 ] 

Apache Spark commented on SPARK-4584:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/3484

 2x Performance regression for Spark-on-YARN
 ---

 Key: SPARK-4584
 URL: https://issues.apache.org/jira/browse/SPARK-4584
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Nishkam Ravi
Assignee: Marcelo Vanzin
Priority: Blocker

 Significant performance regression observed for Spark-on-YARN (up to 2x) after 
 1.2 rebase. The offending commit is: 70e824f750aa8ed446eec104ba158b0503ba58a9 
  from Oct 7th. Problem can be reproduced with JavaWordCount against a large 
 enough input dataset in YARN cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4626) NoSuchElementException in CoarseGrainedSchedulerBackend

2014-11-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4626:
-
Assignee: Victor Tso

 NoSuchElementException in CoarseGrainedSchedulerBackend
 ---

 Key: SPARK-4626
 URL: https://issues.apache.org/jira/browse/SPARK-4626
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Victor Tso
Assignee: Victor Tso

 26 Nov 2014 06:38:21,330 ERROR [spark-akka.actor.default-dispatcher-22] 
 OneForOneStrategy - key not found: 0
 java.util.NoSuchElementException: key not found: 0
 at scala.collection.MapLike$class.default(MapLike.scala:228)
 at scala.collection.AbstractMap.default(Map.scala:58)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:106)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 This came on the heels of a lot of lost executors with error messages like:
 26 Nov 2014 06:38:20,330 ERROR [spark-akka.actor.default-dispatcher-15] 
 TaskSchedulerImpl - Lost executor 31 on xxx: remote Akka client disassociated



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4627) Too many TripletFields

2014-11-26 Thread Joseph E. Gonzalez (JIRA)
Joseph E. Gonzalez created SPARK-4627:
-

 Summary: Too many TripletFields
 Key: SPARK-4627
 URL: https://issues.apache.org/jira/browse/SPARK-4627
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Joseph E. Gonzalez
Priority: Trivial


The `TripletFields` class defines a set of constants for all possible 
configurations of the triplet fields.  However, many are not useful and, as a 
result, the API is slightly confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4627) Too many TripletFields

2014-11-26 Thread Joseph E. Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph E. Gonzalez closed SPARK-4627.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Fixed in PR #3472.

 Too many TripletFields
 --

 Key: SPARK-4627
 URL: https://issues.apache.org/jira/browse/SPARK-4627
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Joseph E. Gonzalez
Priority: Trivial
 Fix For: 1.2.0


 The `TripletFields` class defines a set of constants for all possible 
 configurations of the triplet fields.  However, many are not useful and, as a 
 result, the API is slightly confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4627) Too many TripletFields

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226652#comment-14226652
 ] 

Apache Spark commented on SPARK-4627:
-

User 'jegonzal' has created a pull request for this issue:
https://github.com/apache/spark/pull/3472

 Too many TripletFields
 --

 Key: SPARK-4627
 URL: https://issues.apache.org/jira/browse/SPARK-4627
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Joseph E. Gonzalez
Priority: Trivial
 Fix For: 1.2.0


 The `TripletFields` class defines a set of constants for all possible 
 configurations of the triplet fields.  However, many are not useful and, as a 
 result, the API is slightly confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4628) Put all external projects behind a build flag

2014-11-26 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-4628:
--

 Summary: Put all external projects behind a build flag
 Key: SPARK-4628
 URL: https://issues.apache.org/jira/browse/SPARK-4628
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Wendell
Priority: Blocker


This is something we talked about doing for convenience, but I'm escalating 
this based on realizing today that some of our external projects depend on code 
that is not in Maven Central. I.e., if one of these dependencies is taken down 
(as happened recently with mqtt), all Spark builds will fail.

The proposal here is simple: have a profile, -Pexternal-projects, that enables 
these. This can follow the exact pattern of -Pkinesis-asl, which was disabled by 
default due to a license issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4628) Put all external projects behind a build flag

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226691#comment-14226691
 ] 

Apache Spark commented on SPARK-4628:
-

User 'pwendell' has created a pull request for this issue:
https://github.com/apache/spark/pull/3485

 Put all external projects behind a build flag
 -

 Key: SPARK-4628
 URL: https://issues.apache.org/jira/browse/SPARK-4628
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Wendell
Priority: Blocker

 This is something we talked about doing for convenience, but I'm escalating 
 this based on realizing today that some of our external projects depend on 
 code that is not in maven central. I.e. if one of these dependencies is taken 
 down (as happened recently with mqtt), all Spark builds will fail.
 The proposal here is simple, have a profile -Pexternal-projects that enables 
 these. This can follow the exact pattern of -Pkinesis-asl which was disabled 
 by default due to a license issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4614) Slight API changes in Matrix and Matrices

2014-11-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4614.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3482
[https://github.com/apache/spark/pull/3482]

 Slight API changes in Matrix and Matrices
 -

 Key: SPARK-4614
 URL: https://issues.apache.org/jira/browse/SPARK-4614
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.2.0


 Before we have a full picture of the operators we want to add, it might be 
 safer to hide `Matrix.transposeMultiply` in 1.2.0. Another change we want to 
 make is to `Matrix.randn` and `Matrix.rand`, both of which should take a 
 Random implementation. Otherwise, they are very likely to produce inconsistent 
 RDDs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2450) Provide link to YARN executor logs on UI

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226706#comment-14226706
 ] 

Apache Spark commented on SPARK-2450:
-

User 'ksakellis' has created a pull request for this issue:
https://github.com/apache/spark/pull/3486

 Provide link to YARN executor logs on UI
 

 Key: SPARK-2450
 URL: https://issues.apache.org/jira/browse/SPARK-2450
 Project: Spark
  Issue Type: Improvement
  Components: Web UI, YARN
Affects Versions: 1.0.0
Reporter: Bill Havanki
Assignee: Kostas Sakellis
Priority: Minor

 When running under YARN, provide links to executor logs from the web UI to 
 avoid the need to drill down through the YARN UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4376) Put external modules behind build profiles

2014-11-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-4376.
---
Resolution: Duplicate

 Put external modules behind build profiles
 --

 Key: SPARK-4376
 URL: https://issues.apache.org/jira/browse/SPARK-4376
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell
Assignee: Sandy Ryza
Priority: Blocker

 Several people have asked me whether, to speed up the build, we can put the 
 external projects behind build flags similar to the kinesis-asl module. Since 
 these aren't in the assembly, there isn't a great reason to build them by 
 default. We can just modify our release script to build them, and also build 
 them when we run tests.
 This doesn't technically block Spark 1.2, but it is going to be looped into a 
 separate fix that does block Spark 1.2, so I'm upgrading it to blocker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4628) Put all external projects behind a build flag

2014-11-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226722#comment-14226722
 ] 

Sandy Ryza commented on SPARK-4628:
---

This looks like a duplicate of SPARK-4376.  Resolving that one as the duplicate 
because it has no PR.

 Put all external projects behind a build flag
 -

 Key: SPARK-4628
 URL: https://issues.apache.org/jira/browse/SPARK-4628
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Wendell
Priority: Blocker

 This is something we talked about doing for convenience, but I'm escalating 
 this based on realizing today that some of our external projects depend on 
 code that is not in maven central. I.e. if one of these dependencies is taken 
 down (as happened recently with mqtt), all Spark builds will fail.
 The proposal here is simple, have a profile -Pexternal-projects that enables 
 these. This can follow the exact pattern of -Pkinesis-asl which was disabled 
 by default due to a license issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4001) Add Apriori algorithm to Spark MLlib

2014-11-26 Thread Daniel Erenrich (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226793#comment-14226793
 ] 

Daniel Erenrich commented on SPARK-4001:


First, a minor point: I'd suggest making it take an RDD of arrays of ints. 
Holding tons of strings around seems wasteful. The user can maintain a map from 
strings to ints. Or else, can we just make the array contain comparables?

So the main issue is whether we are trying to do frequent itemset mining or 
association rule construction. I argue the latter is the more common operation, 
and while the second requires the first, there's no great extra cost to doing 
both. I'm actually unfamiliar with what can be done with just the former and 
not the latter.

If you equate baskets with users, the connection between association rules and 
collaborative filtering becomes very clear.  I want to feed in someone's, say, 
movie viewing history and get recommendations of the form "you watched X and Y, 
so you'll really like Z" (where X, Y, Z is a frequent itemset).

The API could be made to match: give it all the things each user bought, in the 
format I described above, and the prediction mode is "here are all of the things 
this person bought; please apply as many rules as you can." If a user does care 
about the frequent itemsets, those would additionally be stored inside the model.

The alternative here is to make a set of frequent itemset miners and then make 
an association rule learner that takes their output. The only downside is that 
this suffers some performance loss (requiring an additional pass). I'll gladly 
write this version if we decide that's the way we should go.
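
The first point above (ints instead of strings) can be handled outside the 
miner with a small dictionary-encoding step. A rough sketch, with made-up item 
names and no handling of unseen items:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ItemDictionarySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("item-dictionary").setMaster("local[2]"))

    // Made-up transactions in the RDD[Array[String]] shape discussed above.
    val transactions: RDD[Array[String]] = sc.parallelize(Seq(
      Array("milk", "bread"), Array("milk", "beer", "bread")))

    // Build a String -> Int dictionary once, broadcast it, and mine over ints.
    val dict: Map[String, Int] =
      transactions.flatMap(t => t).distinct().collect().zipWithIndex.toMap
    val bcDict = sc.broadcast(dict)

    val encoded: RDD[Array[Int]] = transactions.map(_.map(bcDict.value))

    encoded.collect().foreach(t => println(t.mkString(",")))
    sc.stop()
  }
}
{code}

The inverse map can be broadcast the same way to translate frequent itemsets or 
rules back into item names at the end.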

 Add Apriori algorithm to Spark MLlib
 

 Key: SPARK-4001
 URL: https://issues.apache.org/jira/browse/SPARK-4001
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Jacky Li
Assignee: Jacky Li

 Apriori is the classic algorithm for frequent item set mining in a 
 transactional data set.  It will be useful if Apriori algorithm is added to 
 MLLib in Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4629) Spark SQL uses Hadoop Configuration in a thread-unsafe manner when writing Parquet files

2014-11-26 Thread Michael Allman (JIRA)
Michael Allman created SPARK-4629:
-

 Summary: Spark SQL uses Hadoop Configuration in a thread-unsafe 
manner when writing Parquet files
 Key: SPARK-4629
 URL: https://issues.apache.org/jira/browse/SPARK-4629
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Michael Allman


The method {{ParquetRelation.createEmpty}} mutates its given Hadoop 
{{Configuration}} instance to set the Parquet writer compression level (cf. 
https://github.com/apache/spark/blob/v1.1.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala#L149).
 This can lead to a {{ConcurrentModificationException}} when running concurrent 
jobs that share a single {{SparkContext}} and involve saving Parquet files.

Our fix was to simply remove the line in question and set the compression 
level in the Hadoop configuration before starting our jobs.
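
A minimal sketch of that workaround, assuming the standard parquet-mr 
configuration key ({{parquet.compression}}); the codec choice and names are 
illustrative, and the key should be double-checked against the Parquet version 
in use:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object ParquetCompressionUpFront {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-compression").setMaster("local[2]"))

    // Set the codec once, on the driver, before any jobs start, so the shared
    // Configuration is read concurrently but never mutated by job threads.
    // "parquet.compression" is assumed to be the key read by parquet-mr's
    // ParquetOutputFormat.
    sc.hadoopConfiguration.set("parquet.compression", "GZIP")

    // ... create the SQLContext and run the concurrent Parquet-writing jobs here ...

    sc.stop()
  }
}
{code}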



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4583) GradientBoostedTrees error logging should use loss being minimized

2014-11-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4583.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3474
[https://github.com/apache/spark/pull/3474]

 GradientBoostedTrees error logging should use loss being minimized
 --

 Key: SPARK-4583
 URL: https://issues.apache.org/jira/browse/SPARK-4583
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor
 Fix For: 1.2.0


 Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
 * the gradient (and therefore loss) does not match that used by Friedman 
 (1999)
 * the error computation uses 0/1 accuracy, not log loss



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2192) Examples Data Not in Binary Distribution

2014-11-26 Thread Pat McDonough (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226868#comment-14226868
 ] 

Pat McDonough commented on SPARK-2192:
--

Thanks [~srowen]. And yes, Xiangrui confirmed he just generated the data.

 Examples Data Not in Binary Distribution
 

 Key: SPARK-2192
 URL: https://issues.apache.org/jira/browse/SPARK-2192
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Pat McDonough

 The data used by the examples is not packaged up with the binary distribution. 
 The data subdirectory of Spark should make its way into the distribution 
 somewhere so the examples can use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4630) Dynamically determine optimal number of partitions

2014-11-26 Thread Kostas Sakellis (JIRA)
Kostas Sakellis created SPARK-4630:
--

 Summary: Dynamically determine optimal number of partitions
 Key: SPARK-4630
 URL: https://issues.apache.org/jira/browse/SPARK-4630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kostas Sakellis


Partition sizes play a big part in how fast stages execute during a Spark job. 
There is a direct relationship between the size of partitions and the number of 
tasks - larger partitions, fewer tasks. For good performance, Spark has a 
sweet spot for how large the partitions executed by a task should be. If 
partitions are too small, the user pays a disproportionate cost in 
scheduling overhead. If partitions are too large, task execution slows 
down due to GC pressure and spilling to disk.

To increase job performance, users often hand-optimize the number (size) of 
partitions that the next stage gets. Factors that come into play are:
* incoming partition sizes from the previous stage
* the number of available executors
* available memory per executor (taking into account spark.shuffle.memoryFraction)

Spark has access to this data, so it should be able to do the partition sizing 
for the user automatically. This feature can be turned on/off with a 
configuration option. 

To make this happen, we propose modifying the DAGScheduler to take into account 
partition sizes upon stage completion. Before scheduling the next stage, the 
scheduler can examine the sizes of the partitions and determine the appropriate 
number of tasks to create. Since this change requires non-trivial modifications 
to the DAGScheduler, a detailed design doc will be attached before proceeding 
with the work.
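
As a toy illustration of the kind of heuristic being proposed, here is a sketch 
that only uses the upstream map-output sizes and a target partition size. It 
deliberately ignores executor count and per-executor memory, which the design 
doc would need to fold in, and all names and numbers are made up:

{code:scala}
object PartitionSizingSketch {
  /**
   * Pick a partition count so that each task processes roughly
   * `targetPartitionBytes` of the upstream stage's output.
   */
  def choosePartitionCount(incomingPartitionBytes: Seq[Long],
                           targetPartitionBytes: Long,
                           minPartitions: Int = 1): Int = {
    val totalBytes = incomingPartitionBytes.sum
    val byTarget = math.ceil(totalBytes.toDouble / targetPartitionBytes).toInt
    math.max(minPartitions, byTarget)
  }

  def main(args: Array[String]): Unit = {
    // Map-output sizes the scheduler would already know at a stage boundary (illustrative).
    val mapOutputSizes = Seq(512L, 768L, 256L).map(_ * 1024 * 1024) // bytes
    val numReduceTasks = choosePartitionCount(mapOutputSizes, targetPartitionBytes = 128L * 1024 * 1024)
    println(s"use $numReduceTasks reduce partitions") // -> 12
  }
}
{code}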



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4630) Dynamically determine optimal number of partitions

2014-11-26 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4630:
--
Assignee: Kostas Sakellis

 Dynamically determine optimal number of partitions
 --

 Key: SPARK-4630
 URL: https://issues.apache.org/jira/browse/SPARK-4630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis

 Partition sizes play a big part in how fast stages execute during a Spark 
 job. There is a direct relationship between the size of partitions to the 
 number of tasks - larger partitions, fewer tasks. For better performance, 
 Spark has a sweet spot for how large partitions should be that get executed 
 by a task. If partitions are too small, then the user pays a disproportionate 
 cost in scheduling overhead. If the partitions are too large, then task 
 execution slows down due to gc pressure and spilling to disk.
 To increase performance of jobs, users often hand optimize the number(size) 
 of partitions that the next stage gets. Factors that come into play are:
 Incoming partition sizes from previous stage
 number of available executors
 available memory per executor (taking into account 
 spark.shuffle.memoryFraction)
 Spark has access to this data and so should be able to automatically do the 
 partition sizing for the user. This feature can be turned off/on with a 
 configuration option. 
 To make this happen, we propose modifying the DAGScheduler to take into 
 account partition sizes upon stage completion. Before scheduling the next 
 stage, the scheduler can examine the sizes of the partitions and determine 
 the appropriate number of tasks to create. Since this change requires 
 non-trivial modifications to the DAGScheduler, a detailed design doc will be 
 attached before proceeding with the work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958
 ] 

Aaron Staple commented on SPARK-1503:
-

[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning. So, wary of these sorts of comments, I wrote up the proposal 
with the intention of duplicating TFOCS as closely as possible to start with, 
with the possibility of making changes from there based on performance. In 
addition, I’d interpreted an earlier comment in this ticket as suggesting that 
line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking. I’ve also noticed some 
references online to distributed implementations of accelerated methods. It may 
be informative to learn more about them - if you’ve heard of any particularly 
good distributed optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.
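
For reference, a tiny single-machine sketch of the constant-step-size 
accelerated update under discussion (the standard momentum form with 
coefficient (k-1)/(k+2), not the TFOCS AT variant). In a distributed setting, 
the gradient evaluation marked below would be the single aggregation per 
iteration; the objective, constants, and names are purely illustrative:

{code:scala}
object NesterovSketch {
  // Minimize f(x) = 0.5 * (a*x - b)^2 in one dimension, purely to show the
  // shape of the accelerated update with a constant step size (no line search).
  def main(args: Array[String]): Unit = {
    val (a, b) = (2.0, 6.0)            // f is minimized at x = b / a = 3
    def grad(x: Double): Double = a * (a * x - b)

    val lipschitz = a * a              // gradient is L-Lipschitz with L = a^2
    val step = 1.0 / lipschitz         // constant step size, no backtracking

    var x = 0.0                        // current iterate
    var xPrev = 0.0
    for (k <- 1 to 50) {
      // Momentum/extrapolation point, then a single gradient step from it.
      val y = x + ((k - 1).toDouble / (k + 2)) * (x - xPrev)
      val xNext = y - step * grad(y)   // in Spark, this gradient would be the
                                       // one distributed aggregation per iteration
      xPrev = x
      x = xNext
    }
    println(s"x after 50 iterations: $x") // close to 3.0
  }
}
{code}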

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2014-11-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226969#comment-14226969
 ] 

Sandy Ryza commented on SPARK-4452:
---

Thinking about the current change a little more, an issue is that it will spill 
all the in-memory data to disk in situations where this is probably overkill.  
E.g. consider the typical situation of shuffle data slightly exceeding memory.  
We end up spilling the entire data structure if a downstream data structure 
needs even a small amount of memory.

I think that your proposed change 2 is probably worthwhile.
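
To make "change 2" and the concern above concrete, here is a rough sketch of a 
manager that can evict other memory consumers when a new request does not fit. 
The interfaces are invented for illustration (not the real ShuffleMemoryManager 
or spillable classes), and spilling a whole victim at a time is exactly the 
coarse, sometimes-overkill behavior discussed above:

{code:scala}
import scala.collection.mutable

// Invented interface for illustration; not the real spillable collections.
trait SpillableSketch {
  def forceSpill(): Unit   // write current in-memory contents to disk and release the memory
}

class MemoryManagerSketch(totalBytes: Long) {
  private val usage = mutable.Map[SpillableSketch, Long]()

  /** Try to grant `bytes` to `requester`, evicting other holders ("change 2") if needed. */
  def acquire(requester: SpillableSketch, bytes: Long): Boolean = synchronized {
    var free = totalBytes - usage.values.sum
    // Spill the largest *other* consumer until the request fits or nothing is left to spill.
    while (free < bytes && usage.exists { case (s, held) => (s ne requester) && held > 0 }) {
      val (victim, held) = usage.filter(_._1 ne requester).maxBy(_._2)
      victim.forceSpill()               // manager-initiated, not self-initiated, spill
      usage(victim) = 0L
      free += held
    }
    if (free >= bytes) { usage(requester) = usage.getOrElse(requester, 0L) + bytes; true }
    else false
  }
}

object MemoryManagerDemo {
  def main(args: Array[String]): Unit = {
    val mgr = new MemoryManagerSketch(totalBytes = 100L)
    val map    = new SpillableSketch { def forceSpill(): Unit = println("map spilled") }
    val sorter = new SpillableSketch { def forceSpill(): Unit = println("sorter spilled") }
    println(mgr.acquire(map, 90L))     // true: plenty of room
    println(mgr.acquire(sorter, 50L))  // true, but only after the map is forced to spill
  }
}
{code}

A refinement in line with the comment above would be to ask the victim to 
release only the shortfall (a partial spill) instead of its entire contents.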

 Shuffle data structures can starve others on the same thread for memory 
 

 Key: SPARK-4452
 URL: https://issues.apache.org/jira/browse/SPARK-4452
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Tianshuo Deng
Assignee: Tianshuo Deng
Priority: Critical

 When an Aggregator is used with ExternalSorter in a task, Spark will create 
 many small files, which can cause a "too many open files" error during merging.
 Currently, ShuffleMemoryManager does not work well when there are 2 spillable 
 objects in a thread, which in this case are ExternalSorter and 
 ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
 of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
 RDD. It may ask for as much memory as it can, which is 
 totalMem/numberOfThreads. Later on, when ExternalSorter is created in the same 
 thread, the ShuffleMemoryManager may refuse to allocate more memory to it, 
 since the memory has already been given to the previously requesting 
 object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling 
 small files (due to the lack of memory).
 I'm currently working on a PR to address these two issues. It will include 
 following changes:
 1. The ShuffleMemoryManager should not only track the memory usage for each 
 thread, but also the object who holds the memory
 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
 spillable object. In this way, if a new object in a thread is requesting 
 memory, the old occupant could be evicted/spilled. Previously the spillable 
 objects trigger spilling by themselves. So one may not trigger spilling even 
 if another object in the same thread needs more memory. After this change The 
 ShuffleMemoryManager could trigger the spilling of an object if it needs to.
 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
 ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled 
 after the iterator is returned. This should be changed so that even after the 
 iterator is returned, the ShuffleMemoryManager can still spill it.
 Currently, I have a working branch in progress: 
 https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made 
 change 3 and have a prototype of change 1 and 2 to evict spillable from 
 memory manager, still in progress. I will send a PR when it's done.
 Any feedback or thoughts on this change are highly appreciated!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:09 PM:


[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning. So, wary of these sorts of comments, I wrote up the proposal 
with the intention of duplicating TFOCS as closely as possible to start with, 
with the possibility of making changes from there based on performance. In 
addition, I’d interpreted an earlier comment in this ticket as suggesting that 
line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you’ve heard of any particularly good distributed 
optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.


was (Author: staple):
[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning. So, wary of these sorts of comments, I wrote up the proposal 
with the intention of duplicating TFOCS as closely as possible to start with, 
with the possibility of making changes from there based on performance. In 
addition, I’d interpreted an earlier comment in this ticket as suggesting that 
line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking. I’ve also noticed some 
references online to distributed implementations of accelerated methods. It may 
be informative to learn more about them - if you’ve heard of any particularly 
good distributed optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a 

[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:10 PM:


[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning applications. So, wary of these sorts of comments, I wrote up 
the proposal with the intention of duplicating TFOCS as closely as possible to 
start with, with the possibility of making changes from there based on 
performance. In addition, I’d interpreted an earlier comment in this ticket as 
suggesting that line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you’ve heard of any particularly good distributed 
optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.


was (Author: staple):
[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning. So, wary of these sorts of comments, I wrote up the proposal 
with the intention of duplicating TFOCS as closely as possible to start with, 
with the possibility of making changes from there based on performance. In 
addition, I’d interpreted an earlier comment in this ticket as suggesting that 
line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you’ve heard of any particularly good distributed 
optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron 

[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226958#comment-14226958
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/26/14 11:14 PM:


[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning applications. So, wary of these sorts of comments, I wrote up 
the proposal with the intention of duplicating TFOCS as closely as possible to 
start with, with the possibility of making changes from there based on 
performance. In addition, I’d interpreted an earlier comment in this ticket as 
suggesting that line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you happen to have heard of any particularly good 
distributed optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.


was (Author: staple):
[~rezazadeh] Thanks for your feedback.

Your point about the communication cost of backtracking is well taken. Just to 
explain where I was coming from with the design proposal: As I was looking to 
come up to speed on accelerated gradient descent, I came across some scattered 
comments online suggesting that acceleration was difficult to implement well, 
was finicky, etc - especially when compared with standard gradient descent for 
machine learning applications. So, wary of these sorts of comments, I wrote up 
the proposal with the intention of duplicating TFOCS as closely as possible to 
start with, with the possibility of making changes from there based on 
performance. In addition, I’d interpreted an earlier comment in this ticket as 
suggesting that line search be implemented in the same manner as in TFOCS.

I am happy to implement a constant step size first, though. It may also be 
informative to run some performance tests in spark both with and without 
backtracking. One (basic, not conclusive) data point I have now is that, if I 
run TFOCS’ test_LASSO example it triggers 97 iterations of the outer AT loop 
and 106 iterations of the inner backtracking loop. For this one example, the 
backtracking iteration overhead is only about 10%. Though keep in mind that in 
spark if we removed backtracking entirely it would mean only one distributed 
aggregation per iteration rather than two - so a huge improvement in 
communication cost assuming there is still a good convergence rate.

Incidentally, are there any specific learning benchmarks for spark that you 
would recommend?

I’ll do a bit of research to identify the best ways to manage the lipschitz 
estimate / step size in the absence of backtracking (for our objective 
functions in particular). I’ve also noticed some references online to 
distributed implementations of accelerated methods. It may be informative to 
learn more about them - if you’ve heard of any particularly good distributed 
optimizers using acceleration, please let me know.

Thanks,
Aaron

PS Yes, I’ll make sure to follow the lbfgs example so the accelerated 
implementation can be easily substituted.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
   

[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-26 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227023#comment-14227023
 ] 

Reza Zadeh commented on SPARK-1503:
---

Thanks Aaron. From an implementation perspective, it's probably easier to 
implement a constant step size first. From there you can see if there is any 
finicky behavior and compare to the unaccelerated proximal gradient already in 
Spark. If it works well enough, we should commit the first PR without 
backtracking and then experiment with backtracking afterwards; otherwise, if you 
see strange behavior, you can decide whether backtracking would solve it. 
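
A minimal sketch, reusing the gradient() helper from the sketch above and assuming a fixed step size (for example 1/L for a known Lipschitz constant L), of the constant-step accelerated update suggested here. The momentum schedule (k-1)/(k+2) is the standard Nesterov choice; nothing below is the final MLlib interface.

{code:scala}
import org.apache.spark.rdd.RDD

def accelerated(data: RDD[(Double, Array[Double])],
                w0: Array[Double], stepSize: Double, numIter: Int): Array[Double] = {
  var w = w0.clone()
  var wPrev = w0.clone()
  for (k <- 1 to numIter) {
    val m = (k - 1).toDouble / (k + 2)                                // momentum weight
    val v = w.zip(wPrev).map { case (wi, pi) => wi + m * (wi - pi) }  // extrapolation point
    val g = gradient(data, v)                                         // one aggregation per iteration
    wPrev = w
    w = v.zip(g).map { case (vi, gi) => vi - stepSize * gi }          // constant-step descent from v
  }
  w
}
{code}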

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4046) Incorrect Java example on site

2014-11-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4046.

Resolution: Fixed
  Assignee: Reynold Xin

 Incorrect Java example on site
 --

 Key: SPARK-4046
 URL: https://issues.apache.org/jira/browse/SPARK-4046
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Java API
Affects Versions: 1.1.0
 Environment: Web
Reporter: Ian Babrou
Assignee: Reynold Xin
Priority: Minor
 Attachments: SPARK-4046.diff


 https://spark.apache.org/examples.html
 Here word count example for java is incorrect. It should be mapToPair instead 
 of map. Correct example is here:
 https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4046) Incorrect Java example on site

2014-11-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227053#comment-14227053
 ] 

Reynold Xin commented on SPARK-4046:


Thanks - I pushed the change and updated the website.

 Incorrect Java example on site
 --

 Key: SPARK-4046
 URL: https://issues.apache.org/jira/browse/SPARK-4046
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Java API
Affects Versions: 1.1.0
 Environment: Web
Reporter: Ian Babrou
Assignee: Reynold Xin
Priority: Minor
 Attachments: SPARK-4046.diff


 https://spark.apache.org/examples.html
 Here word count example for java is incorrect. It should be mapToPair instead 
 of map. Correct example is here:
 https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4631) Add real unit test for MQTT

2014-11-26 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-4631:


 Summary: Add real unit test for MQTT 
 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical


A real unit test that actually transfers data to ensure that the MQTTUtil is 
functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3628.
--
  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.1.2  (was: 0.9.3, 1.0.3, 1.1.2, 1.2.1)

 Don't apply accumulator updates multiple times for tasks in result stages
 -

 Key: SPARK-3628
 URL: https://issues.apache.org/jira/browse/SPARK-3628
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Nan Zhu
Priority: Blocker
 Fix For: 1.2.0


 In previous versions of Spark, accumulator updates only got applied once for 
 accumulators that are only used in actions (i.e. result stages), letting you 
 use them to deterministically compute a result. Unfortunately, this got 
 broken in some recent refactorings.
 This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
 issue is about applying the same semantics to intermediate stages too, which 
 is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227077#comment-14227077
 ] 

Matei Zaharia commented on SPARK-3628:
--

FYI I merged this into 1.2.0, since the patch is now quite a bit smaller. We 
should decide whether we want to back port it to branch-1.1, so I'll leave it 
open for that reason. I don't think there's much point backporting it further 
because the issue is somewhat rare, but we can do it if people ask for it.

 Don't apply accumulator updates multiple times for tasks in result stages
 -

 Key: SPARK-3628
 URL: https://issues.apache.org/jira/browse/SPARK-3628
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Nan Zhu
Priority: Blocker
 Fix For: 1.2.0


 In previous versions of Spark, accumulator updates only got applied once for 
 accumulators that are only used in actions (i.e. result stages), letting you 
 use them to deterministically compute a result. Unfortunately, this got 
 broken in some recent refactorings.
 This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
 issue is about applying the same semantics to intermediate stages too, which 
 is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2014-11-26 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227076#comment-14227076
 ] 

Tathagata Das commented on SPARK-4631:
--

[~prabeeshk] Could you please take a crack at this? This is important because 
otherwise it is becoming impossible for us to maintain the MQTT functionality 
while the MQTT project is pulling down the older version 0.4.0 (which we 
currently depend on). We cannot upgrade to 1.0+ until we have a unit test that 
actually tests whether the stream works. 
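
A hedged sketch of the shape such a test could take, assuming an MQTT broker reachable at tcp://localhost:1883 (a real unit test would embed a broker rather than rely on an external one); class and method names follow MQTTUtils and the Paho client as published around 0.4.x, but details may differ.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils
import org.eclipse.paho.client.mqttv3.MqttClient
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

object MQTTSmokeTest {
  def main(args: Array[String]): Unit = {
    val brokerUrl = "tcp://localhost:1883"   // assumption: externally running broker
    val topic = "spark-test"
    val conf = new SparkConf().setMaster("local[2]").setAppName("mqtt-smoke-test")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Receive side: count records coming off the MQTT stream.
    val stream = MQTTUtils.createStream(ssc, brokerUrl, topic)
    stream.foreachRDD(rdd => println("received " + rdd.count() + " records"))
    ssc.start()

    // Publish side: push a few test messages through the broker.
    val client = new MqttClient(brokerUrl, MqttClient.generateClientId(), new MemoryPersistence())
    client.connect()
    (1 to 10).foreach(i => client.getTopic(topic).publish(("msg-" + i).getBytes("UTF-8"), 1, false))
    client.disconnect()

    ssc.awaitTermination(10000)              // give the receiver time to drain
    ssc.stop()
  }
}
{code}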



 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical

 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client

2014-11-26 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227091#comment-14227091
 ] 

Tathagata Das commented on SPARK-4632:
--

This is a less scary alternative to SPARK-4628.

 Upgrade MQTT dependency to use latest mqtt-client
 -

 Key: SPARK-4632
 URL: https://issues.apache.org/jira/browse/SPARK-4632
 Project: Spark
  Issue Type: Test
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is 
 breaking Spark build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-732) Recomputation of RDDs may result in duplicated accumulator updates

2014-11-26 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227108#comment-14227108
 ] 

Matei Zaharia commented on SPARK-732:
-

As discussed on https://github.com/apache/spark/pull/2524 this is pretty hard 
to provide good semantics for in the general case (accumulator updates inside 
non-result stages), for the following reasons:

- An RDD may be computed as part of multiple stages. For example, if you update 
an accumulator inside a MappedRDD and then shuffle it, that might be one stage. 
But if you then call map() again on the MappedRDD and shuffle the result of 
that, you get a second stage where that map is pipelined (see the sketch after 
this comment). Do you want to count this accumulator update twice or not?
- Entire stages may be resubmitted if shuffle files are deleted by the periodic 
cleaner or are lost due to a node failure, so anything that tracks RDDs would 
need to do so for long periods of time (as long as the RDD is referenceable in 
the user program), which would be pretty complicated to implement.

So I'm going to mark this as won't fix for now, except for the part for 
result stages done in SPARK-3628.
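
A hedged illustration of the first bullet above, as a hypothetical spark-shell snippet (assuming sc and an RDD named data already exist): the same map, and therefore its accumulator update, gets pipelined into two different shuffle map stages.

{code:scala}
val acc = sc.accumulator(0)
val mapped = data.map { x => acc += 1; (x, 1) }
mapped.reduceByKey(_ + _).count()                  // first shuffle stage runs the map
mapped.map(identity).reduceByKey(_ + _).count()    // second shuffle stage pipelines the same map again
// acc is now roughly 2 * data.count(), because mapped was recomputed for the second shuffle
{code}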

 Recomputation of RDDs may result in duplicated accumulator updates
 --

 Key: SPARK-732
 URL: https://issues.apache.org/jira/browse/SPARK-732
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.7.0, 0.6.2, 0.7.1, 0.8.0, 0.7.2, 0.7.3, 0.8.1, 0.8.2, 
 0.9.0, 1.0.1, 1.1.0
Reporter: Josh Rosen
Assignee: Nan Zhu
Priority: Blocker

 Currently, Spark doesn't guard against duplicated updates to the same 
 accumulator due to recomputations of an RDD.  For example:
 {code}
 val acc = sc.accumulator(0)
 data.map { x => acc += 1; f(x) }
 data.count()
 // acc should equal data.count() here
 data.foreach{...}
 // Now, acc = 2 * data.count() because the map() was recomputed.
 {code}
 I think that this behavior is incorrect, especially because this behavior 
 allows the additon or removal of a cache() call to affect the outcome of a 
 computation.
 There's an old TODO to fix this duplicate update issue in the [DAGScheduler 
 code|https://github.com/mesos/spark/blob/ec5e553b418be43aa3f0ccc24e0d5ca9d63504b2/core/src/main/scala/spark/scheduler/DAGScheduler.scala#L494].
 I haven't tested whether recomputation due to blocks being dropped from the 
 cache can trigger duplicate accumulator updates.
 Hypothetically someone could be relying on the current behavior to implement 
 performance counters that track the actual number of computations performed 
 (including recomputations).  To be safe, we could add an explicit warning in 
 the release notes that documents the change in behavior when we fix this.
 Ignoring duplicate updates shouldn't be too hard, but there are a few 
 subtleties.  Currently, we allow accumulators to be used in multiple 
 transformations, so we'd need to detect duplicate updates at the 
 per-transformation level.  I haven't dug too deeply into the scheduler 
 internals, but we might also run into problems where pipelining causes what 
 is logically one set of accumulator updates to show up in two different tasks 
 (e.g. rdd.map(accum += x; ...) and rdd.map(accum += x; ...).count() may cause 
 what's logically the same accumulator update to be applied from two different 
 contexts, complicating the detection of duplicate updates).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3628) Don't apply accumulator updates multiple times for tasks in result stages

2014-11-26 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reopened SPARK-3628:
--

 Don't apply accumulator updates multiple times for tasks in result stages
 -

 Key: SPARK-3628
 URL: https://issues.apache.org/jira/browse/SPARK-3628
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia
Assignee: Nan Zhu
Priority: Blocker
 Fix For: 1.2.0


 In previous versions of Spark, accumulator updates only got applied once for 
 accumulators that are only used in actions (i.e. result stages), letting you 
 use them to deterministically compute a result. Unfortunately, this got 
 broken in some recent refactorings.
 This is related to https://issues.apache.org/jira/browse/SPARK-732, but that 
 issue is about applying the same semantics to intermediate stages too, which 
 is more work and may not be what we want for debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2014-11-26 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227121#comment-14227121
 ] 

Tathagata Das commented on SPARK-4631:
--

I am optimistically going ahead and upgrading MQTT client version in 
SPARK-4632. If things break, we will fix it in a patch release.

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical

 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client

2014-11-26 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-4632:
-
Issue Type: Improvement  (was: Test)

 Upgrade MQTT dependency to use latest mqtt-client
 -

 Key: SPARK-4632
 URL: https://issues.apache.org/jira/browse/SPARK-4632
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is 
 breaking Spark build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4633) Support gzip in spark.compression.io.codec

2014-11-26 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-4633:
---

 Summary: Support gzip in spark.compression.io.codec
 Key: SPARK-4633
 URL: https://issues.apache.org/jira/browse/SPARK-4633
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Reporter: Takeshi Yamamuro
Priority: Trivial


gzip is widely used in other frameworks such as Hadoop MapReduce and Tez, and 
I think that gzip is more stable than other codecs in terms of both performance
and space overhead.

I have one open question: current Spark configurations have a block size option
for each codec (spark.io.compression.[gzip|lz4|snappy].block.size).
As the number of codecs increases, the configuration gains more options, and
I think that is somewhat complicated for non-expert users.

To mitigate this, my thought is as follows:
the three configurations are replaced with a single option for block size
(spark.io.compression.block.size). Then, 'Meaning' in configurations
will describe This option affects gzip, lz4, and snappy.
Block size (in bytes) used in compression, in the case when these compression
codecs are used. Lowering
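
A small sketch of the proposal above; the gzip codec value and the single spark.io.compression.block.size key are proposed/hypothetical here, not existing options.

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical configuration under the proposal: one shared block-size option.
val conf = new SparkConf()
  .set("spark.io.compression.codec", "gzip")         // proposed new codec value
  .set("spark.io.compression.block.size", "32768")   // single block size shared by gzip, lz4 and snappy
{code}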



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4633) Support gzip in spark.compression.io.codec

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227133#comment-14227133
 ] 

Apache Spark commented on SPARK-4633:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/3488

 Support gzip in spark.compression.io.codec
 --

 Key: SPARK-4633
 URL: https://issues.apache.org/jira/browse/SPARK-4633
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Reporter: Takeshi Yamamuro
Priority: Trivial

 gzip is widely used in other frameworks such as Hadoop MapReduce and Tez, and 
 I think that gzip is more stable than other codecs in terms of both 
 performance
 and space overhead.
 I have one open question: current Spark configurations have a block size option
 for each codec (spark.io.compression.[gzip|lz4|snappy].block.size).
 As the number of codecs increases, the configuration gains more options, and
 I think that is somewhat complicated for non-expert users.
 To mitigate this, my thought is as follows:
 the three configurations are replaced with a single option for block size
 (spark.io.compression.block.size). Then, 'Meaning' in configurations
 will describe This option affects gzip, lz4, and snappy.
 Block size (in bytes) used in compression, in the case when these compression
 codecs are used. Lowering



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4634) Enable metrics for each application to be gathered in one node

2014-11-26 Thread Masayoshi TSUZUKI (JIRA)
Masayoshi TSUZUKI created SPARK-4634:


 Summary: Enable metrics for each application to be gathered in one 
node
 Key: SPARK-4634
 URL: https://issues.apache.org/jira/browse/SPARK-4634
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Masayoshi TSUZUKI


Metrics output is now like this:
{noformat}
  - app_1.driver.jvm.somevalue
  - app_1.driver.jvm.somevalue
  - ...
  - app_2.driver.jvm.somevalue
  - app_2.driver.jvm.somevalue
  - ...
{noformat}
In current Spark, application names appear at the top level, but we should be 
able to gather them under some top-level node.

For example, consider Graphite. When we use Graphite, the application names are 
listed as top-level nodes. Graphite can also collect OS metrics, and those can 
be gathered under a single node, but the current Spark metrics cannot. So, with 
current Spark, the tree structure of metrics shown in the Graphite web UI looks 
like this.
{noformat}
  - os
- os.node1.somevalue
- os.node2.somevalue
- ...
  - app_1
- app_1.driver.jvm.somevalue
- app_1.driver.jvm.somevalue
- ...
  - app_2
- ...
  - app_3
- ...
{noformat}
We should be able to add a top-level name before the application name (the 
top-level name might be the cluster name, for instance).

If we make the name configurable via *.conf, it would also be convenient when 
two different Spark clusters sink metrics to the same Graphite server.
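
A sketch of the idea; the spark.metrics.namespace key used here is hypothetical and only illustrates making the extra top-level node configurable per cluster.

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical option: a per-cluster namespace prepended to every metric name.
val conf = new SparkConf()
  .set("spark.metrics.namespace", "cluster1")
// resulting Graphite path: cluster1.app_1.driver.jvm.somevalue
{code}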




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4634) Enable metrics for each application to be gathered in one node

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227141#comment-14227141
 ] 

Apache Spark commented on SPARK-4634:
-

User 'tsudukim' has created a pull request for this issue:
https://github.com/apache/spark/pull/3489

 Enable metrics for each application to be gathered in one node
 --

 Key: SPARK-4634
 URL: https://issues.apache.org/jira/browse/SPARK-4634
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Masayoshi TSUZUKI

 Metrics output is now like this:
 {noformat}
   - app_1.driver.jvm.somevalue
   - app_1.driver.jvm.somevalue
   - ...
   - app_2.driver.jvm.somevalue
   - app_2.driver.jvm.somevalue
   - ...
 {noformat}
 In current spark, application names come to top level,
 but we should be able to gather the application names under some top level 
 node.
 For example, think of using graphite.
 When we use graphite, the application names are listed as top level node.
 Graphite can also collect OS metrics, and OS metrics are able to be put in 
 some one node.
 But the current Spark metrics are not.
 So, with the current Spark, the tree structure of metrics shown in graphite 
 web UI is like this.
 {noformat}
   - os
 - os.node1.somevalue
 - os.node2.somevalue
 - ...
   - app_1
 - app_1.driver.jvm.somevalue
 - app_1.driver.jvm.somevalue
 - ...
   - app_2
 - ...
   - app_3
 - ...
 {noformat}
 We should be able to add some top level name before the application name (top 
 level name may be cluster name for instance).
 If we make the name configurable by *.conf, it might be also convenience in 
 case that 2 different spark clusters sink metrics to the same graphite server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3704) Short (TINYINT) incorrectly handled in thrift JDBC/ODBC server

2014-11-26 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3704:

Summary: Short (TINYINT) incorrectly handled in thrift JDBC/ODBC server  
(was: the types not match adding value form spark row to hive row in 
SparkSQLOperationManager)

 Short (TINYINT) incorrectly handled in thrift JDBC/ODBC server
 --

 Key: SPARK-3704
 URL: https://issues.apache.org/jira/browse/SPARK-3704
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.1.1, 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-11-26 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227266#comment-14227266
 ] 

Jeremy Freeman commented on SPARK-3995:
---

Agree with [~mengxr], that's a better strategy.

 [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
 -

 Key: SPARK-3995
 URL: https://issues.apache.org/jira/browse/SPARK-3995
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Jeremy Freeman
Assignee: Jeremy Freeman
Priority: Critical
 Fix For: 1.2.0


 There is a breaking bug in PySpark's sampling methods when run with NumPy 
 v1.9. This is the version of NumPy included with the current Anaconda 
 distribution (v2.1); this is a popular distribution, and is likely to affect 
 many users.
 Steps to reproduce are:
 {code:python}
 foo = sc.parallelize(range(1000),5)
 foo.takeSample(False, 10)
 {code}
 Returns:
 {code}
 PythonException: Traceback (most recent call last):
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, 
 line 79, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 196, in dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 127, in dump_stream
 for obj in iterator:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 185, in _batched
 for item in iterator:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 116, in func
  if self.getUniformSample(split) <= self._fraction:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 58, in getUniformSample
 self.initRandomGenerator(split)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 44, in initRandomGenerator
 self._random = numpy.random.RandomState(self._seed)
   File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
 (numpy/random/mtrand/mtrand.c:7397)
   File mtrand.pyx, line 646, in mtrand.RandomState.seed 
 (numpy/random/mtrand/mtrand.c:7697)
 ValueError: Seed must be between 0 and 4294967295
 {code}
 In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:
 {code:python}
 self._seed = seed if seed is not None else random.randint(0, sys.maxint)
 {code}
 In previous versions of NumPy a random seed larger than 2 ** 32 would 
 silently get truncated to 2 ** 32. This was fixed in a recent patch 
 (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
  But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, 
 which effectively breaks sampling operations in PySpark (unless the seed is 
 set manually).
 I am putting a PR together now (the fix is very simple!).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4630) Dynamically determine optimal number of partitions

2014-11-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227280#comment-14227280
 ] 

Lianhui Wang commented on SPARK-4630:
-

I am interested in this, and we can learn some methods from 
http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/.
There is also a difficult situation. Example: 
s0 and s1's parent is s2; s2 and s3's parent is s4.
Because s3's map output depends on s4's partitioning, s4's partitioning cannot 
be determined when s3 is about to run, because s2 is not finished.
[~kostas] Can you share some ideas about this situation? Thanks.

 Dynamically determine optimal number of partitions
 --

 Key: SPARK-4630
 URL: https://issues.apache.org/jira/browse/SPARK-4630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis

 Partition sizes play a big part in how fast stages execute during a Spark 
 job. There is a direct relationship between the size of partitions to the 
 number of tasks - larger partitions, fewer tasks. For better performance, 
 Spark has a sweet spot for how large partitions should be that get executed 
 by a task. If partitions are too small, then the user pays a disproportionate 
 cost in scheduling overhead. If the partitions are too large, then task 
 execution slows down due to gc pressure and spilling to disk.
 To increase performance of jobs, users often hand optimize the number(size) 
 of partitions that the next stage gets. Factors that come into play are:
 Incoming partition sizes from previous stage
 number of available executors
 available memory per executor (taking into account 
 spark.shuffle.memoryFraction)
 Spark has access to this data and so should be able to automatically do the 
 partition sizing for the user. This feature can be turned off/on with a 
 configuration option. 
 To make this happen, we propose modifying the DAGScheduler to take into 
 account partition sizes upon stage completion. Before scheduling the next 
 stage, the scheduler can examine the sizes of the partitions and determine 
 the appropriate number tasks to create. Since this change requires 
 non-trivial modifications to the DAGScheduler, a detailed design doc will be 
 attached before proceeding with the work.
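
A minimal sketch of the sizing rule described above, not the actual DAGScheduler change: derive the next stage's task count from the upstream map output size and a target per-partition size. The 64 MB target and the helper name are assumptions for illustration.

{code:scala}
// Pick a task count so that each partition lands near the target size.
def dynamicNumPartitions(totalShuffleBytes: Long,
                         targetPartitionBytes: Long = 64L * 1024 * 1024,
                         minPartitions: Int = 1): Int =
  math.max(minPartitions, math.ceil(totalShuffleBytes.toDouble / targetPartitionBytes).toInt)

// e.g. 10 GB of map output with a 64 MB target gives 160 reduce tasks
{code}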



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4630) Dynamically determine optimal number of partitions

2014-11-26 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227280#comment-14227280
 ] 

Lianhui Wang edited comment on SPARK-4630 at 11/27/14 5:34 AM:
---

I am interested in this, and we can learn some methods from 
http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/.
There is also a difficult situation. Example: 
s0 and s1's parent is s2; s2 and s3's parent is s4.
Because s3's map output depends on s4's partitioning, s4's partitioning cannot 
be determined when s3 is about to run, because s2 is not finished.
[~kostas] Can you share some ideas about this situation? Thanks.


was (Author: lianhuiwang):
i am interesting in this. and we can learn some methon from 
http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/.
and there is a difficult situation.example: 
s0 and s1 's parent is s2. s2 and s3 's parent is s4.
because s3's mapoutput is dependent on s4's partition, when s3 will run s4's 
partition cannot be examined because s2 is not finished.
[~kostas] can you share some ideas about this situation? thanks.

 Dynamically determine optimal number of partitions
 --

 Key: SPARK-4630
 URL: https://issues.apache.org/jira/browse/SPARK-4630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis

 Partition sizes play a big part in how fast stages execute during a Spark 
 job. There is a direct relationship between the size of partitions to the 
 number of tasks - larger partitions, fewer tasks. For better performance, 
 Spark has a sweet spot for how large partitions should be that get executed 
 by a task. If partitions are too small, then the user pays a disproportionate 
 cost in scheduling overhead. If the partitions are too large, then task 
 execution slows down due to gc pressure and spilling to disk.
 To increase performance of jobs, users often hand optimize the number(size) 
 of partitions that the next stage gets. Factors that come into play are:
 Incoming partition sizes from previous stage
 number of available executors
 available memory per executor (taking into account 
 spark.shuffle.memoryFraction)
 Spark has access to this data and so should be able to automatically do the 
 partition sizing for the user. This feature can be turned off/on with a 
 configuration option. 
 To make this happen, we propose modifying the DAGScheduler to take into 
 account partition sizes upon stage completion. Before scheduling the next 
 stage, the scheduler can examine the sizes of the partitions and determine 
 the appropriate number tasks to create. Since this change requires 
 non-trivial modifications to the DAGScheduler, a detailed design doc will be 
 attached before proceeding with the work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4635) Delete the val that never used in HashOuterJoin.

2014-11-26 Thread DoingDone9 (JIRA)
DoingDone9 created SPARK-4635:
-

 Summary: Delete the val that never used in HashOuterJoin.
 Key: SPARK-4635
 URL: https://issues.apache.org/jira/browse/SPARK-4635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: DoingDone9
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4635) Delete the val that never used in HashOuterJoin.

2014-11-26 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4635:
--
Description: The val boundCondition is created in execute(),but it never 
be used in execute();

 Delete the val that never used in HashOuterJoin.
 

 Key: SPARK-4635
 URL: https://issues.apache.org/jira/browse/SPARK-4635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: DoingDone9
Priority: Minor

 The val boundCondition is created in execute(), but it is never used in 
 execute().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2014-11-26 Thread Prabeesh K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227309#comment-14227309
 ] 

Prabeesh K commented on SPARK-4631:
---

[~tdas] Please update to the new 
[resolver|https://repo.eclipse.org/content/repositories/paho/] from the [older 
resolver|https://repo.eclipse.org/content/repositories/paho-releases/].

Also update external/mqtt/pom.xml with the following:
<groupId>org.eclipse.paho</groupId>
<artifactId>mqtt-client</artifactId>
<version>0.4.1-SNAPSHOT</version>

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical

 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4635) Delete the val that never used in execute() of HashOuterJoin.

2014-11-26 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4635:
--
Summary: Delete the val that never used in  execute() of HashOuterJoin.  
(was: Delete the val that never used in HashOuterJoin.)

 Delete the val that never used in  execute() of HashOuterJoin.
 --

 Key: SPARK-4635
 URL: https://issues.apache.org/jira/browse/SPARK-4635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: DoingDone9
Priority: Minor

 The val boundCondition is created in execute(), but it is never used in 
 execute().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4635) Delete the val that never used in execute() of HashOuterJoin.

2014-11-26 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-4635:
--
Description: The val boundCondition is created in execute(),but it never 
be used in execute().  (was: The val boundCondition is created in 
execute(),but it never be used in execute();)

 Delete the val that never used in  execute() of HashOuterJoin.
 --

 Key: SPARK-4635
 URL: https://issues.apache.org/jira/browse/SPARK-4635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: DoingDone9
Priority: Minor

 The val boundCondition is created in execute(), but it is never used in 
 execute().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4170) Closure problems when running Scala app that extends App

2014-11-26 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227319#comment-14227319
 ] 

Brennon York commented on SPARK-4170:
-

[~srowen] this bug seems to be an issue with the way Scala has been defined and 
the differences between what happens at runtime versus compile time with 
respect to the way {code}App{code} leverages the [{code}delayedInit{code} 
function|http://www.scala-lang.org/api/2.11.1/index.html#scala.App].

I tried to replicate the issue on my local machine under both compile time and 
runtime with only the latter producing the issue (as expected through the Scala 
documentation). The former was tested by creating a simple application, 
compiled with sbt, and executed while the latter was setup within the 
{code}spark-shell{code} REPL. I'm wondering if we can't close this issue and 
just provide a bit of documentation somewhere to reference that, when building 
even simple Spark apps, extending the {code}App{code} interface will result in 
delayed initialization and, likely, set null values within that closure. 
Thoughts?
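
A hedged sketch of the workaround the issue description already points to: define an explicit main() instead of extending App, so fields like str1 are initialized before the filter closures are serialized. Object and value names below are illustrative.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object DemoFixed {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"
    // str1 is initialized eagerly here, so the closure captures the expected value.
    val rslt2 = rdd.filter(x => str1 != null && x != "A").count()
    println("DemoFixed: rslt2 = " + rslt2)   // 3, as expected
    sc.stop()
  }
}
{code}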

 Closure problems when running Scala app that extends App
 --

 Key: SPARK-4170
 URL: https://issues.apache.org/jira/browse/SPARK-4170
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sean Owen
Priority: Minor

 Michael Albert noted this problem on the mailing list 
 (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html):
 {code}
 object DemoBug extends App {
 val conf = new SparkConf()
 val sc = new SparkContext(conf)
 val rdd = sc.parallelize(List("A","B","C","D"))
 val str1 = "A"
 val rslt1 = rdd.filter(x => { x != "A" }).count
 val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count
 
 println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
 }
 {code}
 This produces the output:
 {code}
 DemoBug: rslt1 = 3 rslt2 = 0
 {code}
 If instead there is a proper main(), it works as expected.
 I also this week noticed that in a program which extends App, some values 
 were inexplicably null in a closure. When changing to use main(), it was fine.
 I assume there is a problem with variables not being added to the closure 
 when main() doesn't appear in the standard way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4170) Closure problems when running Scala app that extends App

2014-11-26 Thread Brennon York (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227319#comment-14227319
 ] 

Brennon York edited comment on SPARK-4170 at 11/27/14 6:48 AM:
---

[~srowen] this bug seems to be an issue with the way Scala has been defined and 
the differences between what happens at runtime versus compile time with 
respect to the way App leverages the [delayedInit 
function|http://www.scala-lang.org/api/2.11.1/index.html#scala.App].

I tried to replicate the issue on my local machine under both compile time and 
runtime with only the latter producing the issue (as expected through the Scala 
documentation). The former was tested by creating a simple application, 
compiled with sbt, and executed while the latter was setup within the 
spark-shell REPL. I'm wondering if we can't close this issue and just provide a 
bit of documentation somewhere to reference that, when building even simple 
Spark apps, extending the App interface will result in delayed initialization 
and, likely, set null values within that closure. Thoughts?


was (Author: boyork):
[~srowen] this bug seems to be an issue with the way Scala has been defined and 
the differences between what happens at runtime versus compile time with 
respect to the way {code}App{code} leverages the [{code}delayedInit{code} 
function|http://www.scala-lang.org/api/2.11.1/index.html#scala.App].

I tried to replicate the issue on my local machine under both compile time and 
runtime with only the latter producing the issue (as expected through the Scala 
documentation). The former was tested by creating a simple application, 
compiled with sbt, and executed while the latter was setup within the 
{code}spark-shell{code} REPL. I'm wondering if we can't close this issue and 
just provide a bit of documentation somewhere to reference that, when building 
even simple Spark apps, extending the {code}App{code} interface will result in 
delayed initialization and, likely, set null values within that closure. 
Thoughts?

 Closure problems when running Scala app that extends App
 --

 Key: SPARK-4170
 URL: https://issues.apache.org/jira/browse/SPARK-4170
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sean Owen
Priority: Minor

 Michael Albert noted this problem on the mailing list 
 (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html):
 {code}
 object DemoBug extends App {
 val conf = new SparkConf()
 val sc = new SparkContext(conf)
 val rdd = sc.parallelize(List("A","B","C","D"))
 val str1 = "A"
 val rslt1 = rdd.filter(x => { x != "A" }).count
 val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count
 
 println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
 }
 {code}
 This produces the output:
 {code}
 DemoBug: rslt1 = 3 rslt2 = 0
 {code}
 If instead there is a proper main(), it works as expected.
 I also this week noticed that in a program which extends App, some values 
 were inexplicably null in a closure. When changing to use main(), it was fine.
 I assume there is a problem with variables not being added to the closure 
 when main() doesn't appear in the standard way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4635) Delete the val that never used in execute() of HashOuterJoin.

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227321#comment-14227321
 ] 

Apache Spark commented on SPARK-4635:
-

User 'DoingDone9' has created a pull request for this issue:
https://github.com/apache/spark/pull/3491

 Delete the val that never used in  execute() of HashOuterJoin.
 --

 Key: SPARK-4635
 URL: https://issues.apache.org/jira/browse/SPARK-4635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: DoingDone9
Priority: Minor

 The val boundCondition is created in execute(), but it is never used in 
 execute().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227326#comment-14227326
 ] 

Apache Spark commented on SPARK-4632:
-

User 'prabeesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/3494

 Upgrade MQTT dependency to use latest mqtt-client
 -

 Key: SPARK-4632
 URL: https://issues.apache.org/jira/browse/SPARK-4632
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is 
 breaking Spark build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227325#comment-14227325
 ] 

Apache Spark commented on SPARK-4632:
-

User 'prabeesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/3492

 Upgrade MQTT dependency to use latest mqtt-client
 -

 Key: SPARK-4632
 URL: https://issues.apache.org/jira/browse/SPARK-4632
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is 
 breaking Spark build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4632) Upgrade MQTT dependency to use latest mqtt-client

2014-11-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227324#comment-14227324
 ] 

Apache Spark commented on SPARK-4632:
-

User 'prabeesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/3493

 Upgrade MQTT dependency to use latest mqtt-client
 -

 Key: SPARK-4632
 URL: https://issues.apache.org/jira/browse/SPARK-4632
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker

 mqtt client 0.4.0 was removed from the Eclipse Paho repository, and hence is 
 breaking Spark build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4631) Add real unit test for MQTT

2014-11-26 Thread Prabeesh K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227309#comment-14227309
 ] 

Prabeesh K edited comment on SPARK-4631 at 11/27/14 6:54 AM:
-

[~tdas] Please update to the new 
[resolver|https://repo.eclipse.org/content/repositories/paho/] from the [older 
resolver|https://repo.eclipse.org/content/repositories/paho-releases/].

Also update external/mqtt/pom.xml with the following:
<groupId>org.eclipse.paho</groupId>
<artifactId>mqtt-client</artifactId>
<version>0.4.1-SNAPSHOT</version>

Refer to SPARK-4632.


was (Author: prabeeshk):
[~tdas] Please update to new 
[resolver|https://repo.eclipse.org/content/repositories/paho/] from the [older 
resolver | https://repo.eclipse.org/content/repositories/paho-releases/]

Also update the  external/mqtt/pom.xml  with following
groupIdorg.eclipse.paho/groupId
artifactIdmqtt-client/artifactId
version0.4.1-SNAPSHOT/version

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical

 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2014-11-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227337#comment-14227337
 ] 

Reynold Xin commented on SPARK-4631:


How popular is MQTT? Do we really need this support in Spark natively? 

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical

 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4631) Add real unit test for MQTT

2014-11-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227337#comment-14227337
 ] 

Reynold Xin edited comment on SPARK-4631 at 11/27/14 7:11 AM:
--

How popular is MQTT? Do we really need this support in Spark natively? I think 
we should reconsider the dependency if MQTT is not widely popular, is causing us 
trouble to support, and is causing our users trouble due to build issues. 


was (Author: rxin):
How popular is MQTT? Do we really need this support in Spark natively? 

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical

 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4599) add hive profile to root pom

2014-11-26 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang closed SPARK-4599.
--
Resolution: Won't Fix

 add hive profile to root pom
 

 Key: SPARK-4599
 URL: https://issues.apache.org/jira/browse/SPARK-4599
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Adrian Wang
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4628) Put all external projects behind a build flag

2014-11-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227339#comment-14227339
 ] 

Reynold Xin commented on SPARK-4628:


Shall we also not depend on artifacts that are not in maven central? It should 
be a high bar for Spark to depend on some library, and if they can't make it to 
maven central, it doesn't inspire a lot of confidence.

Of course, this can only be a general rule and there can also be exceptions.


 Put all external projects behind a build flag
 -

 Key: SPARK-4628
 URL: https://issues.apache.org/jira/browse/SPARK-4628
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Wendell
Priority: Blocker

 This is something we talked about doing for convenience, but I'm escalating 
 this based on realizing today that some of our external projects depend on 
 code that is not in maven central. I.e. if one of these dependencies is taken 
 down (as happened recently with mqtt), all Spark builds will fail.
 The proposal here is simple, have a profile -Pexternal-projects that enables 
 these. This can follow the exact pattern of -Pkinesis-asl which was disabled 
 by default due to a license issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-26 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227352#comment-14227352
 ] 

Yu Ishikawa commented on SPARK-2429:


Hi [~yangjunpro], 

{quote}
1. In your implementation, in each divisive steps, there will be a copy 
operations to distribution the data nodes in the parent cluster tree to the 
split children cluster trees, when the document size is large, I think this 
copy cost is non-neglectable, right?
{quote}

Exactly. The cached memory is twice as large as the original data. For example, 
if the data size is 10 GB, the Spark cluster always holds 20 GB of cached RDDs 
throughout the algorithm. The reason I cache the data nodes at each dividing 
step is elapsed time; that is, this algorithm is very slow without caching the 
data nodes.

{quote}
2. In your test code, the cluster size is not quite large( only about 100 ), 
have you ever tested it with big cluster size and big document corpus? e.g., 
1 clusters with 200 documents. What is the performance behavior facing 
this kind of use case?
{quote}

The test code deals with small data, as you said. I think the data size in unit 
tests should not be large, in order to keep the test time short. Of course, I am 
willing to make this algorithm fit the large input data and large cluster counts 
we want. Although I have never checked the performance of this implementation 
with large cluster counts, such as 1, the elapsed time could be long. I will 
check the performance under that condition. Or, if possible, could you check the 
performance?

thanks
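
A hedged sketch of the "top down, recursive application of KMeans" approach listed in the issue description, not the implementation under review: split a cluster in two with MLlib's KMeans and recurse on each child. The leaf-budget split and the stopping rule are deliberately simplified assumptions.

{code:scala}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def bisect(data: RDD[Vector], leaves: Int): Seq[RDD[Vector]] =
  if (leaves <= 1 || data.count() < 2) Seq(data)
  else {
    val model = KMeans.train(data.cache(), 2, 20)                // k = 2, maxIterations = 20
    val children = (0 until 2).map(i => data.filter(v => model.predict(v) == i))
    val budgets = Seq(leaves - leaves / 2, leaves / 2)           // split the remaining leaf budget
    children.zip(budgets).flatMap { case (child, b) => bisect(child, b) }
  }
{code}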


 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4631) Add real unit test for MQTT

2014-11-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227356#comment-14227356
 ] 

Sean Owen commented on SPARK-4631:
--

(FWIW I had never heard of MQTT before Spark. This does feel like something 
below the bar for inclusion in Spark and could be a separate project. Consider 
deprecating it?)

 Add real unit test for MQTT 
 

 Key: SPARK-4631
 URL: https://issues.apache.org/jira/browse/SPARK-4631
 Project: Spark
  Issue Type: Test
  Components: Streaming
Reporter: Tathagata Das
Priority: Critical

 A real unit test that actually transfers data to ensure that the MQTTUtil is 
 functional



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


