[jira] [Commented] (SPARK-5212) Add support of schema-less transformation

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274845#comment-14274845
 ] 

Apache Spark commented on SPARK-5212:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4014

> Add support of schema-less transformation
> -
>
> Key: SPARK-5212
> URL: https://issues.apache.org/jira/browse/SPARK-5212
> Project: Spark
>  Issue Type: Improvement
>Reporter: Liang-Chi Hsieh
>
> According to Hive's language manual 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform), 
> the AS clause should be optional in TRANSFORM syntax. This PR adds support 
> for it.






[jira] [Created] (SPARK-5212) Add support of schema-less transformation

2015-01-12 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-5212:
--

 Summary: Add support of schema-less transformation
 Key: SPARK-5212
 URL: https://issues.apache.org/jira/browse/SPARK-5212
 Project: Spark
  Issue Type: Bug
Reporter: Liang-Chi Hsieh


According to Hive's language manual 
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform), 
the AS clause should be optional in TRANSFORM syntax. This PR adds support 
for it.
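
For illustration, here is a hedged sketch of the two forms of TRANSFORM syntax 
involved, assuming a Hive-enabled Spark build, the shell's existing SparkContext 
{{sc}}, and a hypothetical table {{src(key INT, value STRING)}} (names here are 
illustrative, not taken from the PR):

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// With an explicit AS clause (already supported): the output columns are named here.
hiveCtx.sql("SELECT TRANSFORM (key, value) USING 'cat' AS (k, v) FROM src")

// Schema-less form this issue adds: no AS clause. Per the linked Hive manual,
// the script output is then split on the first tab into two default columns,
// `key` and `value`.
hiveCtx.sql("SELECT TRANSFORM (key, value) USING 'cat' FROM src")
{code}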






[jira] [Updated] (SPARK-5212) Add support of schema-less transformation

2015-01-12 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-5212:
---
Issue Type: Improvement  (was: Bug)

> Add support of schema-less transformation
> -
>
> Key: SPARK-5212
> URL: https://issues.apache.org/jira/browse/SPARK-5212
> Project: Spark
>  Issue Type: Improvement
>Reporter: Liang-Chi Hsieh
>
> According to Hive's language manual 
> (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform), 
> the AS clause should be optional in TRANSFORM syntax. This PR adds support 
> for it.






[jira] [Commented] (SPARK-5207) StandardScalerModel mean and variance re-use

2015-01-12 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274802#comment-14274802
 ] 

DB Tsai commented on SPARK-5207:


[~mengxr]'s idea sounds great to me. Specifically, let's have mean and 
variance as required arguments in the constructor, with withMean = false 
and withStd = true as defaults, and add two setter methods to change 
withMean and withStd. Thanks.
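
To make the proposal concrete, a rough sketch of the suggested shape (this is 
an illustration only, not the actual MLlib class, and the setter names are 
assumptions):

{code}
import org.apache.spark.mllib.linalg.Vector

// Sketch: mean and variance are required constructor arguments so they can be
// re-used, while withMean/withStd default to false/true and can be changed
// through setters.
class StandardScalerModelSketch(val mean: Vector, val variance: Vector) {
  private var withMean: Boolean = false
  private var withStd: Boolean = true

  def setWithMean(value: Boolean): this.type = { withMean = value; this }
  def setWithStd(value: Boolean): this.type = { withStd = value; this }

  // transform(vector) would subtract `mean` and/or scale by sqrt(`variance`)
  // depending on the two flags, as the existing StandardScaler does.
}
{code}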

> StandardScalerModel mean and variance re-use
> 
>
> Key: SPARK-5207
> URL: https://issues.apache.org/jira/browse/SPARK-5207
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Octavian Geagla
>Assignee: Octavian Geagla
>
> From this discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html
> Changing the constructor to public would be a simple change, but a discussion 
> is needed to determine what args are necessary for this change.






[jira] [Resolved] (SPARK-5138) pyspark unable to infer schema of namedtuple

2015-01-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5138.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3978
[https://github.com/apache/spark/pull/3978]

> pyspark unable to infer schema of namedtuple
> 
>
> Key: SPARK-5138
> URL: https://issues.apache.org/jira/browse/SPARK-5138
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Gabe Mulley
>Priority: Trivial
> Fix For: 1.3.0
>
>
> When attempting to infer the schema of an RDD that contains namedtuples, 
> pyspark fails to identify the records as namedtuples, resulting in it raising 
> an error.
> Example:
> {noformat}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from collections import namedtuple
> import os
> sc = SparkContext()
> rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
> TextLine = namedtuple('TextLine', 'line length')
> tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
> tuple_rdd.take(5)  # This works
> sqlc = SQLContext(sc)
> # The following line raises an error
> schema_rdd = sqlc.inferSchema(tuple_rdd)
> {noformat}
> The error raised is:
> {noformat}
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, 
> in main
> process()
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in 
> process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 
> 227, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in 
> takeUpToNumLeft
> yield next(iterator)
>   File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in 
> convert_struct
> raise ValueError("unexpected tuple: %s" % obj)
> TypeError: not all arguments converted during string formatting
> {noformat}






[jira] [Resolved] (SPARK-4999) No need to put WAL-backed block into block manager by default

2015-01-12 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4999.
--
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

> No need to put WAL-backed block into block manager by default
> -
>
> Key: SPARK-4999
> URL: https://issues.apache.org/jira/browse/SPARK-4999
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Saisai Shao
> Fix For: 1.3.0, 1.2.1
>
>
> Currently a WAL-backed block is read out from HDFS and put into the 
> BlockManager with storage level MEMORY_ONLY_SER by default. Since a 
> WAL-backed block is already fault-tolerant, there is no need to put it into 
> the BlockManager again by default.






[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called

2015-01-12 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274685#comment-14274685
 ] 

Saisai Shao commented on SPARK-5147:


I'm working on this. The major part of the work is done, aside from a small 
bug; I will figure out the problem and submit a PR.

> write ahead logs from streaming receiver are not purged because 
> cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
> --
>
> Key: SPARK-5147
> URL: https://issues.apache.org/jira/browse/SPARK-5147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Max Xu
>Priority: Blocker
>
> Hi all,
> We are running a Spark streaming application with ReliableKafkaReceiver. We 
> have "spark.streaming.receiver.writeAheadLog.enable" set to true, so write 
> ahead logs (WALs) for received data are created under the 
> receivedData/streamId folder in the checkpoint directory. 
> However, old WALs are never purged over time; receivedBlockMetadata and 
> checkpoint files are purged correctly though. I went through the code: the 
> WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is 
> responsible for cleaning up the old blocks, and it has a method, 
> cleanupOldBlocks, which is never called by any class. The 
> ReceiverSupervisorImpl class holds a WriteAheadLogBasedBlockHandler instance; 
> however, it only calls the storeBlock method to create WALs and never calls 
> the cleanupOldBlocks method to purge old WALs.
> The size of the WAL folder increases constantly on HDFS. This is preventing 
> us from running the ReliableKafkaReceiver 24x7. Can somebody please take a 
> look?
> Thanks,
> Max






[jira] [Updated] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called

2015-01-12 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5147:
-
Target Version/s: 1.3.0, 1.2.1

> write ahead logs from streaming receiver are not purged because 
> cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
> --
>
> Key: SPARK-5147
> URL: https://issues.apache.org/jira/browse/SPARK-5147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Max Xu
>Priority: Blocker
>
> Hi all,
> We are running a Spark streaming application with ReliableKafkaReceiver. We 
> have "spark.streaming.receiver.writeAheadLog.enable" set to true, so write 
> ahead logs (WALs) for received data are created under the 
> receivedData/streamId folder in the checkpoint directory. 
> However, old WALs are never purged over time; receivedBlockMetadata and 
> checkpoint files are purged correctly though. I went through the code: the 
> WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is 
> responsible for cleaning up the old blocks, and it has a method, 
> cleanupOldBlocks, which is never called by any class. The 
> ReceiverSupervisorImpl class holds a WriteAheadLogBasedBlockHandler instance; 
> however, it only calls the storeBlock method to create WALs and never calls 
> the cleanupOldBlocks method to purge old WALs.
> The size of the WAL folder increases constantly on HDFS. This is preventing 
> us from running the ReliableKafkaReceiver 24x7. Can somebody please take a 
> look?
> Thanks,
> Max






[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called

2015-01-12 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274683#comment-14274683
 ] 

Tathagata Das commented on SPARK-5147:
--

I think this is a critical bug. This should be fixed ASAP. Can you come up with 
a fix? 



> write ahead logs from streaming receiver are not purged because 
> cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
> --
>
> Key: SPARK-5147
> URL: https://issues.apache.org/jira/browse/SPARK-5147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Max Xu
>Priority: Blocker
>
> Hi all,
> We are running a Spark streaming application with ReliableKafkaReceiver. We 
> have "spark.streaming.receiver.writeAheadLog.enable" set to true, so write 
> ahead logs (WALs) for received data are created under the 
> receivedData/streamId folder in the checkpoint directory. 
> However, old WALs are never purged over time; receivedBlockMetadata and 
> checkpoint files are purged correctly though. I went through the code: the 
> WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is 
> responsible for cleaning up the old blocks, and it has a method, 
> cleanupOldBlocks, which is never called by any class. The 
> ReceiverSupervisorImpl class holds a WriteAheadLogBasedBlockHandler instance; 
> however, it only calls the storeBlock method to create WALs and never calls 
> the cleanupOldBlocks method to purge old WALs.
> The size of the WAL folder increases constantly on HDFS. This is preventing 
> us from running the ReliableKafkaReceiver 24x7. Can somebody please take a 
> look?
> Thanks,
> Max






[jira] [Updated] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called

2015-01-12 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-5147:
-
Priority: Blocker  (was: Major)

> write ahead logs from streaming receiver are not purged because 
> cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
> --
>
> Key: SPARK-5147
> URL: https://issues.apache.org/jira/browse/SPARK-5147
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Max Xu
>Priority: Blocker
>
> Hi all,
> We are running a Spark streaming application with ReliableKafkaReceiver. We 
> have "spark.streaming.receiver.writeAheadLog.enable" set to true, so write 
> ahead logs (WALs) for received data are created under the 
> receivedData/streamId folder in the checkpoint directory. 
> However, old WALs are never purged over time; receivedBlockMetadata and 
> checkpoint files are purged correctly though. I went through the code: the 
> WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is 
> responsible for cleaning up the old blocks, and it has a method, 
> cleanupOldBlocks, which is never called by any class. The 
> ReceiverSupervisorImpl class holds a WriteAheadLogBasedBlockHandler instance; 
> however, it only calls the storeBlock method to create WALs and never calls 
> the cleanupOldBlocks method to purge old WALs.
> The size of the WAL folder increases constantly on HDFS. This is preventing 
> us from running the ReliableKafkaReceiver 24x7. Can somebody please take a 
> look?
> Thanks,
> Max






[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint

2015-01-12 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274681#comment-14274681
 ] 

Tathagata Das commented on SPARK-5206:
--

Interesting observation! Can this be solved just by explicitly referencing the 
Accumulator object at the beginning of your program? If that works, then we can 
add this reference to Accumulator in the StreamingContext object to make sure 
it is automatically called. 
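
As a hedged sketch of the kind of explicit reference being discussed (an 
illustration of the idea, not a confirmed fix), one option is a lazily 
initialized singleton that is touched from the driver code, so the accumulator 
is created and registered again after a recovery from checkpoint:

{code}
import org.apache.spark.{Accumulator, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Lazily (re)created accumulator: the first access on a recovered driver runs
// sc.accumulator(...) again, which repeats the registration described below.
object RecordCounter {
  @volatile private var instance: Accumulator[Long] = null
  def getInstance(sc: SparkContext): Accumulator[Long] = synchronized {
    if (instance == null) instance = sc.accumulator(0L)
    instance
  }
}

def createContext(checkpointDir: String): StreamingContext = {
  val ssc = new StreamingContext("local[2]", "accumulator-recovery-sketch", Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... build the streams here; inside foreachRDD, use
  // RecordCounter.getInstance(rdd.sparkContext) instead of a captured field ...
  ssc
}

val ssc = StreamingContext.getOrCreate("/tmp/checkpoint",
  () => createContext("/tmp/checkpoint"))
{code}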

> Accumulators are not re-registered during recovering from checkpoint
> 
>
> Key: SPARK-5206
> URL: https://issues.apache.org/jira/browse/SPARK-5206
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: vincent ye
>
> I got the following exception while my streaming application restarts from a 
> crash, recovering from a checkpoint:
> 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR 
> scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 
> 4)
> java.util.NoSuchElementException: key not found: 1
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> I guess that an Accumulator is registered to the singleton Accumulators at 
> line 58 of org.apache.spark.Accumulable:
> Accumulators.register(this, true)
> This code needs to be executed in the driver once, but when the application 
> is recovered from a checkpoint it won't be executed in the driver. So when 
> the driver processes the task completion at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938),
> it can't find the Accumulator because it was not re-registered during the 
> recovery.






[jira] [Resolved] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail

2015-01-12 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar resolved SPARK-5164.
-
Resolution: Duplicate

This duplicates SPARK-1825 and has similar findings.

> YARN | Spark job submits from windows machine to a linux YARN cluster fail
> --
>
> Key: SPARK-5164
> URL: https://issues.apache.org/jira/browse/SPARK-5164
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
> Environment: Spark submit from Windows 7
> YARN cluster on CentOS 6.5
>Reporter: Aniket Bhatnagar
>
> While submitting Spark jobs from a Windows machine to a Linux YARN cluster, 
> the jobs fail for the following reasons:
> 1. Commands and the classpath contain environment variables (like JAVA_HOME, 
> PWD, etc.), but they are added using Windows syntax (%JAVA_HOME%, %PWD%, etc.) 
> instead of Linux syntax ($JAVA_HOME, $PWD, etc.).
> 2. Paths in the launch environment are delimited by semicolons instead of 
> colons. This is because File.pathSeparator is used in YarnSparkHadoopUtil.
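
A small illustrative snippet (not from the ticket) of why this happens: both 
the path delimiter and the variable-reference syntax are chosen based on the 
submitting machine's platform, not the cluster's:

{code}
import java.io.File

// Evaluates to ";" on the Windows client, while the Linux NodeManagers expect ":",
// so a classpath string assembled on the client arrives with the wrong delimiter.
val clientPathSeparator = File.pathSeparator

// Similarly, only one of these expands correctly on the cluster side:
val windowsStyle = "%JAVA_HOME%\\bin\\java"  // cmd.exe syntax, built on the client
val linuxStyle   = "$JAVA_HOME/bin/java"     // what the Linux containers need
{code}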






[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2015-01-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4924:
-
Target Version/s: 1.3.0

> Factor out code to launch Spark applications into a separate library
> 
>
> Key: SPARK-4924
> URL: https://issues.apache.org/jira/browse/SPARK-4924
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: spark-launcher.txt
>
>
> One of the questions we run into rather commonly is "how to start a Spark 
> application from my Java/Scala program?". There currently isn't a good answer 
> to that:
> - Instantiating SparkContext has limitations (e.g., you can only have one 
> active context at the moment, plus you lose the ability to submit apps in 
> cluster mode)
> - Calling SparkSubmit directly is doable but you lose a lot of the logic 
> handled by the shell scripts
> - Calling the shell script directly is doable,  but sort of ugly from an API 
> point of view.
> I think it would be nice to have a small library that handles that for users. 
> On top of that, this library could be used by Spark itself to replace a lot 
> of the code in the current shell scripts, which have a lot of duplication.






[jira] [Updated] (SPARK-4859) Refactor LiveListenerBus and StreamingListenerBus

2015-01-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-4859:

Description: 
[#4006|https://github.com/apache/spark/pull/4006] refactors LiveListenerBus and 
StreamingListenerBus and extracts the common code into a parent class 
ListenerBus.

It also includes bug fixes from [#3710|https://github.com/apache/spark/pull/3710]:
1. Fix the race condition on queueFullErrorMessageLogged in LiveListenerBus and 
StreamingListenerBus to avoid outputting queue-full-error logs multiple times.
2. Make sure the SHUTDOWN message will be delivered to listenerThread, so that 
listenerThread will always be able to exit.
3. Log errors from listeners rather than crashing listenerThread in 
StreamingListenerBus.

While fixing the above bugs, we found it better to make LiveListenerBus and 
StreamingListenerBus behave the same way, which would otherwise leave a lot of 
duplicated code in LiveListenerBus and StreamingListenerBus.

Therefore, I extracted their common code into ListenerBus as a parent class: 
LiveListenerBus and StreamingListenerBus only need to extend ListenerBus and 
implement onPostEvent (how to process an event) and onDropEvent (do something 
when dropping an event).

  was:
Fix the race condition of `queueFullErrorMessageLogged`.
Log the error from listener rather than crashing `listenerThread`.

Summary: Refactor LiveListenerBus and StreamingListenerBus  (was: 
Improve StreamingListenerBus)
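
A simplified, hedged sketch of the kind of parent class described above (the 
real ListenerBus extracted in #4006 also handles the event queue, the listener 
thread, and the SHUTDOWN message; names below are illustrative):

{code}
// Subclasses only decide how to dispatch one event to one listener and what to
// do when an event must be dropped.
trait ListenerBusSketch[L <: AnyRef, E] {
  private val listeners = new java.util.concurrent.CopyOnWriteArrayList[L]

  def addListener(listener: L): Unit = listeners.add(listener)

  // Deliver one event to every listener; log instead of crashing if a listener throws.
  final def postToAll(event: E): Unit = {
    val iter = listeners.iterator()
    while (iter.hasNext) {
      val listener = iter.next()
      try onPostEvent(listener, event)
      catch {
        case t: Throwable =>
          println(s"Listener ${listener.getClass.getName} threw an exception: $t")
      }
    }
  }

  protected def onPostEvent(listener: L, event: E): Unit
  protected def onDropEvent(event: E): Unit
}
{code}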

> Refactor LiveListenerBus and StreamingListenerBus
> -
>
> Key: SPARK-4859
> URL: https://issues.apache.org/jira/browse/SPARK-4859
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Shixiong Zhu
>
> [#4006|https://github.com/apache/spark/pull/4006] refactors LiveListenerBus 
> and StreamingListenerBus and extracts the common code into a parent class 
> ListenerBus.
> It also includes bug fixes from 
> [#3710|https://github.com/apache/spark/pull/3710]:
> 1. Fix the race condition on queueFullErrorMessageLogged in LiveListenerBus 
> and StreamingListenerBus to avoid outputting queue-full-error logs multiple 
> times.
> 2. Make sure the SHUTDOWN message will be delivered to listenerThread, so 
> that listenerThread will always be able to exit.
> 3. Log errors from listeners rather than crashing listenerThread in 
> StreamingListenerBus.
> While fixing the above bugs, we found it better to make LiveListenerBus and 
> StreamingListenerBus behave the same way, which would otherwise leave a lot 
> of duplicated code in LiveListenerBus and StreamingListenerBus.
> Therefore, I extracted their common code into ListenerBus as a parent class: 
> LiveListenerBus and StreamingListenerBus only need to extend ListenerBus and 
> implement onPostEvent (how to process an event) and onDropEvent (do something 
> when dropping an event).






[jira] [Closed] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

2015-01-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-5056.

Resolution: Won't Fix

> Implementing Clara k-medoids clustering algorithm for large datasets
> 
>
> Key: SPARK-5056
> URL: https://issues.apache.org/jira/browse/SPARK-5056
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Tomislav Milinovic
>Priority: Minor
>  Labels: features
>
> There is a specific k-medoids clustering algorithm for large datasets. The 
> algorithm is called Clara in R, and is fully described in chapter 3 of 
> Finding Groups in Data: An Introduction to Cluster Analysis by Kaufman, L. 
> and Rousseeuw, P.J. (1990). 
> The algorithm considers sub-datasets of fixed size (sampsize) such that the 
> time and storage requirements become linear in n rather than quadratic. Each 
> sub-dataset is partitioned into k clusters using the same algorithm as in 
> Partitioning Around Medoids (PAM).
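
For readers unfamiliar with the algorithm, a compact, hedged sketch of CLARA's 
sampling loop in plain Scala (the greedy medoid search below is a naive 
stand-in for a full PAM implementation, and none of this reflects an MLlib 
API):

{code}
import scala.util.Random

object ClaraSketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cost of a medoid set = sum of distances from each point to its nearest medoid.
  def cost(data: Seq[Point], medoids: Seq[Point]): Double =
    data.map(p => medoids.map(dist(p, _)).min).sum

  // Naive greedy selection standing in for PAM on the small sample.
  private def greedyMedoids(sample: Seq[Point], k: Int): Seq[Point] = {
    var medoids = Vector(sample.head)
    while (medoids.size < k) {
      val next = sample.filterNot(p => medoids.exists(_ eq p))
        .minBy(c => cost(sample, medoids :+ c))
      medoids :+= next
    }
    medoids
  }

  // CLARA outer loop: cluster several fixed-size samples and keep the medoid
  // set whose cost over the FULL dataset is lowest, so the work stays roughly
  // linear in the number of points.
  def clara(data: Seq[Point], k: Int, sampSize: Int, numSamples: Int,
            seed: Long = 1L): Seq[Point] = {
    val rng = new Random(seed)
    (1 to numSamples)
      .map(_ => greedyMedoids(rng.shuffle(data).take(sampSize), k))
      .minBy(m => cost(data, m))
  }
}
{code}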






[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

2015-01-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274568#comment-14274568
 ] 

Xiangrui Meng commented on SPARK-5056:
--

This is along the same lines as our discussion in SPARK-4510. If we 
choose a sample, is there any theoretical guarantee on the convergence? If we 
have 1 billion instances, what sample size would be appropriate? The original paper 
https://lirias.kuleuven.be/handle/123456789/426399, if I found the correct one, 
hasn't received many citations. 

In general, I think this algorithm is out of MLlib's scope. If someone is 
interested in implementing this algorithm, it would be best maintained outside 
Spark as a 3rd-party package. I'm going to mark it as "Won't Fix", but feel 
free to reopen it if there are things I missed.

> Implementing Clara k-medoids clustering algorithm for large datasets
> 
>
> Key: SPARK-5056
> URL: https://issues.apache.org/jira/browse/SPARK-5056
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Tomislav Milinovic
>Priority: Minor
>  Labels: features
>
> There is a specific k-medoids clustering algorithm for large datasets. The 
> algorithm is called Clara in R, and is fully described in chapter 3 of 
> Finding Groups in Data: An Introduction to Cluster Analysis by Kaufman, L. 
> and Rousseeuw, P.J. (1990). 
> The algorithm considers sub-datasets of fixed size (sampsize) such that the 
> time and storage requirements become linear in n rather than quadratic. Each 
> sub-dataset is partitioned into k clusters using the same algorithm as in 
> Partitioning Around Medoids (PAM).






[jira] [Created] (SPARK-5211) Restore HiveMetastoreTypes.toDataType

2015-01-12 Thread Yin Huai (JIRA)
Yin Huai created SPARK-5211:
---

 Summary: Restore HiveMetastoreTypes.toDataType
 Key: SPARK-5211
 URL: https://issues.apache.org/jira/browse/SPARK-5211
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


It was a public API. Since developers are using it, we need to get it back.






[jira] [Updated] (SPARK-5211) Restore HiveMetastoreTypes.toDataType

2015-01-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5211:

Priority: Critical  (was: Major)

> Restore HiveMetastoreTypes.toDataType
> -
>
> Key: SPARK-5211
> URL: https://issues.apache.org/jira/browse/SPARK-5211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> It was a public API. Since developers are using it, we need to get it back.






[jira] [Created] (SPARK-5210) Support log rolling in EventLogger

2015-01-12 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-5210:
-

 Summary: Support log rolling in EventLogger
 Key: SPARK-5210
 URL: https://issues.apache.org/jira/browse/SPARK-5210
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Web UI
Reporter: Josh Rosen
Assignee: Josh Rosen


For long-running Spark applications (e.g. running for days / weeks), the Spark 
event log may grow to be very large.

As a result, it would be useful if EventLoggingListener supported log file 
rolling / rotation.  Adding this feature will involve changes to the 
HistoryServer in order to be able to load event logs from a sequence of files 
instead of a single file.






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274535#comment-14274535
 ] 

Nicholas Chammas commented on SPARK-3821:
-

That's correct. All those paths are just relative to the folder containing 
{{spark-packer.json}}.

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.






[jira] [Updated] (SPARK-5209) Jobs fail with "unexpected value" exception in certain environments

2015-01-12 Thread Sven Krasser (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sven Krasser updated SPARK-5209:

Attachment: spark-defaults.conf
repro.py
gen_test_data.py
exec_log.txt
driver_log.txt

> Jobs fail with "unexpected value" exception in certain environments
> ---
>
> Key: SPARK-5209
> URL: https://issues.apache.org/jira/browse/SPARK-5209
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Amazon Elastic Map Reduce
>Reporter: Sven Krasser
> Attachments: driver_log.txt, exec_log.txt, gen_test_data.py, 
> repro.py, spark-defaults.conf
>
>
> Jobs fail consistently and reproducibly with exceptions of the following type 
> in PySpark using Spark 1.2.0:
> {noformat}
> 2015-01-13 00:14:05,898 ERROR [Executor task launch worker-1] 
> executor.Executor (Logging.scala:logError(96)) - Exception in task 27.0 in 
> stage 0.0 (TID 28)
> org.apache.spark.SparkException: PairwiseRDD: unexpected value: 
> List([B@4c09f3e0)
> {noformat}
> The issue appeared the first time in Spark 1.2.0 and is sensitive to the 
> environment (configuration, cluster size), i.e. some changes to the 
> environment will cause the error to not occur.
> The following steps yield a reproduction on Amazon Elastic Map Reduce. Launch 
> an EMR cluster with the following parameters (this will bootstrap Spark 1.2.0 
> onto it):
> {code}
> aws emr create-cluster --region us-west-1 --no-auto-terminate \
>--ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \
>--bootstrap-actions 
> Path=s3://support.elasticmapreduce/spark/install-spark,Args='["-g","-v","1.2.0.a"]'
>  \
>--ami-version 3.3 --instance-groups 
> InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
>InstanceGroupType=CORE,InstanceCount=3,InstanceType=r3.xlarge --name 
> "Spark Issue Repro" \
>--visible-to-all-users --applications Name=Ganglia
> {code}
> Next, copy the attached {{spark-defaults.conf}} to {{~/spark/conf/}}.
> Run {{~/spark/bin/spark-submit gen_test_data.py}} to generate a test data set 
> on HDFS. Then lastly run {{~/spark/bin/spark-submit repro.py}} to reproduce 
> the error.
> Driver and executor logs are attached. For reference, a spark-user thread on 
> the topic is here: 
> http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3cc5a80834-8f1c-4c0a-89f9-e04d3f1c4...@gmail.com%3E






[jira] [Commented] (SPARK-4959) Attributes are case sensitive when using a select query from a projection

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274524#comment-14274524
 ] 

Apache Spark commented on SPARK-4959:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/4013

> Attributes are case sensitive when using a select query from a projection
> -
>
> Key: SPARK-4959
> URL: https://issues.apache.org/jira/browse/SPARK-4959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andy Konwinski
>Priority: Critical
>
> Per [~marmbrus], see this line of code, where we should be using an attribute 
> map
>  
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147
> To reproduce, I ran the following in the Spark shell:
> {code}
> import sqlContext._
> sql("drop table if exists test")
> sql("create table test (col1 string)")
> sql("""insert into table test select "hi" from prejoined limit 1""")
> val projection = "col1".attr.as(Symbol("CaseSensitiveColName")) :: 
> "col1".attr.as(Symbol("CaseSensitiveColName2")) :: Nil
> sqlContext.table("test").select(projection:_*).registerTempTable("test2")
> // This succeeds.
> sql("select CaseSensitiveColName from test2").first()
> // This fails with java.util.NoSuchElementException: key not found: 
> // casesensitivecolname#23046
> sql("select casesensitivecolname from test2").first()
> {code}
> The full stack trace printed for the final command that is failing: 
> {code}
> java.util.NoSuchElementException: key not found: casesensitivecolname#23046
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
>   at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
>   at 
> org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
>   at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
>   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
>   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
>   at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
>   at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
> {code}




[jira] [Created] (SPARK-5209) Jobs fail with "unexpected value" exception in certain environments

2015-01-12 Thread Sven Krasser (JIRA)
Sven Krasser created SPARK-5209:
---

 Summary: Jobs fail with "unexpected value" exception in certain 
environments
 Key: SPARK-5209
 URL: https://issues.apache.org/jira/browse/SPARK-5209
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Amazon Elastic Map Reduce
Reporter: Sven Krasser


Jobs fail consistently and reproducibly with exceptions of the following type 
in PySpark using Spark 1.2.0:

{noformat}
2015-01-13 00:14:05,898 ERROR [Executor task launch worker-1] executor.Executor 
(Logging.scala:logError(96)) - Exception in task 27.0 in stage 0.0 (TID 28)
org.apache.spark.SparkException: PairwiseRDD: unexpected value: 
List([B@4c09f3e0)
{noformat}

The issue appeared the first time in Spark 1.2.0 and is sensitive to the 
environment (configuration, cluster size), i.e. some changes to the environment 
will cause the error to not occur.

The following steps yield a reproduction on Amazon Elastic Map Reduce. Launch 
an EMR cluster with the following parameters (this will bootstrap Spark 1.2.0 
onto it):
{code}
aws emr create-cluster --region us-west-1 --no-auto-terminate \
   --ec2-attributes KeyName=your-key-here,SubnetId=your-subnet-here \
   --bootstrap-actions 
Path=s3://support.elasticmapreduce/spark/install-spark,Args='["-g","-v","1.2.0.a"]'
 \
   --ami-version 3.3 --instance-groups 
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
   InstanceGroupType=CORE,InstanceCount=3,InstanceType=r3.xlarge --name "Spark 
Issue Repro" \
   --visible-to-all-users --applications Name=Ganglia
{code}

Next, copy the attached {{spark-defaults.conf}} to {{~/spark/conf/}}.

Run {{~/spark/bin/spark-submit gen_test_data.py}} to generate a test data set 
on HDFS. Then lastly run {{~/spark/bin/spark-submit repro.py}} to reproduce the 
error.

Driver and executor logs are attached. For reference, a spark-user thread on 
the topic is here: 
http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3cc5a80834-8f1c-4c0a-89f9-e04d3f1c4...@gmail.com%3E






[jira] [Commented] (SPARK-3433) Mima false-positives with @DeveloperAPI and @Experimental annotations

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274507#comment-14274507
 ] 

Apache Spark commented on SPARK-3433:
-

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/2285

> Mima false-positives with @DeveloperAPI and @Experimental annotations
> -
>
> Key: SPARK-3433
> URL: https://issues.apache.org/jira/browse/SPARK-3433
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Josh Rosen
>Assignee: Prashant Sharma
>Priority: Minor
> Fix For: 1.2.0, 1.1.2
>
>
> In https://github.com/apache/spark/pull/2315, I found two cases where 
> {{@DeveloperAPI}} and {{@Experimental}} annotations didn't prevent 
> false-positive warnings from Mima.  To reproduce this problem, run dev/mima 
> as of 
> https://github.com/JoshRosen/spark/commit/ec90e21947b615d4ef94a3a54cfd646924ccaf7c.
>   The spurious warnings are listed at the top of 
> https://gist.github.com/JoshRosen/5d8df835516dc367389d.






[jira] [Updated] (SPARK-3433) Mima false-positives with @DeveloperAPI and @Experimental annotations

2015-01-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3433:
--
Affects Version/s: 1.1.0
Fix Version/s: 1.1.2

I've backported this to {{branch-1.1}} in order to fix a MiMa false-positive in 
that branch.

> Mima false-positives with @DeveloperAPI and @Experimental annotations
> -
>
> Key: SPARK-3433
> URL: https://issues.apache.org/jira/browse/SPARK-3433
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Josh Rosen
>Assignee: Prashant Sharma
>Priority: Minor
> Fix For: 1.2.0, 1.1.2
>
>
> In https://github.com/apache/spark/pull/2315, I found two cases where 
> {{@DeveloperAPI}} and {{@Experimental}} annotations didn't prevent 
> false-positive warnings from Mima.  To reproduce this problem, run dev/mima 
> as of 
> https://github.com/JoshRosen/spark/commit/ec90e21947b615d4ef94a3a54cfd646924ccaf7c.
>   The spurious warnings are listed at the top of 
> https://gist.github.com/JoshRosen/5d8df835516dc367389d.






[jira] [Commented] (SPARK-5208) Add more documentation to Netty-based configs

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274489#comment-14274489
 ] 

Apache Spark commented on SPARK-5208:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/4012

>  Add more documentation to Netty-based configs
> --
>
> Key: SPARK-5208
> URL: https://issues.apache.org/jira/browse/SPARK-5208
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>
> SPARK-4864 added some documentation about Netty-based configs, but I think we 
> need more. I think the following configs can be useful for performance tuning.
> * spark.shuffle.io.mode
> * spark.shuffle.io.backLog
> * spark.shuffle.io.receiveBuffer
> * spark.shuffle.io.sendBuffer






[jira] [Created] (SPARK-5208) Add more documentation to Netty-based configs

2015-01-12 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-5208:
-

 Summary:  Add more documentation to Netty-based configs
 Key: SPARK-5208
 URL: https://issues.apache.org/jira/browse/SPARK-5208
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Kousuke Saruta


SPARK-4864 added some documentation about Netty-based configs, but I think we 
need more. I think the following configs can be useful for performance tuning.

* spark.shuffle.io.mode
* spark.shuffle.io.backLog
* spark.shuffle.io.receiveBuffer
* spark.shuffle.io.sendBuffer
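
For context, a hedged illustration of where such settings would be applied 
(the values below are placeholders for tuning experiments, not 
recommendations):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.io.mode", "NIO")              // placeholder value
  .set("spark.shuffle.io.backLog", "64")            // placeholder value
  .set("spark.shuffle.io.receiveBuffer", "65536")   // placeholder value
  .set("spark.shuffle.io.sendBuffer", "65536")      // placeholder value
{code}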






[jira] [Updated] (SPARK-5208) Add more documentation to Netty-based configs

2015-01-12 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-5208:
--
Issue Type: Improvement  (was: Bug)

>  Add more documentation to Netty-based configs
> --
>
> Key: SPARK-5208
> URL: https://issues.apache.org/jira/browse/SPARK-5208
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Kousuke Saruta
>
> SPARK-4864 added some documentation about Netty-based configs, but I think we 
> need more. I think the following configs can be useful for performance tuning.
> * spark.shuffle.io.mode
> * spark.shuffle.io.backLog
> * spark.shuffle.io.receiveBuffer
> * spark.shuffle.io.sendBuffer






[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2015-01-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4924:
-
Affects Version/s: (was: 1.2.0)
   1.0.0

> Factor out code to launch Spark applications into a separate library
> 
>
> Key: SPARK-4924
> URL: https://issues.apache.org/jira/browse/SPARK-4924
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: spark-launcher.txt
>
>
> One of the questions we run into rather commonly is "how to start a Spark 
> application from my Java/Scala program?". There currently isn't a good answer 
> to that:
> - Instantiating SparkContext has limitations (e.g., you can only have one 
> active context at the moment, plus you lose the ability to submit apps in 
> cluster mode)
> - Calling SparkSubmit directly is doable but you lose a lot of the logic 
> handled by the shell scripts
> - Calling the shell script directly is doable,  but sort of ugly from an API 
> point of view.
> I think it would be nice to have a small library that handles that for users. 
> On top of that, this library could be used by Spark itself to replace a lot 
> of the code in the current shell scripts, which have a lot of duplication.






[jira] [Updated] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2015-01-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4924:
-
Affects Version/s: 1.2.0

> Factor out code to launch Spark applications into a separate library
> 
>
> Key: SPARK-4924
> URL: https://issues.apache.org/jira/browse/SPARK-4924
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: spark-launcher.txt
>
>
> One of the questions we run into rather commonly is "how to start a Spark 
> application from my Java/Scala program?". There currently isn't a good answer 
> to that:
> - Instantiating SparkContext has limitations (e.g., you can only have one 
> active context at the moment, plus you lose the ability to submit apps in 
> cluster mode)
> - Calling SparkSubmit directly is doable but you lose a lot of the logic 
> handled by the shell scripts
> - Calling the shell script directly is doable,  but sort of ugly from an API 
> point of view.
> I think it would be nice to have a small library that handles that for users. 
> On top of that, this library could be used by Spark itself to replace a lot 
> of the code in the current shell scripts, which have a lot of duplication.






[jira] [Commented] (SPARK-5053) Test maintenance branches on Jenkins using SBT

2015-01-12 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274472#comment-14274472
 ] 

Josh Rosen commented on SPARK-5053:
---

I fixed the {{branch-1.1}} PySpark issue in 
https://github.com/apache/spark/pull/4011 and now have to fix a MiMa issue.

> Test maintenance branches on Jenkins using SBT
> --
>
> Key: SPARK-5053
> URL: https://issues.apache.org/jira/browse/SPARK-5053
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Reporter: Josh Rosen
>Priority: Blocker
>
> We need to create Jenkins jobs to test maintenance branches using SBT.  The 
> current Maven jobs for backport branches do not run the same checks that the 
> pull request builder / SBT builds do (e.g. MiMa checks, PySpark, RAT, etc.) 
> which means that cherry-picking backports can silently break things and we'll 
> only discover it once PRs that are explicitly opened against those branches 
> fail tests; this long delay between introducing test failures and detecting 
> them is a huge productivity issue.






[jira] [Resolved] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

2015-01-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3910.
---
  Resolution: Fixed
Target Version/s: 1.2.0, 1.1.2  (was: 1.2.0)

I backported Davies' 1.2 fix to branch-1.1, so I think we can mark this issue 
as resolved: https://github.com/apache/spark/pull/4011

> ./python/pyspark/mllib/classification.py doctests fails with module name 
> pollution
> --
>
> Key: SPARK-3910
> URL: https://issues.apache.org/jira/browse/SPARK-3910
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
> Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
> argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
> pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
> unittest2==0.5.1, wsgiref==0.1.2
>Reporter: Tomohiko K.
>  Labels: pyspark, testing
>
> In ./python/run-tests script, we run the doctests in 
> ./pyspark/mllib/classification.py.
> The output is as following:
> {noformat}
> $ ./python/run-tests
> ...
> Running test: pyspark/mllib/classification.py
> Traceback (most recent call last):
>   File "pyspark/mllib/classification.py", line 20, in 
> import numpy
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py",
>  line 170, in <module>
> from . import add_newdocs
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py",
>  line 13, in <module>
> from numpy.lib import add_newdoc
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py",
>  line 8, in <module>
> from .type_check import *
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py",
>  line 11, in <module>
> import numpy.core.numeric as _nx
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py",
>  line 46, in <module>
> from numpy.testing import Tester
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py",
>  line 13, in <module>
> from .utils import *
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py",
>  line 15, in <module>
> from tempfile import mkdtemp
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py",
>  line 34, in <module>
> from random import Random as _Random
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", 
> line 24, in <module>
> from pyspark.rdd import RDD
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 
> 51, in <module>
> from pyspark.context import SparkContext
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 
> 22, in <module>
> from tempfile import NamedTemporaryFile
> ImportError: cannot import name NamedTemporaryFile
> 0.07 real 0.04 user 0.02 sys
> Had test failures; see logs.
> {noformat}
> The problem is a cyclic import of the tempfile module.
> The cause is that the pyspark.mllib.random module exists in the same 
> directory as the pyspark.mllib.classification module.
> The classification module imports the numpy module, and the numpy module in 
> turn imports the tempfile module.
> Now the first entry of sys.path is the directory "./python/pyspark/mllib" 
> (where the executed file "classification.py" exists), so the tempfile module 
> imports the pyspark.mllib.random module (not the standard library "random" 
> module).
> Finally, the import chain reaches tempfile again, and a cyclic import is 
> formed.
> Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
> → (cyclic import!!)
> Furthermore, the stat module is in the standard library, and a 
> pyspark.mllib.stat module exists. This may also be troublesome.
> commit: 0e8203f4fb721158fb27897680da476174d24c4b
> A fundamental solution is to avoid using module names that are used by the 
> standard library (currently "random" and "stat").
> A difficulty with this solution is that renaming pyspark.mllib.random and 
> pyspark.mllib.stat may break code that already uses them.






[jira] [Updated] (SPARK-4348) pyspark.mllib.random conflicts with random module

2015-01-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4348:
--
Fix Version/s: 1.1.0

I've also fixed this in 1.1.2 by backporting the 1.2 patch:

https://github.com/apache/spark/pull/4011

> pyspark.mllib.random conflicts with random module
> -
>
> Key: SPARK-4348
> URL: https://issues.apache.org/jira/browse/SPARK-4348
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.1.0, 1.2.0
>
>
> There are conflicts in two cases:
> 1. The random module is used by pyspark.mllib.feature; if the first entry of 
> sys.path is not '', then the hack in pyspark/__init__.py will fail to fix the 
> conflict.
> 2. When running tests in mllib/xxx.py, '' should be popped from sys.path before 
> importing anything, or the tests will fail.
> The first case is not fully fixed for users; it will still cause problems in 
> some situations, such as:
> {code}
> >>> import sys
> >>> sys.path.insert(0, PATH_OF_MODULE)
> >>> import pyspark
> >>> # using Word2Vec will fail
> {code}
> I'd like to rename mllib/random.py to mllib/_random.py, and then in 
> mllib/__init__.py:
> {code}
> import pyspark.mllib._random as random
> {code}
> cc [~mengxr] [~dorx]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5049) ParquetTableScan always prepends the values of partition columns in output rows irrespective of the order of the partition columns in the original SELECT query

2015-01-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5049.
-
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

Issue resolved by pull request 3990
[https://github.com/apache/spark/pull/3990]

> ParquetTableScan always prepends the values of partition columns in output 
> rows irrespective of the order of the partition columns in the original 
> SELECT query
> ---
>
> Key: SPARK-5049
> URL: https://issues.apache.org/jira/browse/SPARK-5049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Rahul Aggarwal
> Fix For: 1.3.0, 1.2.1
>
>
> This happens when ParquetTableScan is being used by turning on 
> spark.sql.hive.convertMetastoreParquet
> For example:
> spark-sql> set spark.sql.hive.convertMetastoreParquet=true;
> spark-sql> create table table1(a int , b int) partitioned by (p1 string, p2 
> int) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS  
> INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 
> 'parquet.hive.DeprecatedParquetOutputFormat';
> spark-sql> insert into table table1 partition(p1='January',p2=1) select key, 
> 10  from src;
> spark-sql> select a, b, p1, p2 from table1 limit 10;
> January   1   484 10
> January   1   484 10
> January   1   484 10
> January   1   484 10
> January   1   484 10
> January   1   484 10
> January   1   484 10
> January   1   484 10
> January   1   484 10
> January   1   484 10
> The correct output should be 
> 484   10  January 1
> 484   10  January 1
> 484   10  January 1
> 484   10  January 1
> 484   10  January 1
> 484   10  January 1
> 484   10  January 1
> 484   10  January 1
> 484   10  January 1
> 484   10  January 1
> This also leads to schema mismatch if the query is run using HiveContext and 
> the result is a SchemaRDD.
> For example :
> scala> import org.apache.spark.sql.hive._
> scala> val hc = new HiveContext(sc)
> scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
> scala> val res = hc.sql("select a, b, p1, p2 from table1 limit 10")
> scala> res.collect
> res2: Array[org.apache.spark.sql.Row] = Array([January,1,238,10], 
> [January,1,86,10], [January,1,311,10], [January,1,27,10], [January,1,165,10], 
> [January,1,409,10], [January,1,255,10], [January,1,278,10], 
> [January,1,98,10], [January,1,484,10])
> scala> res.schema
> res5: org.apache.spark.sql.StructType = 
> StructType(ArrayBuffer(StructField(a,IntegerType,true), 
> StructField(b,IntegerType,true), StructField(p1,StringType,true), 
> StructField(p2,IntegerType,true)))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2015-01-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1239:
--
Assignee: (was: Josh Rosen)

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4348) pyspark.mllib.random conflicts with random module

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274333#comment-14274333
 ] 

Apache Spark commented on SPARK-4348:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4011

> pyspark.mllib.random conflicts with random module
> -
>
> Key: SPARK-4348
> URL: https://issues.apache.org/jira/browse/SPARK-4348
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.2.0
>
>
> There are conflicts in two cases:
> 1. The random module is used by pyspark.mllib.feature; if the first entry of 
> sys.path is not '', then the hack in pyspark/__init__.py will fail to fix the 
> conflict.
> 2. When running tests in mllib/xxx.py, '' should be popped from sys.path before 
> importing anything, or the tests will fail.
> The first case is not fully fixed for users; it will still cause problems in 
> some situations, such as:
> {code}
> >>> import sys
> >>> sys.path.insert(0, PATH_OF_MODULE)
> >>> import pyspark
> >>> # using Word2Vec will fail
> {code}
> I'd like to rename mllib/random.py to mllib/_random.py, and then in 
> mllib/__init__.py:
> {code}
> import pyspark.mllib._random as random
> {code}
> cc [~mengxr] [~dorx]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5207) StandardScalerModel mean and variance re-use

2015-01-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274335#comment-14274335
 ] 

Xiangrui Meng commented on SPARK-5207:
--

[~ogeagla] I've assigned this ticket to you. The constructor currently takes 
withMean, withStd, mean, and std. We may want to consider changing the ordering 
of the parameters or providing auxiliary constructors. For example, we could have

StandardScalerModel(mean, std)

and then make withMean and withStd configurable via setters:

setWithMean
setWithStd

I'm just offering one option here. [~dbtsai] implemented this feature, so he may 
want to add more.
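
A minimal Scala sketch of the API shape proposed above, assuming MLlib's {{Vector}} type for mean and std; the class body and the defaults shown are illustrative only, not the final implementation:

{code}
import org.apache.spark.mllib.linalg.Vector

// Sketch only: the constructor takes the statistics, behavior flags move to setters.
class StandardScalerModel(val mean: Vector, val std: Vector) {
  // Illustrative defaults; the actual defaults are up to the implementation.
  private var withMean: Boolean = false
  private var withStd: Boolean = true

  def setWithMean(value: Boolean): this.type = { withMean = value; this }
  def setWithStd(value: Boolean): this.type = { withStd = value; this }

  // transform(vector: Vector): Vector would then center/scale according to the flags.
}
{code}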

> StandardScalerModel mean and variance re-use
> 
>
> Key: SPARK-5207
> URL: https://issues.apache.org/jira/browse/SPARK-5207
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Octavian Geagla
>Assignee: Octavian Geagla
>
> From this discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html
> Changing constructor to public would be a simple change, but a discussion is 
> needed to determine what args necessary for this change.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4821) pyspark.mllib.rand docs not generated correctly

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274334#comment-14274334
 ] 

Apache Spark commented on SPARK-4821:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4011

> pyspark.mllib.rand docs not generated correctly
> ---
>
> Key: SPARK-4821
> URL: https://issues.apache.org/jira/browse/SPARK-4821
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.3.0, 1.2.1
>
>
> spark/python/docs/pyspark.mllib.rst needs to be updated to reflect the change 
> in package names from pyspark.mllib.random to .rand
> Otherwise, the Python API docs are empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5207) StandardScalerModel mean and variance re-use

2015-01-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5207:
-
Target Version/s: 1.3.0

> StandardScalerModel mean and variance re-use
> 
>
> Key: SPARK-5207
> URL: https://issues.apache.org/jira/browse/SPARK-5207
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Octavian Geagla
>
> From this discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html
> Changing constructor to public would be a simple change, but a discussion is 
> needed to determine what args necessary for this change.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5207) StandardScalerModel mean and variance re-use

2015-01-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5207:
-
Issue Type: Improvement  (was: Wish)

> StandardScalerModel mean and variance re-use
> 
>
> Key: SPARK-5207
> URL: https://issues.apache.org/jira/browse/SPARK-5207
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Octavian Geagla
>
> From this discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html
> Changing constructor to public would be a simple change, but a discussion is 
> needed to determine what args necessary for this change.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5207) StandardScalerModel mean and variance re-use

2015-01-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5207:
-
Assignee: Octavian Geagla

> StandardScalerModel mean and variance re-use
> 
>
> Key: SPARK-5207
> URL: https://issues.apache.org/jira/browse/SPARK-5207
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Octavian Geagla
>Assignee: Octavian Geagla
>
> From this discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html
> Changing constructor to public would be a simple change, but a discussion is 
> needed to determine what args necessary for this change.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4667) Spillable can request more than twice its current memory from pool

2015-01-12 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams closed SPARK-4667.

Resolution: Not a Problem

> Spillable can request more than twice its current memory from pool
> --
>
> Key: SPARK-4667
> URL: https://issues.apache.org/jira/browse/SPARK-4667
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ryan Williams
>
> [Spillable|https://github.com/apache/spark/blob/0eb4a7fb0fa1fa56677488cbd74eb39e65317621/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L78]
>  has a comment that says "{{Claim up to double our current memory from the 
> shuffle memory pool}}", but it then proceeds to request {{2 * currentMemory - 
> myMemoryThreshold}}, which can be more than double the memory it currently 
> holds from the pool. Either the requested amount or the comment should be changed.
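
A small arithmetic sketch with hypothetical numbers, plugged into the formula quoted above. Note that granting the full request brings the total claimed to exactly {{2 * currentMemory}}, which matches the comment's stated intent (and may be why the issue was closed as Not a Problem):

{code}
// Hypothetical numbers, using the formula quoted in the description.
val currentMemory = 100L << 20      // estimated size of the in-memory collection: 100 MB
val myMemoryThreshold = 30L << 20   // memory already claimed from the pool: 30 MB

val amountToRequest = 2 * currentMemory - myMemoryThreshold
// amountToRequest = 170 MB: more than twice the 30 MB currently held from the pool,
// but if the request is granted in full, the total claimed becomes
// myMemoryThreshold + amountToRequest = 200 MB = 2 * currentMemory.
println(s"request = ${amountToRequest >> 20} MB")
{code}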



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4004) add akka-persistence based recovery mechanism for Master (maybe Worker)

2015-01-12 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274299#comment-14274299
 ] 

Nan Zhu edited comment on SPARK-4004 at 1/12/15 10:30 PM:
--

I'd close the PR, as I saw some discussion in 
https://github.com/apache/spark/pull/3825 stating that we would introduce fewer 
Akka features to make it easier to replace Akka with Spark's RPC framework.


was (Author: codingcat):
I'd close the PR, as I saw some discussion in 
https://github.com/apache/spark/pull/3825 stating that we would introduce fewer 
Akka features to make it easier to replace Akka with Spark's own RPC framework.

> add akka-persistence based recovery mechanism for Master (maybe Worker)
> ---
>
> Key: SPARK-4004
> URL: https://issues.apache.org/jira/browse/SPARK-4004
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Nan Zhu
>
> Since we have upgraded the Akka version to 2.3.x, we can utilize features that 
> are actually helpful in many applications. For example, by using persistence we 
> can add an akka-persistence recovery mechanism to the Master (maybe also the 
> Worker, but I'm not sure there is much to recover there).
> This would offer better performance and more flexibility than the current 
> file-based persistence engine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4004) add akka-persistence based recovery mechanism for Master (maybe Worker)

2015-01-12 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu closed SPARK-4004.
--
Resolution: Won't Fix

I'd close the PR, as I saw some discussion in 
https://github.com/apache/spark/pull/3825 stating that we would introduce fewer 
Akka features to make it easier to replace Akka with Spark's own RPC framework.

> add akka-persistence based recovery mechanism for Master (maybe Worker)
> ---
>
> Key: SPARK-4004
> URL: https://issues.apache.org/jira/browse/SPARK-4004
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Nan Zhu
>
> Since we have upgraded the Akka version to 2.3.x, we can utilize features that 
> are actually helpful in many applications. For example, by using persistence we 
> can add an akka-persistence recovery mechanism to the Master (maybe also the 
> Worker, but I'm not sure there is much to recover there).
> This would offer better performance and more flexibility than the current 
> file-based persistence engine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-12 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274268#comment-14274268
 ] 

Chip Senkbeil commented on SPARK-4923:
--

Okay, I'll do that and update this JIRA once I've submitted the pull request.

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment have been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274263#comment-14274263
 ] 

Patrick Wendell commented on SPARK-4923:


[~senkwich] definitely prefer github.

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment have been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-12 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274253#comment-14274253
 ] 

Chip Senkbeil edited comment on SPARK-4923 at 1/12/15 10:07 PM:


[~pwendell], I can definitely do that. Would you prefer a patch in the same 
form as the one attached? Or would it be better to create a pull request on 
Github for this with the changes?


was (Author: senkwich):
[~pwendell], I can definitely do that. Would you prefer a patch in the same 
form as the one attached? Or would it be better to create a pull request for 
this?

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment have been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-12 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274253#comment-14274253
 ] 

Chip Senkbeil commented on SPARK-4923:
--

[~pwendell], I can definitely do that. Would you prefer a patch in the same 
form as the one attached? Or would it be better to create a pull request for 
this?

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment have been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274239#comment-14274239
 ] 

Patrick Wendell edited comment on SPARK-4923 at 1/12/15 9:58 PM:
-

Hey All,

Sorry this has caused a disruption. As I said in the earlier comment, if anyone 
on these projects can submit a patch that locks down the visibility in that 
package and opens up the things that are specifically needed, I'm fine with 
continuing to publish it (and doing so retroactively for 1.2). We just need to 
look closely at what we are exposing, because this package currently violates 
Spark's API policy. Because the Scala repl does not itself offer any kind of API 
stability, it will be hard for Spark to do the same. But I think it's fine to 
just annotate and expose unstable APIs here, provided projects understand the 
implications of depending on them.

[~senkwich] - since you guys are probably the heaviest user, would you be 
willing to take a crack at this? Basically, start by making everything private 
and then go and unlock the things you need as Developer APIs.

- Patrick


was (Author: pwendell):
Hey All,

Sorry this has caused a disruption. As I said in the earlier comment, if anyone 
on these projects can submit a patch that locks down the visibility in that 
package and opens up the things that are specifically needed, I'm fine with 
continuing to publish it (and doing so retroactively for 1.2). We just need to 
look closely at what we are exposing, because this package currently violates 
Spark's API policy. Because the Scala repl does not itself offer any kind of API 
stability, it will be hard for Spark to do the same. But I think it's fine to 
just annotate and expose unstable APIs here, provided projects understand the 
implications of depending on them.

Chi - since you guys are probably the heaviest user, would you be willing to 
take a crack at this? Basically, start by making everything private and then go 
and unlock the things you need as Developer APIs.

- Patrick

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment have been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274239#comment-14274239
 ] 

Patrick Wendell commented on SPARK-4923:


Hey All,

Sorry this has caused a disruption. As I said in the earlier comment, if anyone 
on these projects can submit a patch that locks down the visibility in that 
package and opens up the things that are specifically needed, I'm fine with 
continuing to publish it (and doing so retroactively for 1.2). We just need to 
look closely at what we are exposing, because this package currently violates 
Spark's API policy. Because the Scala repl does not itself offer any kind of API 
stability, it will be hard for Spark to do the same. But I think it's fine to 
just annotate and expose unstable APIs here, provided projects understand the 
implications of depending on them.

Chi - since you guys are probably the heaviest user, would you be willing to 
take a crack at this? Basically, start by making everything private and then go 
and unlock the things you need as Developer APIs.

- Patrick
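
A minimal Scala sketch of the approach described above, with purely illustrative names (these are not the actual repl classes): default everything to {{private[repl]}}, then deliberately re-expose the few entry points downstream projects need, marked as unstable developer APIs.

{code}
package org.apache.spark.repl

import org.apache.spark.annotation.DeveloperApi

// Hidden by default: internal plumbing stays private to the repl package.
private[repl] class InterpreterInternals

// Deliberately unlocked: an unstable, explicitly annotated entry point.
@DeveloperApi
class ReplEntryPoint {
  def interpret(line: String): Unit = ???   // placeholder body for the sketch
}
{code}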

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment have been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4296) Throw "Expression not in GROUP BY" when using same expression in group by clause and select clause

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274227#comment-14274227
 ] 

Apache Spark commented on SPARK-4296:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4010

> Throw "Expression not in GROUP BY" when using same expression in group by 
> clause and  select clause
> ---
>
> Key: SPARK-4296
> URL: https://issues.apache.org/jira/browse/SPARK-4296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Shixiong Zhu
>Assignee: Cheng Lian
>Priority: Blocker
>
> When the input data has a complex structure, using the same expression in the 
> group by clause and the select clause will throw "Expression not in GROUP BY".
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Birthday(date: String)
> case class Person(name: String, birthday: Birthday)
> val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), 
> Person("Jim", Birthday("1980-02-28"
> people.registerTempTable("people")
> val year = sqlContext.sql("select count(*), upper(birthday.date) from people 
> group by upper(birthday.date)")
> year.collect
> {code}
> Here is the plan of year:
> {code:java}
> SchemaRDD[3] at RDD at SchemaRDD.scala:105
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
> not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
> Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
> AS date#9) AS c1#3]
>  Subquery people
>   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
> ExistingRDD.scala:36
> {code}
> The bug is the equality test for `Upper(birthday#1.date)` and 
> `Upper(birthday#1.date AS date#9)`.
> Maybe Spark SQL needs a mechanism to compare an Alias expression with a 
> non-Alias expression.
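
A toy Scala sketch of that kind of mechanism, using made-up case classes rather than Catalyst's real expression tree: compare the GROUP BY expression and the SELECT expression only after stripping aliases.

{code}
// Toy sketch, not Catalyst's real expression classes.
object AliasInsensitiveEquality extends App {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Upper(child: Expr) extends Expr
  case class Alias(child: Expr, alias: String) extends Expr

  // Recursively drop Alias wrappers before comparing structurally.
  def stripAliases(e: Expr): Expr = e match {
    case Alias(child, _) => stripAliases(child)
    case Upper(child)    => Upper(stripAliases(child))
    case other           => other
  }

  val groupByExpr = Upper(Attr("birthday#1.date"))
  val selectExpr  = Upper(Alias(Attr("birthday#1.date"), "date#9"))
  // Equal once aliases are ignored, which is how the two plan expressions above relate.
  assert(stripAliases(groupByExpr) == stripAliases(selectExpr))
}
{code}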



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5172) spark-examples-***.jar shades a wrong Hadoop distribution

2015-01-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5172.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Sean Owen

> spark-examples-***.jar shades a wrong Hadoop distribution
> -
>
> Key: SPARK-5172
> URL: https://issues.apache.org/jira/browse/SPARK-5172
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Shixiong Zhu
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.3.0
>
>
> Steps to check it:
> 1. Download  "spark-1.2.0-bin-hadoop2.4.tgz" from 
> http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
> 2. unzip `spark-examples-1.2.0-hadoop2.4.0.jar`.
> 3. There is a file called `org/apache/hadoop/package-info.class` in the jar. 
> It doesn't exist in hadoop 2.4. 
> 4. Run "javap -classpath . -private -c -v  org.apache.hadoop.package-info"
> {code}
> Compiled from "package-info.java"
> interface org.apache.hadoop.package-info
>   SourceFile: "package-info.java"
>   RuntimeVisibleAnnotations: length = 0x24
>00 01 00 06 00 06 00 07 73 00 08 00 09 73 00 0A
>00 0B 73 00 0C 00 0D 73 00 0E 00 0F 73 00 10 00
>11 73 00 12 
>   minor version: 0
>   major version: 50
>   Constant pool:
> const #1 = Asciz  org/apache/hadoop/package-info;
> const #2 = class  #1; //  "org/apache/hadoop/package-info"
> const #3 = Asciz  java/lang/Object;
> const #4 = class  #3; //  java/lang/Object
> const #5 = Asciz  package-info.java;
> const #6 = Asciz  Lorg/apache/hadoop/HadoopVersionAnnotation;;
> const #7 = Asciz  version;
> const #8 = Asciz  1.2.1;
> const #9 = Asciz  revision;
> const #10 = Asciz 1503152;
> const #11 = Asciz user;
> const #12 = Asciz mattf;
> const #13 = Asciz date;
> const #14 = Asciz Wed Jul 24 13:39:35 PDT 2013;
> const #15 = Asciz url;
> const #16 = Asciz 
> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2;
> const #17 = Asciz srcChecksum;
> const #18 = Asciz 6923c86528809c4e7e6f493b6b413a9a;
> const #19 = Asciz SourceFile;
> const #20 = Asciz RuntimeVisibleAnnotations;
> {
> }
> {code}
> The version is {{1.2.1}}
> This happens because of a wrong hbase version setting in the examples project. 
> Here is part of the dependency tree when running "mvn -Pyarn -Phadoop-2.4 
> -Dhadoop.version=2.4.0 -pl examples dependency:tree"
> {noformat}
> [INFO] +- org.apache.hbase:hbase-testing-util:jar:0.98.7-hadoop1:compile
> [INFO] |  +- 
> org.apache.hbase:hbase-common:test-jar:tests:0.98.7-hadoop1:compile
> [INFO] |  +- 
> org.apache.hbase:hbase-server:test-jar:tests:0.98.7-hadoop1:compile
> [INFO] |  |  +- com.sun.jersey:jersey-core:jar:1.8:compile
> [INFO] |  |  +- com.sun.jersey:jersey-json:jar:1.8:compile
> [INFO] |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
> [INFO] |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
> [INFO] |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.7.1:compile
> [INFO] |  |  \- com.sun.jersey:jersey-server:jar:1.8:compile
> [INFO] |  | \- asm:asm:jar:3.3.1:test
> [INFO] |  +- org.apache.hbase:hbase-hadoop1-compat:jar:0.98.7-hadoop1:compile
> [INFO] |  +- 
> org.apache.hbase:hbase-hadoop1-compat:test-jar:tests:0.98.7-hadoop1:compile
> [INFO] |  +- org.apache.hadoop:hadoop-core:jar:1.2.1:compile
> [INFO] |  |  +- xmlenc:xmlenc:jar:0.52:compile
> [INFO] |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
> [INFO] |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
> [INFO] |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
> [INFO] |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
> [INFO] |  |  \- commons-el:commons-el:jar:1.0:compile
> [INFO] |  +- org.apache.hadoop:hadoop-test:jar:1.2.1:compile
> [INFO] |  |  +- org.apache.ftpserver:ftplet-api:jar:1.0.0:compile
> [INFO] |  |  +- org.apache.mina:mina-core:jar:2.0.0-M5:compile
> [INFO] |  |  +- org.apache.ftpserver:ftpserver-core:jar:1.0.0:compile
> [INFO] |  |  \- org.apache.ftpserver:ftpserver-deprecated:jar:1.0.0-M2:compile
> [INFO] |  +- 
> com.github.stephenc.findbugs:findbugs-annotations:jar:1.3.9-1:compile
> [INFO] |  \- junit:junit:jar:4.10:test
> [INFO] | \- org.hamcrest:hamcrest-core:jar:1.1:test
> {noformat}
> If I run `mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -pl examples -am 
> dependency:tree -Dhbase.profile=hadoop2`, the dependency tree is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5078) Allow setting Akka host name from env vars

2015-01-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5078.

   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

> Allow setting Akka host name from env vars
> --
>
> Key: SPARK-5078
> URL: https://issues.apache.org/jira/browse/SPARK-5078
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.3.0, 1.2.1
>
>
> Currently, Spark lets you set the IP address using SPARK_LOCAL_IP, but this is 
> given to Akka only after doing a reverse DNS lookup, which makes it difficult 
> to run Spark in Docker. You can already change the hostname that is used 
> programmatically, but it would be nice to be able to do this with an 
> environment variable.
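
For reference, a minimal sketch of the programmatic route mentioned above, assuming the standard {{spark.driver.host}} setting; the hostname used here is hypothetical, and this config-based approach is not taken from this issue:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Pin the hostname the driver advertises to Akka via configuration.
// "spark-driver.example.internal" is a hypothetical container hostname.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("docker-host-example")
  .set("spark.driver.host", "spark-driver.example.internal")
val sc = new SparkContext(conf)
{code}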



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274050#comment-14274050
 ] 

Reynold Xin commented on SPARK-5097:


[~mohitjaggi] thanks for commenting. The implementation is actually pretty 
minor (it is mostly about finalizing the API). It would be great if you can 
review the design doc and chime in, and later on also review my initial pull 
request. Once the first pull request is in, I'm sure we will have more 
splittable tasks.

> Adding data frame APIs to SchemaRDD
> ---
>
> Key: SPARK-5097
> URL: https://issues.apache.org/jira/browse/SPARK-5097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
> Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf
>
>
> SchemaRDD, through its DSL, already provides common data frame 
> functionalities. However, the DSL was originally created for constructing 
> test cases without much end-user usability and API stability consideration. 
> This design doc proposes a set of API changes for Scala and Python to make 
> the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5063) Raise more helpful errors when RDD actions or transformations are called inside of transformations

2015-01-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5063:
--
Description: 
Spark does not support nested RDDs or performing Spark actions inside of 
transformations; this usually leads to NullPointerExceptions (see SPARK-718 as 
one example).  The confusing NPE is one of the most common sources of Spark 
questions on StackOverflow:

- 
https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534
- 
https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399
- 
https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674

(those are just a sample of the ones that I've answered personally; there are 
many others).

I think we can detect these errors by adding logic to {{RDD}} to check whether 
{{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to 
add a better error message.

In PySpark, these errors manifest themselves slightly differently.  Attempting 
to nest RDDs or perform actions inside of transformations results in 
pickle-time errors:

{code}
rdd1 = sc.parallelize(range(100))
rdd2 = sc.parallelize(range(100))
rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)])
{code}

produces

{code}
[...]
  File "/Users/joshrosen/anaconda/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
  File 
"/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
 line 304, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. 
Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}

We get the same error when attempting to broadcast an RDD in PySpark.  For 
Python, improved error reporting could be as simple as overriding the 
{{__getnewargs__}} method to throw a more useful UnsupportedOperation exception 
with a more helpful error message.

  was:
Spark does not support nested RDDs or performing Spark actions inside of 
transformations; this usually leads to NullPointerExceptions (see SPARK-718 as 
one example).  The confusing NPE is one of the most common sources of Spark 
questions on StackOverflow:

- 
https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534
- 
https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399
- 
https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674

(those are just a sample of the ones that I've answered personally; there are 
many others).

I think we can detect these errors by adding logic to {{RDD}} to check whether 
{{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to 
add a better error message.


> Raise more helpful errors when RDD actions or transformations are called 
> inside of transformations
> --
>
> Key: SPARK-5063
> URL: https://issues.apache.org/jira/browse/SPARK-5063
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Spark does not support nested RDDs or performing Spark actions inside of 
> transformations; this usually leads to NullPointerExceptions (see SPARK-718 
> as one example).  The confusing NPE is one of the most common sources of 
> Spark questions on StackOverflow:
> - 
> https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534
> - 
> https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399
> - 
> https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674
> (those are just a sample of the ones that I've answered personally; there are 
> many others).
> I think we can detect these errors by adding logic to {{RDD}} to check 
> whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use 
> this to add a better error message.
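
A minimal Scala sketch of that idea, using an illustrative class name rather than the actual RDD source: turn the context reference into a getter that fails with a descriptive message instead of a bare NullPointerException.

{code}
import org.apache.spark.{SparkContext, SparkException}

// Sketch only: the @transient field is not serialized, so it is null after
// deserialization on executors, letting the getter detect misuse inside
// transformations and explain it.
abstract class InstrumentedRDD[T](@transient private var _sc: SparkContext) {
  def sc: SparkContext = {
    if (_sc == null) {
      throw new SparkException(
        "RDD transformations and actions can only be invoked by the driver, not " +
        "inside of other transformations; the SparkContext reference is null here.")
    }
    _sc
  }
}
{code}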

[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-01-12 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274019#comment-14274019
 ] 

Josh Rosen commented on SPARK-4879:
---

I think that part of the reproduction issues that I had might have been due to 
{{attemptId}} returning a unique task attempt ID rather than the attempt 
number, meaning that only the _first_ run of that test in the REPL would be 
capable of uncovering the bug.

See https://github.com/apache/spark/pull/3849 / SPARK-4014 for more context.  
I'm going to try to merge that patch today, which will let me write a reliable 
regression test.

> Missing output partitions after job completes with speculative execution
> 
>
> Key: SPARK-4879
> URL: https://issues.apache.org/jira/browse/SPARK-4879
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Attachments: speculation.txt, speculation2.txt
>
>
> When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
> save output files may report that they have completed successfully even 
> though some output partitions written by speculative tasks may be missing.
> h3. Reproduction
> This symptom was reported to me by a Spark user and I've been doing my own 
> investigation to try to come up with an in-house reproduction.
> I'm still working on a reliable local reproduction for this issue, which is a 
> little tricky because Spark won't schedule speculated tasks on the same host 
> as the original task, so you need an actual (or containerized) multi-host 
> cluster to test speculation.  Here's a simple reproduction of some of the 
> symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
> spark.speculation=true}}:
> {code}
> // Rig a job such that all but one of the tasks complete instantly
> // and one task runs for 20 seconds on its first attempt and instantly
> // on its second attempt:
> val numTasks = 100
> sc.parallelize(1 to numTasks, 
> numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
>   if (ctx.partitionId == 0) {  // If this is the one task that should run 
> really slow
> if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
>  Thread.sleep(20 * 1000)
> }
>   }
>   iter
> }.map(x => (x, x)).saveAsTextFile("/test4")
> {code}
> When I run this, I end up with a job that completes quickly (due to 
> speculation) but reports failures from the speculated task:
> {code}
> [...]
> 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
> 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
> (100/100)
> 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
> <console>:22) finished in 0.856 s
> 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
> <console>:22, took 0.885438374 s
> 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
> for 70.1 in stage 3.0 because task 70 has already completed successfully
> scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
> stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
> java.io.IOException: Failed to save output of task: 
> attempt_201412110141_0003_m_49_413
> 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
> 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
> 
> org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> One interesting thing to note about this stack trace: if we look at 
> {{FileOutputCommitter.java:160}} 
> ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]),
>  this point in the execution seems to cor

[jira] [Created] (SPARK-5207) StandardScalerModel mean and variance re-use

2015-01-12 Thread Octavian Geagla (JIRA)
Octavian Geagla created SPARK-5207:
--

 Summary: StandardScalerModel mean and variance re-use
 Key: SPARK-5207
 URL: https://issues.apache.org/jira/browse/SPARK-5207
 Project: Spark
  Issue Type: Wish
  Components: MLlib
Reporter: Octavian Geagla


>From this discussion: 
>http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html

Changing constructor to public would be a simple change, but a discussion is 
needed to determine what args necessary for this change.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-12 Thread Mohit Jaggi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273994#comment-14273994
 ] 

Mohit Jaggi commented on SPARK-5097:


Hi,
This is Mohit Jaggi, the author of bigdf 
(https://github.com/AyasdiOpenSource/bigdf). Matei had suggested integrating 
bigdf with SchemaRDD, and I was planning on doing that soon.
I would love to contribute to this item. Most of the constructs mentioned in 
the design document already exist in bigdf.

Mohit.

> Adding data frame APIs to SchemaRDD
> ---
>
> Key: SPARK-5097
> URL: https://issues.apache.org/jira/browse/SPARK-5097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
> Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf
>
>
> SchemaRDD, through its DSL, already provides common data frame 
> functionalities. However, the DSL was originally created for constructing 
> test cases without much end-user usability and API stability consideration. 
> This design doc proposes a set of API changes for Scala and Python to make 
> the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2584) Do not mutate block storage level on the UI

2015-01-12 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273877#comment-14273877
 ] 

Ilya Ganelin edited comment on SPARK-2584 at 1/12/15 7:08 PM:
--

Understood, I am able to recreate this issue in 1.1. I'll work on a fix to 
clarify what's going on. Thanks.



was (Author: ilganeli):
Understood, I was looking at the UI for Spark 1.1 and did not see the block 
storage level represented as MEMORY_AND_DISK or DISK_ONLY. It's now presented 
as Memory Deserialized or Disk Deserialized. I'll attempt to recreate this 
problem in the newer version of Spark but wanted to know if you've seen it 
since 1.0.1. 

> Do not mutate block storage level on the UI
> ---
>
> Key: SPARK-2584
> URL: https://issues.apache.org/jira/browse/SPARK-2584
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.1
>Reporter: Andrew Or
>
> If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes 
> DISK_ONLY on the UI. We should preserve the original storage level proposed 
> by the user, in addition to showing the change in the actual storage level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273957#comment-14273957
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

Thanks [~nchammas] for the benchmarks. This is looking good. Just curious about 
one thing in the spark-packer.json file: where does the `create_image.sh` in 
https://github.com/nchammas/spark-ec2/blob/273c8c518fbc6e86e0fb4410efbe77a4d4e4ff5b/packer/spark-packer.json#L66
 come from? Is it the same file as in the current spark-ec2 repo?


> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo

2015-01-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5102.

   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

Fixed by: https://github.com/apache/spark/pull/4007

> CompressedMapStatus needs to be registered with Kryo
> 
>
> Key: SPARK-5102
> URL: https://issues.apache.org/jira/browse/SPARK-5102
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Daniel Darabos
>Assignee: Lianhui Wang
>Priority: Minor
> Fix For: 1.3.0, 1.2.1
>
>
> After upgrading from Spark 1.1.0 to 1.2.0 I got this exception:
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is 
> not registered: org.apache.spark.scheduler.CompressedMapStatus
> Note: To register this class use: 
> kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class);
>   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with 
> Kryo. I think this should be done in 
> {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are 
> not expected to be sent over the wire. (Maybe I'm doing something wrong?)
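
A minimal sketch of the user-side workaround described above, assuming the standard {{spark.kryo.registrator}} mechanism; since the class in the stack trace is internal to Spark, it is looked up by name here rather than referenced directly:

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Registers the internal class reported in the stack trace above.
class MapStatusRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(Class.forName("org.apache.spark.scheduler.CompressedMapStatus"))
  }
}

// Then, in the application configuration:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MapStatusRegistrator].getName)
{code}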



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo

2015-01-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5102:
---
Target Version/s: 1.2.1
Assignee: Lianhui Wang

> CompressedMapStatus needs to be registered with Kryo
> 
>
> Key: SPARK-5102
> URL: https://issues.apache.org/jira/browse/SPARK-5102
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Daniel Darabos
>Assignee: Lianhui Wang
>Priority: Minor
>
> After upgrading from Spark 1.1.0 to 1.2.0 I got this exception:
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is 
> not registered: org.apache.spark.scheduler.CompressedMapStatus
> Note: To register this class use: 
> kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class);
>   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with 
> Kryo. I think this should be done in 
> {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are 
> not expected to be sent over the wire. (Maybe I'm doing something wrong?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5200) Disable web UI in Hive Thriftserver tests

2015-01-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5200.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0
   1.1.2

Issue resolved by pull request 3998
[https://github.com/apache/spark/pull/3998]

> Disable web UI in Hive Thriftserver tests
> -
>
> Key: SPARK-5200
> URL: https://issues.apache.org/jira/browse/SPARK-5200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>  Labels: flaky-test
> Fix For: 1.1.2, 1.3.0, 1.2.1
>
>
> In our unit tests, we should disable the Spark Web UI when starting the Hive 
> Thriftserver, since port contention during this test has been a cause of test 
> failures on Jenkins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint

2015-01-12 Thread vincent ye (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vincent ye updated SPARK-5206:
--
Description: 
I got the following exception while my streaming application restarts from a crash,
recovering from the checkpoint:

15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR 
scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4)
java.util.NoSuchElementException: key not found: 1
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



I guess that an Accumulator is registered to the singleton Accumulators object at line
58 of org.apache.spark.Accumulable:
Accumulators.register(this, true)
This code needs to be executed in the driver once. But when the application is
recovered from a checkpoint, it won't be executed in the driver again. So when the
driver processes the task completion at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938),
it can't find the Accumulator, because it was not re-registered during the recovery.


  was:
I got the following exception while my streaming application restarts from a crash,
recovering from the checkpoint:

15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR 
scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4)
java.util.NoSuchElementException: key not found: 1
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




> Accumulators are not re-registered during recovering from checkpoint
> 
>
>   

[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint

2015-01-12 Thread vincent ye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273927#comment-14273927
 ] 

vincent ye commented on SPARK-5206:
---

I guess that an Accumulator is registered to the singleton Accumulators object at line
58 of org.apache.spark.Accumulable:
Accumulators.register(this, true)

This code needs to be executed in the driver once. But when the application is
recovered from a checkpoint, it won't be executed in the driver again. So when the
driver processes the task completion at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938),
it can't find the Accumulator, because it was not re-registered during the recovery.
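
One workaround sketch for the streaming case (illustrative; {{RecoveredCounter}} and its
wiring are not part of Spark): create the accumulator lazily through a singleton, so the
recovered driver calls {{sc.accumulator(...)}} -- and therefore
{{Accumulators.register(...)}} -- again after the restart.

{code}
import org.apache.spark.{Accumulator, SparkContext}
import org.apache.spark.SparkContext._

// Lazily (re)created accumulator, so recovery from a checkpoint re-registers it.
object RecoveredCounter {
  @volatile private var instance: Accumulator[Long] = null

  def getInstance(sc: SparkContext): Accumulator[Long] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.accumulator(0L, "RecoveredCounter")
        }
      }
    }
    instance
  }
}
{code}

Inside the streaming job (e.g. in foreachRDD) the accumulator is then obtained via
{{RecoveredCounter.getInstance(rdd.sparkContext)}} instead of being captured in a
closure at program start.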

> Accumulators are not re-registered during recovering from checkpoint
> 
>
> Key: SPARK-5206
> URL: https://issues.apache.org/jira/browse/SPARK-5206
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: vincent ye
>
> I got the following exception while my streaming application restarts from a
> crash, recovering from the checkpoint:
> 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR 
> scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 
> 4)
> java.util.NoSuchElementException: key not found: 1
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3450) Enable specifiying the --jars CLI option multiple times

2015-01-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273924#comment-14273924
 ] 

Marcelo Vanzin commented on SPARK-3450:
---

[~pwendell] if your only concern is complicating the parsing, this is probably 
a one-line change in SparkSubmitArguments.scala. It wouldn't complicate anything.
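
For illustration only (this is not the actual SparkSubmitArguments code; the names
below are made up), the append-instead-of-overwrite idea could look roughly like:

{code}
object JarsOption {
  // Illustrative helper: merge a repeated --jars value into the accumulated one.
  def merge(current: String, value: String): String =
    if (current == null || current.isEmpty) value else current + "," + value
}
// e.g. in the option parser: jars = JarsOption.merge(jars, value)
{code}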

> Enable specifiying the --jars CLI option multiple times
> ---
>
> Key: SPARK-3450
> URL: https://issues.apache.org/jira/browse/SPARK-3450
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.2
>Reporter: wolfgang hoschek
>
> spark-submit should support specifying the --jars option multiple times, e.g. 
> --jars foo.jar,bar.jar --jars baz.jar,oops.jar should be equivalent to --jars 
> foo.jar,bar.jar,baz.jar,oops.jar
> This would allow using wrapper scripts that simplify usage for enterprise 
> customers along the following lines:
> {code}
> my-spark-submit.sh:
> jars=
> for i in /opt/myapp/*.jar; do
>   # append a comma only when something has already been collected
>   if [ -n "$jars" ]; then
>     jars="$jars,"
>   fi
>   jars="$jars$i"
> done
> spark-submit --jars "$jars" "$@"
> {code}
> Example usage:
> {code}
> my-spark-submit.sh --jars myUserDefinedFunction.jar 
> {code}
> The relevant enhancement code might go into SparkSubmitArguments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint

2015-01-12 Thread vincent ye (JIRA)
vincent ye created SPARK-5206:
-

 Summary: Accumulators are not re-registered during recovering from 
checkpoint
 Key: SPARK-5206
 URL: https://issues.apache.org/jira/browse/SPARK-5206
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.1.0
Reporter: vincent ye


I got the following exception while my streaming application restarts from a crash,
recovering from the checkpoint:

15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR 
scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4)
java.util.NoSuchElementException: key not found: 1
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4859) Improve StreamingListenerBus

2015-01-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4859:
-
Priority: Major  (was: Minor)

> Improve StreamingListenerBus
> 
>
> Key: SPARK-4859
> URL: https://issues.apache.org/jira/browse/SPARK-4859
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Shixiong Zhu
>
> Fix the race condition of `queueFullErrorMessageLogged`.
> Log the error from listener rather than crashing `listenerThread`.
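
A hedged sketch of the two changes being described (illustrative only, not the actual
patch):

{code}
import java.util.concurrent.atomic.AtomicBoolean

object ListenerBusSketch {
  // compareAndSet removes the check-then-act race on the "logged once" flag
  private val queueFullErrorMessageLogged = new AtomicBoolean(false)

  def onQueueFull(): Unit = {
    if (queueFullErrorMessageLogged.compareAndSet(false, true)) {
      System.err.println("Dropping event because the listener queue is full.")
    }
  }

  // catch and log listener failures instead of letting them kill listenerThread
  def postToAll(listeners: Seq[Any => Unit], event: Any): Unit = {
    listeners.foreach { listener =>
      try listener(event) catch {
        case t: Throwable => System.err.println("Listener threw an exception: " + t)
      }
    }
  }
}
{code}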



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4859) Improve StreamingListenerBus

2015-01-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4859:
-
Target Version/s: 1.3.0

> Improve StreamingListenerBus
> 
>
> Key: SPARK-4859
> URL: https://issues.apache.org/jira/browse/SPARK-4859
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Shixiong Zhu
>
> Fix the race condition of `queueFullErrorMessageLogged`.
> Log the error from listener rather than crashing `listenerThread`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4859) Improve StreamingListenerBus

2015-01-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4859:
-
Affects Version/s: 1.0.0

> Improve StreamingListenerBus
> 
>
> Key: SPARK-4859
> URL: https://issues.apache.org/jira/browse/SPARK-4859
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Fix the race condition of `queueFullErrorMessageLogged`.
> Log the error from listener rather than crashing `listenerThread`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5124) Standardize internal RPC interface

2015-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273885#comment-14273885
 ] 

Reynold Xin commented on SPARK-5124:


1. Let's put that outside of this PR (either leave it as an actor for now and 
follow up to change it to a loop, or submit a separate PR to change it to a 
loop before we merge the actor PR).

2. Yes - you don't necessarily need an alternative implementation, but making 
sure the current API design can indeed support alternative implementations is a 
good idea.


> Standardize internal RPC interface
> --
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Attachments: Pluggable RPC - draft 1.pdf
>
>
> In Spark we use Akka as the RPC layer. It would be great if we can 
> standardize the internal RPC interface to facilitate testing. This will also 
> provide the foundation to try other RPC implementations in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2909) Indexing for SparseVector in pyspark

2015-01-12 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273880#comment-14273880
 ] 

Manoj Kumar commented on SPARK-2909:


[~josephkb] Sorry for spamming your inbox, but just a heads up that I'm working 
on this. Will mostly submit a Pull Request by tomorrow.

> Indexing for SparseVector in pyspark
> 
>
> Key: SPARK-2909
> URL: https://issues.apache.org/jira/browse/SPARK-2909
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> SparseVector in pyspark does not currently support indexing, except by 
> examining the internal representation.  Though indexing is a pricy operation, 
> it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) 
> and operating on a single feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2584) Do not mutate block storage level on the UI

2015-01-12 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273877#comment-14273877
 ] 

Ilya Ganelin commented on SPARK-2584:
-

Understood, I was looking at the UI for Spark 1.1 and did not see the block 
storage level represented as MEMORY_AND_DISK or DISK_ONLY. It's now presented 
as Memory Deserialized or Disk Deserialized. I'll attempt to recreate this 
problem in the newer version of Spark but wanted to know if you've seen it 
since 1.0.1. 

> Do not mutate block storage level on the UI
> ---
>
> Key: SPARK-2584
> URL: https://issues.apache.org/jira/browse/SPARK-2584
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.1
>Reporter: Andrew Or
>
> If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes 
> DISK_ONLY on the UI. We should preserve the original storage level  proposed 
> by the user, in addition to the change in actual storage level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2584) Do not mutate block storage level on the UI

2015-01-12 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273871#comment-14273871
 ] 

Andrew Or commented on SPARK-2584:
--

When the in-memory cache is full, the RDD will be automatically dropped from 
memory to disk without the user explicitly calling anything. This is what I 
mean by drop it from memory.
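
For concreteness, a minimal sketch of the scenario (the input path is a placeholder):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object CacheExample {
  // Blocks of an RDD persisted MEMORY_AND_DISK can later be evicted from memory to
  // disk under memory pressure, without the user calling anything.
  def run(sc: SparkContext): Long = {
    val rdd = sc.textFile("/tmp/example-input").persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count()  // materializes the blocks; eviction to disk may happen afterwards
  }
}
{code}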

> Do not mutate block storage level on the UI
> ---
>
> Key: SPARK-2584
> URL: https://issues.apache.org/jira/browse/SPARK-2584
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.1
>Reporter: Andrew Or
>
> If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes 
> DISK_ONLY on the UI. We should preserve the original storage level  proposed 
> by the user, in addition to the change in actual storage level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2584) Do not mutate block storage level on the UI

2015-01-12 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273762#comment-14273762
 ] 

Ilya Ganelin commented on SPARK-2584:
-

Hi Andrew, question about this. When you say "we drop it from memory" what 
mechanism are you talking about? It's illegal to change the persistence level 
of an already persisted RDD and if you call unpersist() it's dropped from both 
memory and disk storage. How would an RDD be "dropped" from memory? 

> Do not mutate block storage level on the UI
> ---
>
> Key: SPARK-2584
> URL: https://issues.apache.org/jira/browse/SPARK-2584
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.1
>Reporter: Andrew Or
>
> If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes 
> DISK_ONLY on the UI. We should preserve the original storage level  proposed 
> by the user, in addition to the change in actual storage level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2584) Do not mutate block storage level on the UI

2015-01-12 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273762#comment-14273762
 ] 

Ilya Ganelin edited comment on SPARK-2584 at 1/12/15 4:47 PM:
--

Hi Andrew, question about this. When you say "we drop it from memory" what 
mechanism are you talking about? It's illegal to change the persistence level 
of an already persisted RDD and if you call unpersist() it's dropped from both 
memory and disk storage. How would an RDD be "dropped" from memory? I'm just 
trying to reproduce the issue before creating a fix. 


was (Author: ilganeli):
Hi Andrew, question about this. When you say "we drop it from memory" what 
mechanism are you talking about? It's illegal to change the persistence level 
of an already persisted RDD and if you call unpersist() it's dropped from both 
memory and disk storage. How would an RDD be "dropped" from memory? 

> Do not mutate block storage level on the UI
> ---
>
> Key: SPARK-2584
> URL: https://issues.apache.org/jira/browse/SPARK-2584
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.1
>Reporter: Andrew Or
>
> If a block is stored MEMORY_AND_DISK and we drop it from memory, it becomes 
> DISK_ONLY on the UI. We should preserve the original storage level  proposed 
> by the user, in addition to the change in actual storage level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273759#comment-14273759
 ] 

Apache Spark commented on SPARK-5205:
-

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4008

> Inconsistent behaviour between Streaming job and others, when click kill link 
> in WebUI
> --
>
> Key: SPARK-5205
> URL: https://issues.apache.org/jira/browse/SPARK-5205
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: uncleGen
>
> The "kill" link is used to kill a stage in a job. It works for any kind of 
> Spark job except Spark Streaming. To be specific, we can only kill the stage 
> which is used to run the "Receiver", but not kill the "Receivers" themselves. The 
> stage can be killed and cleaned from the UI, but the receivers are still 
> alive and receiving data. I think this does not fit with common sense. 
> IMHO, killing the "receiver" stage should mean killing the "receivers" and 
> stopping receiving data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI

2015-01-12 Thread uncleGen (JIRA)
uncleGen created SPARK-5205:
---

 Summary: Inconsistent behaviour between Streaming job and others, 
when click kill link in WebUI
 Key: SPARK-5205
 URL: https://issues.apache.org/jira/browse/SPARK-5205
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: uncleGen


The "kill" link is used to kill a stage in a job. It works for any kind of Spark 
job except Spark Streaming. To be specific, we can only kill the stage which is 
used to run the "Receiver", but not kill the "Receivers" themselves. The stage can be 
killed and cleaned from the UI, but the receivers are still alive and receiving 
data. I think this does not fit with common sense. IMHO, killing the 
"receiver" stage should mean killing the "receivers" and stopping receiving data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5164) YARN | Spark job submits from windows machine to a linux YARN cluster fail

2015-01-12 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273631#comment-14273631
 ] 

Kousuke Saruta commented on SPARK-5164:
---

This ticket is a duplicate of SPARK-1825, right?

> YARN | Spark job submits from windows machine to a linux YARN cluster fail
> --
>
> Key: SPARK-5164
> URL: https://issues.apache.org/jira/browse/SPARK-5164
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
> Environment: Spark submit from Windows 7
> YARN cluster on CentOS 6.5
>Reporter: Aniket Bhatnagar
>
> While submitting Spark jobs from a Windows machine to a Linux YARN cluster, 
> the jobs fail for the following reasons:
> 1. Commands and the classpath contain environment variables (like JAVA_HOME, PWD, 
> etc.), but they are added using Windows syntax (%JAVA_HOME%, %PWD%, etc.) instead 
> of Linux syntax ($JAVA_HOME, $PWD, etc.).
> 2. Paths in the launch environment are delimited by a semicolon instead of a colon. 
> This is because File.pathSeparator is used in YarnSparkHadoopUtil.
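
Hedged illustration of the two mismatches (names are illustrative, not the YARN
module's API): the references and separators need to follow the cluster's OS
conventions, not the submitting client's.

{code}
object ClusterEnvSyntax {
  // Build the reference in the syntax of the cluster OS, not the client OS.
  def envVarRef(name: String, clusterIsWindows: Boolean): String =
    if (clusterIsWindows) "%" + name + "%" else "$" + name

  // Use the cluster's path separator rather than the client-side File.pathSeparator.
  def classPathSeparator(clusterIsWindows: Boolean): String =
    if (clusterIsWindows) ";" else ":"
}
{code}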



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2015-01-12 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273609#comment-14273609
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Thanks Patrick

I 100% agree that Spark is _NOT just an API_, and in fact in our current efforts 
we are using much more of Spark than its user-facing API, but here is the thing: 
the reasons for extending the execution environment could be many, and indeed _RDD_ 
is a great extension point, just like _SparkContext_ is, to accomplish that. 
However, both are less than ideal, since they would require constant code 
modification, forcing _re-compilation and re-packaging_ of an application every 
time one wants to delegate to an alternative execution environment (regardless 
of the reasons).
But since we all seem to agree (based on previous comments) that _SparkContext_ 
is the right API-based extension point to address such extension requirements, 
then why not allow it to be extended via configuration as well? Merely a 
convenience without any harm . . . no different than a configuration-based 
“driver” model (e.g., JDBC).




> Allow for pluggable execution contexts in Spark
> ---
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as 
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large scale, batch 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal: 
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to Hadoop execution environment - as a non-public 
> api (@Experimental) not exposed to end users of Spark. 
> The trait will define 6 operations: 
> * hadoopFile 
> * newAPIHadoopFile 
> * broadcast 
> * runJob 
> * persist
> * unpersist
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have an option to provide custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> form DefaultExecutionContext. 
> Please see the attached design doc for more details. 
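
A hedged sketch of the proposed trait as listed above (signatures simplified for
illustration; not the design-doc API):

{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Gateway/delegate to the execution environment; default implementation would keep
// the existing SparkContext behaviour.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T], func: Iterator[T] => U): Array[U]
  def persist[T](rdd: RDD[T], level: StorageLevel): RDD[T]
  def unpersist[T](rdd: RDD[T]): RDD[T]
}
{code}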



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian

2015-01-12 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273565#comment-14273565
 ] 

Travis Galoppo commented on SPARK-5019:
---

[~lewuathe] Are you still interested in working on this ticket? SPARK-5018 is 
now complete.

> Update GMM API to use MultivariateGaussian
> --
>
> Key: SPARK-5019
> URL: https://issues.apache.org/jira/browse/SPARK-5019
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> The GaussianMixtureModel API should expose MultivariateGaussian instances 
> instead of the means and covariances.  This should be fixed as soon as 
> possible to stabilize the API.
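
A hedged sketch of the intended API shape (illustrative only; not the actual class, and
assuming the now-public MultivariateGaussian from SPARK-5018):

{code}
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian

// Expose the mixture components as distributions rather than parallel arrays
// of means and covariance matrices.
class GaussianMixtureModelSketch(
    val weights: Array[Double],
    val gaussians: Array[MultivariateGaussian]) {
  require(weights.length == gaussians.length, "one weight per component")
  def k: Int = weights.length
}
{code}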



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273564#comment-14273564
 ] 

Apache Spark commented on SPARK-5102:
-

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4007

> CompressedMapStatus needs to be registered with Kryo
> 
>
> Key: SPARK-5102
> URL: https://issues.apache.org/jira/browse/SPARK-5102
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Daniel Darabos
>Priority: Minor
>
> After upgrading from Spark 1.1.0 to 1.2.0 I got this exception:
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is 
> not registered: org.apache.spark.scheduler.CompressedMapStatus
> Note: To register this class use: 
> kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class);
>   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with 
> Kryo. I think this should be done in 
> {{spark/serializer/KryoSerializer.scala}}, unless instances of this class are 
> not expected to be sent over the wire. (Maybe I'm doing something wrong?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-12 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273561#comment-14273561
 ] 

Meethu Mathew commented on SPARK-5012:
--

I added a new class GaussianMixtureModel in clustering.py with a predict method
in it, and I am trying to pass a List of more than one dimension to the function
_py2java, but I am getting the exception

'list' object has no attribute '_get_object_id'

and when I give a tuple input (Vectors.dense([0.8786,
-0.7855]), Vectors.dense([-0.1863, 0.7799])) the exception is

'numpy.ndarray' object has no attribute '_get_object_id'. Can you help me to 
solve this?

My aim is to call predictSoft() in GaussianMixtureModel.scala from 
clustering.py by passing the values of weight, mean and sigma.

> Python API for Gaussian Mixture Model
> -
>
> Key: SPARK-5012
> URL: https://issues.apache.org/jira/browse/SPARK-5012
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Meethu Mathew
>
> Add Python API for the Scala implementation of GMM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5124) Standardize internal RPC interface

2015-01-12 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273560#comment-14273560
 ] 

Shixiong Zhu commented on SPARK-5124:
-

{quote}
1. Let's not rely on the property of local actor not passing messages through a 
socket for local actor speedup. Conceptually, there is no reason to tie local 
actor implementation to RPC. DAGScheduler's actor used to be a simple queue & 
event loop (before it was turned into an actor for no good reason). We can 
restore it to that.
{quote}
OK. I will change DAGScheduler actor to a simple event loop.

{quote}
2. Have you thought about how the fate sharing stuff would work with 
alternative RPC implementations?
{quote}

Just want to make sure we are thinking of the same thing: do you mean how to 
notify DisassociatedEvent in an alternative RPC implementation? If so, I'm 
thinking about how to extract it from the RPC layer, but I have not yet started on it.
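
For reference, a hedged sketch of the "simple queue & event loop" shape mentioned above
(illustrative only; not DAGScheduler code):

{code}
import java.util.concurrent.LinkedBlockingQueue

abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    override def run(): Unit = {
      while (!stopped) {
        try onReceive(queue.take()) catch {
          case _: InterruptedException =>            // woken up by stop()
          case t: Throwable => t.printStackTrace()   // log and keep the loop alive
        }
      }
    }
  }
  thread.setDaemon(true)

  protected def onReceive(event: E): Unit

  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(event)
  def stop(): Unit = { stopped = true; thread.interrupt() }
}
{code}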

> Standardize internal RPC interface
> --
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Attachments: Pluggable RPC - draft 1.pdf
>
>
> In Spark we use Akka as the RPC layer. It would be great if we can 
> standardize the internal RPC interface to facilitate testing. This will also 
> provide the foundation to try other RPC implementations in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-12 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273551#comment-14273551
 ] 

Valeriy Avanesov commented on SPARK-1405:
-

[~josephkb], I've read your proposal and I suggest considering Stochastic 
Gradient Langevin Dynamics [1]. It was shown to be ~100 times faster than Gibbs 
sampling [2]. Though, I'm not sure whether it's implementable in terms of RDDs.

[1] 
http://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf
[2] http://www.ics.uci.edu/~sungjia/icml2014_dist_v0.2.pdf

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms 
> in MLlib, which use optimization algorithms such as gradient descent, 
> LDA uses expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4859) Improve StreamingListenerBus

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273535#comment-14273535
 ] 

Apache Spark commented on SPARK-4859:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/4006

> Improve StreamingListenerBus
> 
>
> Key: SPARK-4859
> URL: https://issues.apache.org/jira/browse/SPARK-4859
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Fix the race condition of `queueFullErrorMessageLogged`.
> Log the error from listener rather than crashing `listenerThread`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5204) Column case need to be consistent with Hive

2015-01-12 Thread shengli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shengli closed SPARK-5204.
--
Resolution: Not a Problem

> Column case need to be consistent with Hive
> ---
>
> Key: SPARK-5204
> URL: https://issues.apache.org/jira/browse/SPARK-5204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: shengli
>Priority: Minor
> Fix For: 1.3.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Column case needs to be consistent with Hive.
> Hive 0.13 -> lower case



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5204) Column case need to be consistent with Hive

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273417#comment-14273417
 ] 

Apache Spark commented on SPARK-5204:
-

User 'OopsOutOfMemory' has created a pull request for this issue:
https://github.com/apache/spark/pull/4005

> Column case need to be consistent with Hive
> ---
>
> Key: SPARK-5204
> URL: https://issues.apache.org/jira/browse/SPARK-5204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: shengli
>Priority: Minor
> Fix For: 1.3.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Column case needs to be consistent with Hive.
> Hive 0.13 -> lower case



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5204) Column case need to be consistent with Hive

2015-01-12 Thread shengli (JIRA)
shengli created SPARK-5204:
--

 Summary: Column case need to be consistent with Hive
 Key: SPARK-5204
 URL: https://issues.apache.org/jira/browse/SPARK-5204
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
Priority: Minor
 Fix For: 1.3.0


Column case needs to be consistent with Hive.
Hive 0.13 -> lower case



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5203) union with different decimal type report error

2015-01-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273404#comment-14273404
 ] 

Apache Spark commented on SPARK-5203:
-

User 'guowei2' has created a pull request for this issue:
https://github.com/apache/spark/pull/4004

> union with different decimal type report error
> --
>
> Key: SPARK-5203
> URL: https://issues.apache.org/jira/browse/SPARK-5203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: guowei
>
> Cases like this:
> create table test (a decimal(10,1));
> select a from test union all select a*2 from test;
> 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union 
> all select a*2 from test]
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
> attributes: *, tree:
> 'Project [*]
>  'Subquery _u1
>   'Union 
>Project [a#1]
> MetastoreRelation default, test, None
>Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), 
> DecimalType())), DecimalType(21,1)) AS _c0#0]
> MetastoreRelation default, test, None
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369)
>   at 
> org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5203) union with different decimal type report error

2015-01-12 Thread guowei (JIRA)
guowei created SPARK-5203:
-

 Summary: union with different decimal type report error
 Key: SPARK-5203
 URL: https://issues.apache.org/jira/browse/SPARK-5203
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: guowei


Cases like this:
create table test (a decimal(10,1));
select a from test union all select a*2 from test;

15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union all 
select a*2 from test]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
'Project [*]
 'Subquery _u1
  'Union 
   Project [a#1]
MetastoreRelation default, test, None
   Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), 
DecimalType())), DecimalType(21,1)) AS _c0#0]
MetastoreRelation default, test, None

at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369)
at 
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org