[jira] [Created] (SPARK-8388) The script "docs/_plugins/copy_api_dirs.rb" should be run anywhere

2015-06-15 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-8388:


 Summary: The script "docs/_plugins/copy_api_dirs.rb" should be run 
anywhere
 Key: SPARK-8388
 URL: https://issues.apache.org/jira/browse/SPARK-8388
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: KaiXinXIaoLei
Priority: Minor
 Fix For: 1.4.1


The script "copy_api_dirs.rb" in spark/docs/_plugins should be run anywhere. 
But now, you have to be in "spark/docs", and run "ruby 
_plugins/copy_api_dirs.rb"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8129) Securely pass auth secrets to executors in standalone cluster mode

2015-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8129:
-
Priority: Minor  (was: Critical)

> Securely pass auth secrets to executors in standalone cluster mode
> --
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Minor
> Fix For: 1.5.0
>
>
> Currently, when authentication is turned on, the standalone cluster manager 
> passes auth secrets to executors (also drivers in cluster mode) as java 
> options on the command line, which isn't secure. The passed secret can be 
> seen by anyone running the 'ps' command, e.g.,
> bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker
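For illustration only (this is not the change made for this ticket): a minimal Scala 
sketch of reading the shared secret from an environment variable instead of a -D java 
option, so it never appears in `ps` output. The variable name SPARK_AUTH_SECRET is a 
hypothetical placeholder.

{code:title=SecretFromEnv.scala|borderStyle=solid}
object SecretFromEnv {
  def main(args: Array[String]): Unit = {
    // A -Dspark.authenticate.secret=... value is visible to anyone who can run `ps`;
    // a process environment variable is normally readable only by the process owner.
    val secret = sys.env.get("SPARK_AUTH_SECRET")            // hypothetical variable name
      .orElse(sys.props.get("spark.authenticate.secret"))    // insecure fallback, for comparison
      .getOrElse(sys.error("no authentication secret provided"))
    println(s"secret length: ${secret.length}")              // never log the secret itself
  }
}
{code}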



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8129) Securely pass auth secrets to executors in standalone cluster mode

2015-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8129:
-
Assignee: Kan Zhang

> Securely pass auth secrets to executors in standalone cluster mode
> --
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.5.0
>
>
> Currently, when authentication is turned on, the standalone cluster manager 
> passes auth secrets to executors (also drivers in cluster mode) as java 
> options on the command line, which isn't secure. The passed secret can be 
> seen by anyone running the 'ps' command, e.g.,
> bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8380) SparkR mis-counts

2015-06-15 Thread Rick Moritz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Moritz closed SPARK-8380.
--
Resolution: Invalid

I got my columns mixed up, late in the evening after a frustrating day with 
SparkR's documentation.
With the correct columns, the counts are equal for both expression types and 
via both platforms.

> SparkR mis-counts
> -
>
> Key: SPARK-8380
> URL: https://issues.apache.org/jira/browse/SPARK-8380
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Rick Moritz
>
> On my dataset of ~9 million rows x 30 columns, queried via Hive, I can 
> perform count operations on the entirety of the dataset and get the correct 
> value, as double-checked against the same code in Scala.
> When I start to add conditions or even do a simple partial ascending 
> histogram, I get discrepancies.
> In particular, there are missing values in SparkR, and massively so:
> A top-6 count of a certain feature in my dataset yields numbers an order of 
> magnitude smaller than what I get via Scala.
> The following logic, which I consider equivalent, is the basis for this report:
> counts<-summarize(groupBy(df, df$col_name), count = n(tdf$col_name))
> head(arrange(counts, desc(counts$count)))
> versus:
> val table = sql("SELECT col_name, count(col_name) as value from df  group by 
> col_name order by value desc")
> The first, in particular, is taken directly from the SparkR programming 
> guide. Since summarize isn't documented, as far as I can see, I'd hope it does 
> what the programming guide indicates. In that case this would be a pretty 
> serious logic bug (no errors are thrown). Otherwise, a lack of documentation 
> and a badly worded example in the guide may be behind my misperception of 
> SparkR's functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8129) Securely pass auth secrets to executors in standalone cluster mode

2015-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8129.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6774
[https://github.com/apache/spark/pull/6774]

> Securely pass auth secrets to executors in standalone cluster mode
> --
>
> Key: SPARK-8129
> URL: https://issues.apache.org/jira/browse/SPARK-8129
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Reporter: Kan Zhang
>Priority: Critical
> Fix For: 1.5.0
>
>
> Currently, when authentication is turned on, the standalone cluster manager 
> passes auth secrets to executors (also drivers in cluster mode) as java 
> options on the command line, which isn't secure. The passed secret can be 
> seen by anyone running the 'ps' command, e.g.,
> bq.  501 94787 94734   0  2:32PM ?? 0:00.78 
> /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre/bin/java 
> -cp 
> /Users/kan/github/spark/sbin/../conf/:/Users/kan/github/spark/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.3.0.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/kan/github/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
>  -Xms512M -Xmx512M 
> *-Dspark.authenticate.secret=090A030E0F0A0501090A0C0E0C0B03050D05* 
> -Dspark.driver.port=49625 -Dspark.authenticate=true -XX:MaxPermSize=128m 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> akka.tcp://sparkDriver@192.168.1.152:49625/user/CoarseGrainedScheduler 
> --executor-id 0 --hostname 192.168.1.152 --cores 8 --app-id 
> app-20150605143259- --worker-url 
> akka.tcp://sparkWorker@192.168.1.152:49623/user/Worker



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8367) ReliableKafka will loss data when `spark.streaming.blockInterval` was 0

2015-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8367.
--
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.1

> ReliableKafka will loss data when `spark.streaming.blockInterval` was 0
> ---
>
> Key: SPARK-8367
> URL: https://issues.apache.org/jira/browse/SPARK-8367
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>Assignee: SaintBacchus
> Fix For: 1.4.1, 1.5.0
>
>
> {code:title=BlockGenerator.scala|borderStyle=solid}
>   /** Change the buffer to which single records are added to. */
>   private def updateCurrentBuffer(time: Long): Unit = synchronized {
> try {
>   val newBlockBuffer = currentBuffer
>   currentBuffer = new ArrayBuffer[Any]
>   if (newBlockBuffer.size > 0) {
>val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
> val newBlock = new Block(blockId, newBlockBuffer)
> listener.onGenerateBlock(blockId)
> blocksForPushing.put(newBlock)  // put is blocking when queue is full
> logDebug("Last element in " + blockId + " is " + newBlockBuffer.last)
>   }
> } catch {
>   case ie: InterruptedException =>
> logInfo("Block updating timer thread was interrupted")
>   case e: Exception =>
> reportError("Error in block updating thread", e)
> }
>   }
> {code}
> If *spark.streaming.blockInterval* is 0, the *blockId* in the code will 
> always be the same, because *time* is 0 and *blockIntervalMs* is 0 too.
> {code:title=ReliableKafkaReceiver.scala|borderStyle=solid}
>private def rememberBlockOffsets(blockId: StreamBlockId): Unit = {
> // Get a snapshot of current offset map and store with related block id.
> val offsetSnapshot = topicPartitionOffsetMap.toMap
> blockOffsetMap.put(blockId, offsetSnapshot)
> topicPartitionOffsetMap.clear()
>   }
> {code}
> If the *blockId* is the same, Streaming will commit the *offset* before the 
> data is really consumed (the data is still waiting to be committed, but the 
> offset has already been updated and committed by the previous commit).
> So when an exception occurs, the *offset* has been committed but the data 
> will be lost, since it was still in memory and not yet consumed.
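A self-contained sketch (my reading of the description, using a stand-in case class 
rather than Spark's StreamBlockId) of why a zero block interval makes every generated 
block id identical, so later blocks overwrite the offsets remembered for earlier ones:

{code:title=BlockIdCollisionSketch.scala|borderStyle=solid}
// Stand-in for org.apache.spark.streaming.StreamBlockId, just to show the arithmetic.
case class StreamBlockId(receiverId: Int, uniqueId: Long)

object BlockIdCollisionSketch {
  def main(args: Array[String]): Unit = {
    val receiverId = 0
    val blockIntervalMs = 0L                    // spark.streaming.blockInterval = 0
    val times = Seq(0L, 0L, 0L)                 // per the description, time stays 0 as well
    val ids = times.map(t => StreamBlockId(receiverId, t - blockIntervalMs))
    println(ids.distinct.size)                  // 1: every block gets the same id, so
                                                // blockOffsetMap.put keeps overwriting
  }
}
{code}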



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8367) ReliableKafka will loss data when `spark.streaming.blockInterval` was 0

2015-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8367:
-
Assignee: SaintBacchus

> ReliableKafka will loss data when `spark.streaming.blockInterval` was 0
> ---
>
> Key: SPARK-8367
> URL: https://issues.apache.org/jira/browse/SPARK-8367
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.0
>Reporter: SaintBacchus
>Assignee: SaintBacchus
>
> {code:title=BlockGenerator.scala|borderStyle=solid}
>   /** Change the buffer to which single records are added to. */
>   private def updateCurrentBuffer(time: Long): Unit = synchronized {
> try {
>   val newBlockBuffer = currentBuffer
>   currentBuffer = new ArrayBuffer[Any]
>   if (newBlockBuffer.size > 0) {
>val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
> val newBlock = new Block(blockId, newBlockBuffer)
> listener.onGenerateBlock(blockId)
> blocksForPushing.put(newBlock)  // put is blocking when queue is full
> logDebug("Last element in " + blockId + " is " + newBlockBuffer.last)
>   }
> } catch {
>   case ie: InterruptedException =>
> logInfo("Block updating timer thread was interrupted")
>   case e: Exception =>
> reportError("Error in block updating thread", e)
> }
>   }
> {code}
> If *spark.streaming.blockInterval* is 0, the *blockId* in the code will 
> always be the same, because *time* is 0 and *blockIntervalMs* is 0 too.
> {code:title=ReliableKafkaReceiver.scala|borderStyle=solid}
>private def rememberBlockOffsets(blockId: StreamBlockId): Unit = {
> // Get a snapshot of current offset map and store with related block id.
> val offsetSnapshot = topicPartitionOffsetMap.toMap
> blockOffsetMap.put(blockId, offsetSnapshot)
> topicPartitionOffsetMap.clear()
>   }
> {code}
> If the *blockId* is the same, Streaming will commit the *offset* before the 
> data is really consumed (the data is still waiting to be committed, but the 
> offset has already been updated and committed by the previous commit).
> So when an exception occurs, the *offset* has been committed but the data 
> will be lost, since it was still in memory and not yet consumed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587527#comment-14587527
 ] 

Sean Owen commented on SPARK-8385:
--

What is the TFS file system? It sounds like you are missing a JAR that adds 
support for it to the Hadoop FileSystem somewhere.
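As a hedged illustration of that suggestion (example class names, and it may or may not 
sidestep the service-loader failure in this particular stack trace): Hadoop resolves a 
FileSystem implementation per URI scheme, and the mapping can be pinned explicitly in 
the configuration so a broken implementation found on the classpath is bypassed.

{code:title=FsSchemeCheck.scala|borderStyle=solid}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object FsSchemeCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Map file:// explicitly to the local filesystem instead of whatever
    // implementation the service loader picked up from the classpath.
    conf.set("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem")
    val fs = FileSystem.get(new URI("file:///"), conf)
    println(fs.getClass.getName)   // expect org.apache.hadoop.fs.LocalFileSystem
  }
}
{code}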

> java.lang.UnsupportedOperationException: Not implemented by the TFS 
> FileSystem implementation
> -
>
> Key: SPARK-8385
> URL: https://issues.apache.org/jira/browse/SPARK-8385
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.4.0
> Environment: RHEL 7.1
>Reporter: Peter Haumer
>
> I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I 
> created a launch configuration and just set the VM argument 
> "-Dspark.master=local[4]".
> With 1.4 this stopped working when reading files from the OS filesystem. 
> Running the same apps with spark-submit works fine.  Losing the ability to 
> debug that way has a major impact on the usability of Spark.
> The following exception is thrown:
> Exception in thread "main" java.lang.UnsupportedOperationException: Not 
> implemented by the TFS FileSystem implementation
> at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213)
> at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401)
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
> at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
> at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
> at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:900)
> at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357)
> at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46)
> at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8384) Can not set checkpointDuration or Interval in spark 1.3 and later

2015-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8384.
--
Resolution: Invalid

> Can not set checkpointDuration or Interval in spark 1.3 and later
> -
>
> Key: SPARK-8384
> URL: https://issues.apache.org/jira/browse/SPARK-8384
> Project: Spark
>  Issue Type: Bug
>Reporter: Norman He
>Priority: Critical
>
> StreamingContext missing setCheckpointDuration().
> No way around for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8384) Can not set checkpointDuration or Interval in spark 1.3 and later

2015-06-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587515#comment-14587515
 ] 

Saisai Shao commented on SPARK-8384:


Hi [~nhe150], I'm not sure why you need to set the checkpoint duration; it is 
the same as batchDuration internally, and setting it to an unintended value 
will lead to unexpected behavior. Also, I checked branch 1.2, and there seems 
to be no such API as {{setCheckpointDuration}}.
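As a hedged sketch of what is available instead (standard Spark Streaming API; the 
checkpoint directory and socket host/port are placeholders): the batch duration is 
fixed in the StreamingContext constructor, and the checkpoint interval of a stream 
can be tuned per DStream via DStream.checkpoint(interval).

{code:title=CheckpointIntervalSketch.scala|borderStyle=solid}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointIntervalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-interval-sketch")
    val ssc = new StreamingContext(conf, Seconds(2))     // batch duration
    ssc.checkpoint("/tmp/checkpoints")                   // checkpoint directory (placeholder)

    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
    lines.checkpoint(Seconds(10))                        // checkpoint this stream every 10s

    lines.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}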

> Can not set checkpointDuration or Interval in spark 1.3 and later
> -
>
> Key: SPARK-8384
> URL: https://issues.apache.org/jira/browse/SPARK-8384
> Project: Spark
>  Issue Type: Bug
>Reporter: Norman He
>Priority: Critical
>
> StreamingContext missing setCheckpointDuration().
> No way around for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7184) Investigate turning codegen on by default

2015-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7184.

   Resolution: Fixed
Fix Version/s: 1.5.0

> Investigate turning codegen on by default
> -
>
> Key: SPARK-7184
> URL: https://issues.apache.org/jira/browse/SPARK-7184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
> Fix For: 1.5.0
>
>
> If it is not the default, users get suboptimal performance out of the box, 
> and the codegen path falls behind the interpreted path over time.
> The best option might be to have only the codegen path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8206) math function: round

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587509#comment-14587509
 ] 

Apache Spark commented on SPARK-8206:
-

User 'zhichao-li' has created a pull request for this issue:
https://github.com/apache/spark/pull/6836

> math function: round
> 
>
> Key: SPARK-8206
> URL: https://issues.apache.org/jira/browse/SPARK-8206
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: zhichao-li
>
> round(double a): double
> Returns the rounded BIGINT value of a.
> round(double a, INT d): double
> Returns a rounded to d decimal places.
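A small Scala sketch of the semantics as I read them from the ticket (HALF_UP rounding 
is my assumption, not something stated here, and this is not Spark's implementation):

{code:title=RoundSemanticsSketch.scala|borderStyle=solid}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object RoundSemanticsSketch {
  // round(double a): the rounded BIGINT value of a
  def round(a: Double): Long =
    new JBigDecimal(a).setScale(0, RoundingMode.HALF_UP).longValue()

  // round(double a, int d): a rounded to d decimal places
  def round(a: Double, d: Int): Double =
    new JBigDecimal(a).setScale(d, RoundingMode.HALF_UP).doubleValue()

  def main(args: Array[String]): Unit = {
    println(round(2.5))         // 3
    println(round(3.14159, 2))  // 3.14
  }
}
{code}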



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks

2015-06-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587491#comment-14587491
 ] 

Yin Huai commented on SPARK-7837:
-

Seems https://www.mail-archive.com/user@spark.apache.org/msg30327.html is about 
the same issue.

> NPE when save as parquet in speculative tasks
> -
>
> Key: SPARK-7837
> URL: https://issues.apache.org/jira/browse/SPARK-7837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yin Huai
>Priority: Critical
>
> The query is like {{df.orderBy(...).saveAsTable(...)}}.
> When there are no partitioning columns and there is a skewed key, I found the 
> following exception in speculative tasks. After these failures, it seems we 
> could not call {{SparkHadoopMapRedUtil.commitTask}} correctly.
> {code}
> java.lang.NullPointerException
>   at 
> parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
>   at 
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
>   at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
>   at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115)
>   at 
> org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5680) Sum function on all null values, should return zero

2015-06-15 Thread Venkata Ramana G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587473#comment-14587473
 ] 

Venkata Ramana G commented on SPARK-5680:
-

Holman, you are right that a column with all NULL values should return NULL.
My motivation was to fix udaf_number_format.q: "select sum('a') from src" 
returns 0 in Hive and MySQL, while "select cast('a' as double) from src" 
returned NULL in Hive.
I assumed, or rather wrongly analysed, this as "sum of ALL NULLs returns 0", 
and that introduced the problem.
I apologize for this and will submit a patch to revert that fix. 

Why "select sum('a') from src" returns 0 in Hive and MySQL, which created this 
confusion, is still not clear.
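A tiny illustration (my own, not project code) of the standard SQL behaviour being 
argued for here: SUM ignores NULLs, but over an all-NULL column it yields NULL rather 
than 0.

{code:title=SumNullSemanticsSketch.scala|borderStyle=solid}
object SumNullSemanticsSketch {
  // NULL is modelled as None; SUM skips NULLs but returns NULL for all-NULL input.
  def sqlSum(values: Seq[Option[Double]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None else Some(present.sum)
  }

  def main(args: Array[String]): Unit = {
    println(sqlSum(Seq(None, None, None)))           // None  -> NULL
    println(sqlSum(Seq(Some(1.0), None, Some(2.0)))) // Some(3.0): NULLs are ignored
  }
}
{code}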


> Sum function on all null values, should return zero
> ---
>
> Key: SPARK-5680
> URL: https://issues.apache.org/jira/browse/SPARK-5680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Venkata Ramana G
>Assignee: Venkata Ramana G
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
>
> SELECT  sum('a'),  avg('a'),  variance('a'),  std('a') FROM src;
> Current output:
> NULL  NULLNULLNULL
> Expected output:
> 0.0   NULLNULLNULL
> This fixes hive udaf_number_format.q 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8281) udf_asin and udf_acos test failure

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8281:
---

Assignee: (was: Apache Spark)

> udf_asin and udf_acos test failure
> --
>
> Key: SPARK-8281
> URL: https://issues.apache.org/jira/browse/SPARK-8281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> acos/asin in Hive returns NaN for not a number, whereas we always return null.
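A quick check of the difference being described (plain Scala/Java math, not Spark SQL 
code): out-of-domain inputs already produce NaN.

{code:title=AsinAcosNaNCheck.scala|borderStyle=solid}
object AsinAcosNaNCheck {
  def main(args: Array[String]): Unit = {
    println(math.asin(2.0))        // NaN: 2.0 is outside [-1, 1]
    println(math.acos(-3.0))       // NaN
    println(math.asin(2.0).isNaN)  // true; Hive reports NaN here, Spark SQL returns null
  }
}
{code}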



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8280) udf7 failed due to null vs nan semantics

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8280:
---

Assignee: Apache Spark

> udf7 failed due to null vs nan semantics
> 
>
> Key: SPARK-8280
> URL: https://issues.apache.org/jira/browse/SPARK-8280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Blocker
>
> To execute
> {code}
> sbt/sbt -Phive -Dspark.hive.whitelist="udf7.*" "hive/test-only 
> org.apache.spark.sql.hive.execution.HiveCompatibilitySuite"
> {code}
> If we want to be consistent with Hive, we need to special case our log 
> function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8281) udf_asin and udf_acos test failure

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587462#comment-14587462
 ] 

Apache Spark commented on SPARK-8281:
-

User 'yijieshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6835

> udf_asin and udf_acos test failure
> --
>
> Key: SPARK-8281
> URL: https://issues.apache.org/jira/browse/SPARK-8281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> acos/asin in Hive returns NaN for not a number, whereas we always return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8280) udf7 failed due to null vs nan semantics

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587461#comment-14587461
 ] 

Apache Spark commented on SPARK-8280:
-

User 'yijieshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6835

> udf7 failed due to null vs nan semantics
> 
>
> Key: SPARK-8280
> URL: https://issues.apache.org/jira/browse/SPARK-8280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> To execute
> {code}
> sbt/sbt -Phive -Dspark.hive.whitelist="udf7.*" "hive/test-only 
> org.apache.spark.sql.hive.execution.HiveCompatibilitySuite"
> {code}
> If we want to be consistent with Hive, we need to special case our log 
> function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8280) udf7 failed due to null vs nan semantics

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8280:
---

Assignee: (was: Apache Spark)

> udf7 failed due to null vs nan semantics
> 
>
> Key: SPARK-8280
> URL: https://issues.apache.org/jira/browse/SPARK-8280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> To execute
> {code}
> sbt/sbt -Phive -Dspark.hive.whitelist="udf7.*" "hive/test-only 
> org.apache.spark.sql.hive.execution.HiveCompatibilitySuite"
> {code}
> If we want to be consistent with Hive, we need to special case our log 
> function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8281) udf_asin and udf_acos test failure

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8281:
---

Assignee: Apache Spark

> udf_asin and udf_acos test failure
> --
>
> Key: SPARK-8281
> URL: https://issues.apache.org/jira/browse/SPARK-8281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Blocker
>
> acos/asin in Hive returns NaN for not a number, whereas we always return null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8368) ClassNotFoundException in closure for map

2015-06-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587459#comment-14587459
 ] 

Yin Huai edited comment on SPARK-8368 at 6/16/15 4:35 AM:
--

[~zwChan] How was the application submitted?


was (Author: yhuai):
@CHEN Zhiwei How was the application submitted?

> ClassNotFoundException in closure for map 
> --
>
> Key: SPARK-8368
> URL: https://issues.apache.org/jira/browse/SPARK-8368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the 
> project on Windows 7 and run in a spark standalone cluster(or local) mode on 
> Centos 6.X. 
>Reporter: CHEN Zhiwei
>
> After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the 
> following exception:
> ==begin exception
> {quote}
> Exception in thread "main" java.lang.ClassNotFoundException: 
> com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:278)
>   at 
> org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
>  Source)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:293)
>   at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
>   at com.yhd.ycache.magic.Model$.main(SSExample.scala:239)
>   at com.yhd.ycache.magic.Model.main(SSExample.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}
> ===end exception===
> I simplified the code that causes this issue, as follows:
> ==begin code==
> {noformat}
> object Model extends Serializable{
>   def main(args: Array[String]) {
> val Array(sql) = args
> val sparkConf = new SparkConf().setAppName("Mode Example")
> val sc = new SparkContext(sparkConf)
> val hive = new HiveContext(sc)
> //get data by hive sql
> val rows = hive.sql(sql)
> val data = rows.map(r => { 
>   val arr = r.toSeq.toArray
>   val label = 1.0
>   def fmap = ( input: Any ) => 1.0
>   val feature = arr.map(_=>1.0)
>   LabeledPoint(label, Vectors.dense(feature))
> })
> data.count()
>   }
> }
> {noformat}
> =end code===
> This code runs fine in spark-shell, but fails when submitted to a Spark 
> cluster (standalone or local mode).  I tried the same code on Spark 
> 1.3.0 (local mode), and no exception was encountered.
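Purely as a hedged suggestion related to the question above (the jar path is a 
placeholder, and this is not a confirmed diagnosis): when a generated closure class 
cannot be found on the executors, one common cause is that the application jar was 
never shipped to the cluster; registering it explicitly on the SparkConf is one way to 
rule that out when not going through spark-submit.

{code:title=SubmitWithJarsSketch.scala|borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}

object SubmitWithJarsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Mode Example")
      .setJars(Seq("/path/to/application-assembly.jar"))  // placeholder jar path
    val sc = new SparkContext(conf)
    // Any closure-bearing job now finds its anonymous function classes in the shipped jar.
    println(sc.parallelize(1 to 10).map(_ * 2).count())
    sc.stop()
  }
}
{code}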



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: iss

[jira] [Commented] (SPARK-8368) ClassNotFoundException in closure for map

2015-06-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587459#comment-14587459
 ] 

Yin Huai commented on SPARK-8368:
-

@CHEN Zhiwei How was the application submitted?

> ClassNotFoundException in closure for map 
> --
>
> Key: SPARK-8368
> URL: https://issues.apache.org/jira/browse/SPARK-8368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the 
> project on Windows 7 and run in a spark standalone cluster(or local) mode on 
> Centos 6.X. 
>Reporter: CHEN Zhiwei
>
> After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the 
> following exception:
> ==begin exception
> {quote}
> Exception in thread "main" java.lang.ClassNotFoundException: 
> com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:278)
>   at 
> org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
>  Source)
>   at 
> com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
>  Source)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:293)
>   at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
>   at com.yhd.ycache.magic.Model$.main(SSExample.scala:239)
>   at com.yhd.ycache.magic.Model.main(SSExample.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}
> ===end exception===
> I simplified the code that causes this issue, as follows:
> ==begin code==
> {noformat}
> object Model extends Serializable{
>   def main(args: Array[String]) {
> val Array(sql) = args
> val sparkConf = new SparkConf().setAppName("Mode Example")
> val sc = new SparkContext(sparkConf)
> val hive = new HiveContext(sc)
> //get data by hive sql
> val rows = hive.sql(sql)
> val data = rows.map(r => { 
>   val arr = r.toSeq.toArray
>   val label = 1.0
>   def fmap = ( input: Any ) => 1.0
>   val feature = arr.map(_=>1.0)
>   LabeledPoint(label, Vectors.dense(feature))
> })
> data.count()
>   }
> }
> {noformat}
> =end code===
> This code runs fine in spark-shell, but fails when submitted to a Spark 
> cluster (standalone or local mode).  I tried the same code on Spark 
> 1.3.0 (local mode), and no exception was encountered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6724) Model import/export for FPGrowth

2015-06-15 Thread Hrishikesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583139#comment-14583139
 ] 

Hrishikesh edited comment on SPARK-6724 at 6/16/15 4:22 AM:


[~josephkb],  please assign this ticket to me.


was (Author: hrishikesh91):
[~josephkb], please assign this ticket to me.

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8387:
---

Assignee: (was: Apache Spark)

> [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
> -
>
> Key: SPARK-8387
> URL: https://issues.apache.org/jira/browse/SPARK-8387
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: SuYan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587421#comment-14587421
 ] 

Apache Spark commented on SPARK-8387:
-

User 'suyanNone' has created a pull request for this issue:
https://github.com/apache/spark/pull/6834

> [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
> -
>
> Key: SPARK-8387
> URL: https://issues.apache.org/jira/browse/SPARK-8387
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: SuYan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8387:
---

Assignee: Apache Spark

> [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
> -
>
> Key: SPARK-8387
> URL: https://issues.apache.org/jira/browse/SPARK-8387
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.4.0
>Reporter: SuYan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all

2015-06-15 Thread SuYan (JIRA)
SuYan created SPARK-8387:


 Summary: [SPARK][Web-UI] Only show 4096 bytes content for executor 
log instead all
 Key: SPARK-8387
 URL: https://issues.apache.org/jira/browse/SPARK-8387
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.4.0
Reporter: SuYan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7127) Broadcast spark.ml tree ensemble models for predict

2015-06-15 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587081#comment-14587081
 ] 

Bryan Cutler edited comment on SPARK-7127 at 6/16/15 3:47 AM:
--

Hi [~josephkb],

I made some changes and added broadcasting for all ensemble models, let me know 
what you think when you get a chance.  Thanks!


was (Author: bryanc):
Hi [~josephkb],

I added some commits that allow for broadcasting ensemble models in an 
unobtrusive way, let me know what you think when you get a chance.  Thanks!

> Broadcast spark.ml tree ensemble models for predict
> ---
>
> Key: SPARK-7127
> URL: https://issues.apache.org/jira/browse/SPARK-7127
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> GBTRegressor/Classifier and RandomForestRegressor/Classifier should broadcast 
> models and then predict.  This will mean overriding transform().
> Note: Try to reduce duplicated code via the TreeEnsembleModel abstraction.
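A hedged sketch of the broadcast-then-predict pattern the ticket asks for (a simplified 
stand-in model class, not the spark.ml tree ensembles themselves):

{code:title=BroadcastPredictSketch.scala|borderStyle=solid}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Stand-in for a (potentially large) tree ensemble model.
case class EnsembleModel(weights: Array[Double]) {
  def predict(features: Array[Double]): Double =
    features.zip(weights).map { case (f, w) => f * w }.sum
}

object BroadcastPredictSketch {
  def transform(sc: SparkContext, model: EnsembleModel,
                data: RDD[Array[Double]]): RDD[Double] = {
    // Ship the model to each executor once via a broadcast variable,
    // instead of capturing it in every task closure.
    val bcModel = sc.broadcast(model)
    data.map(features => bcModel.value.predict(features))
  }
}
{code}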



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package

2015-06-15 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-7888:
---
Assignee: holdenk

> Be able to disable intercept in Linear Regression in ML package
> ---
>
> Key: SPARK-7888
> URL: https://issues.apache.org/jira/browse/SPARK-7888
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: DB Tsai
>Assignee: holdenk
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7674) R-like stats for ML models

2015-06-15 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587376#comment-14587376
 ] 

holdenk commented on SPARK-7674:


I'd love to help with this if that's cool :)

> R-like stats for ML models
> --
>
> Key: SPARK-7674
> URL: https://issues.apache.org/jira/browse/SPARK-7674
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for supporting ML model summaries and statistics, 
> following the example of R's summary() and plot() functions.
> [Design 
> doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]
> From the design doc:
> {quote}
> R and its well-established packages provide extensive functionality for 
> inspecting a model and its results.  This inspection is critical to 
> interpreting, debugging and improving models.
> R is arguably a gold standard for a statistics/ML library, so this doc 
> largely attempts to imitate it.  The challenge we face is supporting similar 
> functionality, but on big (distributed) data.  Data size makes both efficient 
> computation and meaningful displays/summaries difficult.
> R model and result summaries generally take 2 forms:
> * summary(model): Display text with information about the model and results 
> on data
> * plot(model): Display plots about the model and results
> We aim to provide both of these types of information.  Visualization for the 
> plottable results will not be supported in MLlib itself, but we can provide 
> results in a form which can be plotted easily with other tools.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8275) HistoryServer caches incomplete App UIs

2015-06-15 Thread Carson Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587360#comment-14587360
 ] 

Carson Wang commented on SPARK-8275:


This seems to be the same issue as SPARK-7889.

> HistoryServer caches incomplete App UIs
> ---
>
> Key: SPARK-8275
> URL: https://issues.apache.org/jira/browse/SPARK-8275
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> The history server caches applications retrieved from the 
> {{ApplicationHistoryProvider.getAppUI()}} call for performance: it's 
> expensive to rebuild.
> However, this cache also includes incomplete applications, as well as 
> completed ones —and it never attempts to refresh the incomplete application.
> As a result, if you do a GET of the history of a running application, even 
> after the application is finished, you'll still get the web UI/history as it 
> was when that first GET was issued.
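One possible shape of a fix, sketched only to make the problem concrete (this is not 
the HistoryServer code): if the cache key records whether the application was complete 
when its UI was built, a later request for the now-complete application misses the 
stale entry instead of reusing it.

{code:title=UiCacheSketch.scala|borderStyle=solid}
import scala.collection.mutable

case class AppUi(appId: String, complete: Boolean, payload: String)

object UiCacheSketch {
  // Keyed by (appId, complete) rather than appId alone, so an entry cached while
  // the application was still running is not served once it has completed.
  private val cache = mutable.Map.empty[(String, Boolean), AppUi]

  def getAppUI(appId: String, nowComplete: Boolean): AppUi =
    cache.getOrElseUpdate((appId, nowComplete),
      AppUi(appId, nowComplete, s"rebuilt UI for $appId (complete=$nowComplete)"))
}
{code}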



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8386) DataFrame and JDBC regression

2015-06-15 Thread Peter Haumer (JIRA)
Peter Haumer created SPARK-8386:
---

 Summary: DataFrame and JDBC regression
 Key: SPARK-8386
 URL: https://issues.apache.org/jira/browse/SPARK-8386
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
 Environment: RHEL 7.1
Reporter: Peter Haumer
Priority: Critical


I have an ETL app that appends the new results found in each run to a JDBC 
table.  In 1.3.1 I did this:

testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);

When I do this now in 1.4, it complains that the "object" 'TABLE_NAME' already 
exists. I get this even if I switch the overwrite flag to true.  I also tried 
this:

testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, 
connectionProperties);

getting the same error. It works the first time, creating the new table and 
adding data successfully. But when running it a second time, it (the JDBC 
driver) tells me that the table already exists. Even SaveMode.Overwrite gives 
me the same error. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-15 Thread Peter Haumer (JIRA)
Peter Haumer created SPARK-8385:
---

 Summary: java.lang.UnsupportedOperationException: Not implemented 
by the TFS FileSystem implementation
 Key: SPARK-8385
 URL: https://issues.apache.org/jira/browse/SPARK-8385
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0
 Environment: RHEL 7.1
Reporter: Peter Haumer


I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I created 
a launch configuration and just set the VM argument "-Dspark.master=local[4]".
With 1.4 this stopped working when reading files from the OS filesystem. 
Running the same apps with spark-submit works fine.  Losing the ability to 
debug that way has a major impact on the usability of Spark.

The following exception is thrown:

Exception in thread "main" java.lang.UnsupportedOperationException: Not 
implemented by the TFS FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653)
at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389)
at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:900)
at org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357)
at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46)
at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60)







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8368) ClassNotFoundException in closure for map

2015-06-15 Thread CHEN Zhiwei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CHEN Zhiwei updated SPARK-8368:
---
Description: 
After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the 
following exception:
==begin exception
{quote}
Exception in thread "main" java.lang.ClassNotFoundException: 
com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at 
org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
at 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
 Source)
at 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
 Source)
at 
org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.map(RDD.scala:293)
at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
at com.yhd.ycache.magic.Model$.main(SSExample.scala:239)
at com.yhd.ycache.magic.Model.main(SSExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{quote}
===end exception===

I have simplified the code that causes this issue, as follows:
==begin code==
{noformat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object Model extends Serializable {
  def main(args: Array[String]) {
    val Array(sql) = args
    val sparkConf = new SparkConf().setAppName("Mode Example")
    val sc = new SparkContext(sparkConf)
    val hive = new HiveContext(sc)
    // get data by hive sql
    val rows = hive.sql(sql)

    val data = rows.map(r => {
      val arr = r.toSeq.toArray
      val label = 1.0
      // nested helper: compiles to an extra inner anonymous function class
      def fmap = (input: Any) => 1.0
      val feature = arr.map(_ => 1.0)
      LabeledPoint(label, Vectors.dense(feature))
    })

    data.count()
  }
}
{noformat}
=end code===
This code runs fine in spark-shell, but fails with the above exception when it is 
submitted to a Spark cluster (standalone or local mode). I tried the same code on 
Spark 1.3.0 (local mode) and no exception was encountered.
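
For comparison, here is a minimal sketch of the same pipeline with the unused nested 
`def fmap` lifted out of the map closure, so the function passed to map no longer 
defines its own inner anonymous function. Whether this actually avoids the missing 
`$$anonfun` class is an assumption of the sketch, not something established in this 
report.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object ModelNoNestedClosure extends Serializable {
  // moved out of the closure; previously `def fmap` lived inside rows.map
  def fmap(input: Any): Double = 1.0

  def main(args: Array[String]): Unit = {
    val Array(sql) = args
    val sc = new SparkContext(new SparkConf().setAppName("Mode Example"))
    val hive = new HiveContext(sc)
    val rows = hive.sql(sql)

    val data = rows.map { r =>
      val arr = r.toSeq.toArray
      val label = 1.0
      val feature = arr.map(_ => 1.0)
      LabeledPoint(label, Vectors.dense(feature))
    }

    data.count()
  }
}
{code}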

  was:
After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the 
following exception:
==begin exception
{quote}
Exception in thread "main" java.lang.ClassNotFoundException: 
com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at 
org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
at 
com.esotericsoftware.ref

[jira] [Commented] (SPARK-8281) udf_asin and udf_acos test failure

2015-06-15 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587333#comment-14587333
 ] 

Yijie Shen commented on SPARK-8281:
---

I'll take this

> udf_asin and udf_acos test failure
> --
>
> Key: SPARK-8281
> URL: https://issues.apache.org/jira/browse/SPARK-8281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> acos/asin in Hive returns NaN for not a number, whereas we always return null.
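
For illustration, a minimal spark-shell snippet of the case this sub-task is about, 
assuming a HiveContext named sqlContext; the expected behaviors in the comments come 
from the issue description (Hive: NaN, Spark SQL at the time: null) rather than from 
re-running the test here.

{code}
// a not-a-number case for acos/asin
sqlContext.sql("SELECT acos(cast('NaN' as double)), asin(cast('NaN' as double))").show()
// Per the issue description, Hive returns NaN here while Spark SQL returned
// null, which is why udf_acos/udf_asin fail in HiveCompatibilitySuite.
{code}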



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8280) udf7 failed due to null vs nan semantics

2015-06-15 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587335#comment-14587335
 ] 

Yijie Shen commented on SPARK-8280:
---

I'll take this

> udf7 failed due to null vs nan semantics
> 
>
> Key: SPARK-8280
> URL: https://issues.apache.org/jira/browse/SPARK-8280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Blocker
>
> To execute
> {code}
> sbt/sbt -Phive -Dspark.hive.whitelist="udf7.*" "hive/test-only 
> org.apache.spark.sql.hive.execution.HiveCompatibilitySuite"
> {code}
> If we want to be consistent with Hive, we need to special case our log 
> function.
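
As with the acos/asin sub-task above, the divergence presumably shows up for 
out-of-domain logarithm inputs (an assumption; the failing test output is not quoted 
here). A sketch, assuming a HiveContext named sqlContext; which engine returns null 
and which returns NaN for these inputs is exactly what the compatibility test 
compares, so it is not asserted in the comment.

{code}
// udf7 exercises Hive's log-family UDFs; for out-of-domain inputs one engine
// yields NaN and the other yields null, and HiveCompatibilitySuite compares
// the raw outputs verbatim.
sqlContext.sql("SELECT log(-1.0), log(0.0)").show()
{code}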



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8379:
---

Assignee: Apache Spark

> LeaseExpiredException when using dynamic partition with speculative execution
> -
>
> Key: SPARK-8379
> URL: https://issues.apache.org/jira/browse/SPARK-8379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: jeanlyn
>Assignee: Apache Spark
>
> When inserting into a table using dynamic partitions with 
> *spark.speculation=true*, and skewed data in some partitions triggers 
> speculative tasks, it throws an exception like:
> {code}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  Lease mismatch on 
> /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo
>  owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but 
> is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57
> {code}
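
A rough sketch of the scenario described above (not a verified reproduction): 
speculation enabled plus a dynamic-partition insert whose skewed partitions trigger 
speculative attempts writing the same ds=.../type=.../part-NNNNN file. The table and 
column names below are made up for illustration.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("dynamic-partition-speculation")
  .set("spark.speculation", "true")
val sc = new SparkContext(conf)
val hive = new HiveContext(sc)

hive.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
hive.sql(
  """INSERT OVERWRITE TABLE target_table PARTITION (ds, type)
    |SELECT col1, col2, ds, type FROM source_table""".stripMargin)
{code}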



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8379:
---

Assignee: (was: Apache Spark)

> LeaseExpiredException when using dynamic partition with speculative execution
> -
>
> Key: SPARK-8379
> URL: https://issues.apache.org/jira/browse/SPARK-8379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: jeanlyn
>
> When inserting into a table using dynamic partitions with 
> *spark.speculation=true*, and skewed data in some partitions triggers 
> speculative tasks, it throws an exception like:
> {code}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  Lease mismatch on 
> /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo
>  owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but 
> is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587322#comment-14587322
 ] 

Apache Spark commented on SPARK-8379:
-

User 'jeanlyn' has created a pull request for this issue:
https://github.com/apache/spark/pull/6833

> LeaseExpiredException when using dynamic partition with speculative execution
> -
>
> Key: SPARK-8379
> URL: https://issues.apache.org/jira/browse/SPARK-8379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.3.1, 1.4.0
>Reporter: jeanlyn
>
> When inserting into a table using dynamic partitions with 
> *spark.speculation=true*, and skewed data in some partitions triggers 
> speculative tasks, it throws an exception like:
> {code}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  Lease mismatch on 
> /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo
>  owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but 
> is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server

2015-06-15 Thread zhangyouhua (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587299#comment-14587299
 ] 

zhangyouhua commented on SPARK-6932:


@Qiping Li
In your design the PS client runs on the slave nodes, but where will the PS server 
run or be deployed?

> A Prototype of Parameter Server
> ---
>
> Key: SPARK-6932
> URL: https://issues.apache.org/jira/browse/SPARK-6932
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib, Spark Core
>Reporter: Qiping Li
>
>  h2. Introduction
> As specified in 
> [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be 
> very helpful to integrate a parameter server into Spark for machine learning 
> algorithms, especially for those with ultra-high-dimensional features. 
> After carefully studying the design doc of [Parameter 
> Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing],and
>  the paper of [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], 
> we proposed a prototype of Parameter Server on Spark(Ps-on-Spark), with 
> several key design concerns:
> * *User friendly interface*
>   Careful investigation was done of most existing Parameter Server 
> systems (including [petuum|http://petuum.github.io], [parameter 
> server|http://parameterserver.org], 
> [paracel|https://github.com/douban/paracel]), and a user-friendly interface was 
> designed by absorbing the essence of all these systems. 
> * *Prototype of distributed array*
> IndexRDD (see 
> [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem 
> to be a good option for a distributed array, because in most cases the number 
> of key updates per second is not very high. 
> So we implement a distributed HashMap to store the parameters, which can 
> be easily extended to get better performance.
> 
> * *Minimal code change*
>   Quite a lot of effort went into avoiding code changes to Spark core. Tasks 
> which need the parameter server are still created and scheduled by Spark's 
> scheduler. Tasks communicate with the parameter server through a client object, 
> over *akka* or *netty*.
> With all these concerns we propose the following architecture:
> h2. Architecture
> !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg!
> Data is stored in an RDD and is partitioned across workers. During each 
> iteration, each worker gets parameters from the parameter server, then computes 
> new parameters based on the old parameters and the data in its partition. Finally, 
> each worker pushes its parameter updates back to the parameter server. A worker 
> communicates with the parameter server through a parameter server client, which is 
> initialized in the `TaskContext` of that worker.
> The current implementation is based on YARN cluster mode, 
> but it should not be a problem to port it to other modes. 
> h3. Interface
> We referred to existing parameter server systems (petuum, parameter server, 
> paracel) when designing the interface of the parameter server. 
> *`PSClient` provides the following interface for workers to use:*
> {code}
> //  get parameter indexed by key from parameter server
> def get[T](key: String): T
> // get multiple parameters from parameter server
> def multiGet[T](keys: Array[String]): Array[T]
> // add parameter indexed by `key` by `delta`, 
> // if there are multiple `delta`s to apply to the same parameter,
> // use `reduceFunc` to reduce these `delta`s first.
> def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
> // update multiple parameters at the same time, using the same `reduceFunc`.
> def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => 
> T): Unit
> 
> // advance clock to indicate that current iteration is finished.
> def clock(): Unit
>  
> // block until all workers have reached this line of code.
> def sync(): Unit
> {code}
> *`PSContext` provides following functions to use on driver:*
> {code}
> // load parameters from existing rdd.
> def loadPSModel[T](model: RDD[(String, T)]) 
> // fetch parameters from parameter server to construct model.
> def fetchPSModel[T](keys: Array[String]): Array[T]
> {code} 
> 
> *A new function has been added to `RDD` to run parameter server tasks:*
> {code}
> // run the provided `func` on each partition of this RDD. 
> // This function can use data of this partition(the first argument) 
> // and a parameter server client(the second argument). 
> // See the following Logistic Regression for an example.
> def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U]
>
> {code}
> h2. Example
> Here is an example of using our prototype to implement logistic regression:
> {code:title=LogisticRegression.scala|borderStyle=solid}
> def train(
> sc: SparkContext,
> input: RDD[LabeledPoint],
> numIterations: In
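
To make the quoted interface more concrete, here is a compile-oriented sketch of how 
a worker might use the proposed PSClient (get/update/clock/sync) inside one 
iteration. These APIs are part of the proposal, not of Spark; the trait below is a 
stub that mirrors the quoted signatures so the usage reads on its own.

{code}
trait PSClient {
  def get[T](key: String): T
  def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit
  def clock(): Unit
  def sync(): Unit
}

// One gradient-style iteration for a single partition of (label, features)
// pairs, expressed against the proposed client.
def psIteration(partition: Array[(Double, Array[Double])], client: PSClient): Unit = {
  val weights = client.get[Array[Double]]("weights")
  val gradient = new Array[Double](weights.length)
  for ((label, features) <- partition) {
    val margin = (weights, features).zipped.map(_ * _).sum
    val err = margin - label
    for (i <- gradient.indices) gradient(i) += err * features(i)
  }
  // merge this partition's gradient with other workers' contributions
  client.update[Array[Double]]("gradient", gradient,
    (a, b) => (a, b).zipped.map(_ + _))
  client.clock() // end of this iteration
  client.sync()  // wait for all workers before reading updated weights
}
{code}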

[jira] [Commented] (SPARK-7633) Streaming Logistic Regression- Python bindings

2015-06-15 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587286#comment-14587286
 ] 

Mike Dusenberry commented on SPARK-7633:


I can work on this one!

> Streaming Logistic Regression- Python bindings
> --
>
> Key: SPARK-7633
> URL: https://issues.apache.org/jira/browse/SPARK-7633
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Yanbo Liang
>
> Add Python API for StreamingLogisticRegressionWithSGD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7555) User guide update for ElasticNet

2015-06-15 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-7555:
---
Assignee: Shuo Xiang  (was: DB Tsai)

> User guide update for ElasticNet
> 
>
> Key: SPARK-7555
> URL: https://issues.apache.org/jira/browse/SPARK-7555
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Shuo Xiang
>
> Copied from [SPARK-7443]:
> {quote}
> Now that we have algorithms in spark.ml which are not in spark.mllib, we 
> should start making subsections for the spark.ml API as needed. We can follow 
> the structure of the spark.mllib user guide.
> * The spark.ml user guide can provide: (a) code examples and (b) info on 
> algorithms which do not exist in spark.mllib.
> * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
> still the primary API, we should provide links to the corresponding 
> algorithms in the spark.mllib user guide for more info.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7685) Handle high imbalanced data and apply weights to different samples in Logistic Regression

2015-06-15 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-7685:
---
Assignee: Shuo Xiang

> Handle high imbalanced data and apply weights to different samples in 
> Logistic Regression
> -
>
> Key: SPARK-7685
> URL: https://issues.apache.org/jira/browse/SPARK-7685
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: DB Tsai
>Assignee: Shuo Xiang
>
> In a fraud detection dataset, almost all the samples are negative while only 
> a couple of them are positive. This type of highly imbalanced data will bias the 
> model toward the negative class, resulting in poor performance. In scikit-learn, 
> a correction is provided allowing users to over-/undersample the samples of each 
> class according to given weights; in auto mode, it selects weights inversely 
> proportional to the class frequencies in the training set. This can be done in a 
> more efficient way by multiplying the weights into the loss and gradient instead 
> of doing actual over-/undersampling of the training dataset, which is very 
> expensive.
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
> On the other hand, some of the training data may be more important, like the 
> training samples from tenured users, while the training samples from new users 
> may be less important. We should be able to provide an additional "weight: Double" 
> field in the LabeledPoint to weight samples differently in the learning 
> algorithm. 
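
A minimal sketch of the "multiply the weight into loss and gradient" idea described 
above, for a single logistic-regression sample. The (label, weight, features) triple 
mirrors the proposed extra "weight: Double" field; nothing here is Spark's actual 
implementation.

{code}
def weightedLogisticLossAndGradient(
    label: Double,            // 0.0 or 1.0
    weight: Double,           // per-sample weight (e.g. inverse class frequency)
    features: Array[Double],
    w: Array[Double]): (Double, Array[Double]) = {
  val margin = (w, features).zipped.map(_ * _).sum
  val p = 1.0 / (1.0 + math.exp(-margin))
  // unweighted log-loss and gradient, each scaled by the sample weight
  val loss = -weight * (label * math.log(p) + (1 - label) * math.log(1 - p))
  val gradient = features.map(x => weight * (p - label) * x)
  (loss, gradient)
}
{code}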



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8336) Fix NullPointerException with functions.rand()

2015-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8336.

  Resolution: Fixed
   Fix Version/s: 1.5.0
  1.4.1
Target Version/s:   (was: 1.5.0)

> Fix NullPointerException with functions.rand()
> --
>
> Key: SPARK-8336
> URL: https://issues.apache.org/jira/browse/SPARK-8336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ted Yu
>Assignee: Ted Yu
> Fix For: 1.4.1, 1.5.0
>
>
> The problem was first reported by Justin Yip in the thread 
> 'NullPointerException with functions.rand()'
> Here is how to reproduce the problem:
> {code}
> sqlContext.createDataFrame(Seq((1,2), (3, 100))).withColumn("index", 
> rand(30)).show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8384) Can not set checkpointDuration or Interval in spark 1.3 and later

2015-06-15 Thread Norman He (JIRA)
Norman He created SPARK-8384:


 Summary: Can not set checkpointDuration or Interval in spark 1.3 
and later
 Key: SPARK-8384
 URL: https://issues.apache.org/jira/browse/SPARK-8384
 Project: Spark
  Issue Type: Bug
Reporter: Norman He
Priority: Critical


StreamingContext is missing setCheckpointDuration().

There is no way around it for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7127) Broadcast spark.ml tree ensemble models for predict

2015-06-15 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587081#comment-14587081
 ] 

Bryan Cutler commented on SPARK-7127:
-

Hi [~josephkb],

I added some commits that allow for broadcasting ensemble models in an 
unobtrusive way, let me know what you think when you get a chance.  Thanks!

> Broadcast spark.ml tree ensemble models for predict
> ---
>
> Key: SPARK-7127
> URL: https://issues.apache.org/jira/browse/SPARK-7127
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> GBTRegressor/Classifier and RandomForestRegressor/Classifier should broadcast 
> models and then predict.  This will mean overriding transform().
> Note: Try to reduce duplicated code via the TreeEnsembleModel abstraction.
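
For reference, a sketch of the broadcast-then-predict pattern the issue asks for, 
written against mllib's RandomForestModel for concreteness rather than the spark.ml 
classes; how this plugs into transform() is left to the actual change.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

def broadcastPredict(
    sc: SparkContext,
    model: RandomForestModel,
    data: RDD[Vector]): RDD[Double] = {
  // ship the (potentially large) ensemble once per executor instead of
  // capturing it in every task's closure
  val bcModel = sc.broadcast(model)
  data.map(features => bcModel.value.predict(features))
}
{code}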



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!

2015-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587013#comment-14587013
 ] 

Sean Owen commented on SPARK-8335:
--

Go ahead and propose a PR. The sticky issue here is whether it's ok to change 
an experimental API at this point. I think so.

> DecisionTreeModel.predict() return type not convenient!
> ---
>
> Key: SPARK-8335
> URL: https://issues.apache.org/jira/browse/SPARK-8335
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Sebastian Walz
>Priority: Minor
>  Labels: easyfix, machine_learning
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method:
> def predict(features: JavaRDD[Vector]): JavaRDD[Double]
> The problem here is the type parameter of the return type JavaRDD[Double], 
> because it is a scala.Double where I would expect a java.lang.Double (to be 
> consistent, e.g., with 
> org.apache.spark.mllib.classification.ClassificationModel).
> I wanted to extend DecisionTreeModel, use it only for binary 
> classification, and implement the trait 
> org.apache.spark.mllib.classification.ClassificationModel. But that is not 
> possible because ClassificationModel already defines the predict method, 
> but with a return type of JavaRDD[java.lang.Double]. 
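
A minimal analogue (not Spark code) of the clash described above: the two predict 
signatures differ only in scala.Double versus java.lang.Double inside the generic, 
which Scala treats as different types, so one cannot implement the other.

{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector

trait JavaClassificationLike {
  def predict(features: JavaRDD[Vector]): JavaRDD[java.lang.Double]
}

abstract class TreeLike {
  def predict(features: JavaRDD[Vector]): JavaRDD[Double] // scala.Double
}

// Does not compile if uncommented: the two predict methods have the same
// erasure but different Scala types, so TreeLike's method cannot serve as
// the implementation of JavaClassificationLike.predict.
// class BinaryTreeModel extends TreeLike with JavaClassificationLike
{code}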



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed

2015-06-15 Thread Irina Easterling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586648#comment-14586648
 ] 

Irina Easterling commented on SPARK-8383:
-

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application 
completed
Steps to reproduce:
1. Install Spark thru Ambari Wizard
2. After installation run the Spark Pi Example
3. Navigate to your Spark directory:
baron1:~ # cd /usr/hdp/current/spark-client/
baron1:/usr/hdp/current/spark-client # su spark
spark@baron1:/usr/hdp/current/spark-client> spark-submit --verbose --class 
org.apache.spark.examples.SparkPi --master yarn-cluster 
--num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 
1 lib/spark-examples*.jar 10
4. Wait for the job to complete.
5. Go to the Ambari > Spark > Spark History Server UI
6. Click on the 'Show incomplete applications' link
7. View the result for the completed job
//Results
The Last Updated column shows the date/time as 1969/12/31 19:00:00 (screenshot attached)
8. Verify that Spark job completed in YARN. (screenshot attached)

There is also a discrepancy between the Spark History Server web UI and the 
YARN/ResourceManager web UI. The Spark job completed and is shown as such in the 
YARN/ResourceManager web UI, but in the Spark History Server web UI it shows as 
incomplete. 
See attached screenshots.


> Spark History Server shows Last Updated as 1969/12/31 when SparkPI 
> application completed 
> -
>
> Key: SPARK-8383
> URL: https://issues.apache.org/jira/browse/SPARK-8383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.3.1
> Environment: Spark1.3.1.2.3
>Reporter: Irina Easterling
> Attachments: Spark_WrongLastUpdatedDate.png, 
> YARN_SparkJobCompleted.PNG
>
>
> Spark History Server shows Last Updated as 1969/12/31 when SparkPI 
> application completed and Started Date is 2015/06/10 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed

2015-06-15 Thread Irina Easterling (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Easterling updated SPARK-8383:

Attachment: YARN_SparkJobCompleted.PNG

> Spark History Server shows Last Updated as 1969/12/31 when SparkPI 
> application completed 
> -
>
> Key: SPARK-8383
> URL: https://issues.apache.org/jira/browse/SPARK-8383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.3.1
> Environment: Spark1.3.1.2.3
>Reporter: Irina Easterling
> Attachments: Spark_WrongLastUpdatedDate.png, 
> YARN_SparkJobCompleted.PNG
>
>
> Spark History Server shows Last Updated as 1969/12/31 when SparkPI 
> application completed and Started Date is 2015/06/10 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed

2015-06-15 Thread Irina Easterling (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irina Easterling updated SPARK-8383:

Attachment: Spark_WrongLastUpdatedDate.png

> Spark History Server shows Last Updated as 1969/12/31 when SparkPI 
> application completed 
> -
>
> Key: SPARK-8383
> URL: https://issues.apache.org/jira/browse/SPARK-8383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.3.1
> Environment: Spark1.3.1.2.3
>Reporter: Irina Easterling
> Attachments: Spark_WrongLastUpdatedDate.png
>
>
> Spark History Server shows Last Updated as 1969/12/31 when SparkPI 
> application completed and Started Date is 2015/06/10 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed

2015-06-15 Thread Irina Easterling (JIRA)
Irina Easterling created SPARK-8383:
---

 Summary: Spark History Server shows Last Updated as 1969/12/31 
when SparkPI application completed 
 Key: SPARK-8383
 URL: https://issues.apache.org/jira/browse/SPARK-8383
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.3.1
 Environment: Spark 1.3.1.2.3
Reporter: Irina Easterling


Spark History Server shows Last Updated as 1969/12/31 when SparkPI application 
completed and Started Date is 2015/06/10 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-06-15 Thread Daniel LaBar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586616#comment-14586616
 ] 

Daniel LaBar commented on SPARK-6220:
-

Ok, I'll create a new JIRA with a reference to this one.

Thanks for checking the commit.  Our IT security team only gives us AWS keys 
for a "service account", but we don't have access to EC2, EMR, S3, etc. from 
this account.  In order to do anything useful we have to switch roles using the 
service account credentials and MFA.  But the Spark EC2 script doesn't seem to 
work with anything other than the AWS key/secret.  So I use the service account 
credentials to create an EC2 instance with an IAM profile that can do useful 
things.  I SSH into that EC2 instance, and then launch the EC2 Spark cluster 
from there using the modified spark_ec2.py script.

> Allow extended EC2 options to be passed through spark-ec2
> -
>
> Key: SPARK-6220
> URL: https://issues.apache.org/jira/browse/SPARK-6220
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are many EC2 options exposed by the boto library that spark-ec2 uses. 
> Over time, many of these EC2 options have been bubbled up here and there to 
> become spark-ec2 options.
> Examples:
> * spot prices
> * placement groups
> * VPC, subnet, and security group assignments
> It's likely that more and more EC2 options will trickle up like this to 
> become spark-ec2 options.
> While major options are well suited to this type of promotion, we should 
> probably allow users to pass through EC2 options they want to use through 
> spark-ec2 in some generic way.
> Let's add two options:
> * {{--ec2-instance-option}} -> 
> [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
> * {{--ec2-spot-instance-option}} -> 
> [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
> Each option can be specified multiple times and is simply passed directly to 
> the underlying boto call.
> For example:
> {code}
> spark-ec2 \
> ...
> --ec2-instance-option "instance_initiated_shutdown_behavior=terminate" \
> --ec2-instance-option "ebs_optimized=True"
> {code}
> I'm not sure about the exact syntax of the extended options, but something 
> like this will do the trick as long as it can be made to pass the options 
> correctly to boto in most cases.
> I followed the example of {{ssh}}, which supports multiple extended options 
> similarly.
> {code}
> ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6583) Support aggregated function in order by

2015-06-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6583.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6816
[https://github.com/apache/spark/pull/6816]

> Support aggregated function in order by
> ---
>
> Key: SPARK-6583
> URL: https://issues.apache.org/jira/browse/SPARK-6583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Yadong Qi
>Assignee: Yadong Qi
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8382) Improve Analysis Unit test framework

2015-06-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-8382:
---

 Summary: Improve Analysis Unit test framework
 Key: SPARK-8382
 URL: https://issues.apache.org/jira/browse/SPARK-8382
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust


We have some nice frameworks for doing various unit tests ({{checkAnswer}}, 
{{comparePlan}}, {{checkEvaluation}}, etc.).  However, {{AnalysisSuite}} is kind 
of sloppy, with each test using assertions in different ways.  I'd like a 
function that looks something like the following:

{code}
def checkAnalysis(
  inputPlan: LogicalPlan,
  expectedPlan: LogicalPlan = null,
  caseInsensitiveOnly: Boolean = false,
  expectedErrors: Seq[String] = Nil)
{code}

This function should construct tests that check the Analyzer works as expected 
and provides useful error messages when any failures are encountered.  We 
should then rewrite the existing tests and beef up our coverage here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-06-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586334#comment-14586334
 ] 

Nicholas Chammas commented on SPARK-6220:
-

> please forgive my greenness

No need. Greenness is not a crime around these parts. :)

I suggest creating a new JIRA for that specific feature. In the JIRA you can 
reference this issue here as related.

By the way, I took a look at your commit. If I understood correctly, your 
change associates launched instances with an IAM profile (allowing the launched 
cluster to, for example, access S3 without credentials), but the machine you 
are running spark-ec2 from still needs AWS keys to launch them.

That seems fine to me, but it doesn't sound exactly like what you intended from 
your comment.

> Allow extended EC2 options to be passed through spark-ec2
> -
>
> Key: SPARK-6220
> URL: https://issues.apache.org/jira/browse/SPARK-6220
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are many EC2 options exposed by the boto library that spark-ec2 uses. 
> Over time, many of these EC2 options have been bubbled up here and there to 
> become spark-ec2 options.
> Examples:
> * spot prices
> * placement groups
> * VPC, subnet, and security group assignments
> It's likely that more and more EC2 options will trickle up like this to 
> become spark-ec2 options.
> While major options are well suited to this type of promotion, we should 
> probably allow users to pass through EC2 options they want to use through 
> spark-ec2 in some generic way.
> Let's add two options:
> * {{--ec2-instance-option}} -> 
> [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
> * {{--ec2-spot-instance-option}} -> 
> [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
> Each option can be specified multiple times and is simply passed directly to 
> the underlying boto call.
> For example:
> {code}
> spark-ec2 \
> ...
> --ec2-instance-option "instance_initiated_shutdown_behavior=terminate" \
> --ec2-instance-option "ebs_optimized=True"
> {code}
> I'm not sure about the exact syntax of the extended options, but something 
> like this will do the trick as long as it can be made to pass the options 
> correctly to boto in most cases.
> I followed the example of {{ssh}}, which supports multiple extended options 
> similarly.
> {code}
> ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type

2015-06-15 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-8381:

Description: This method CatalystTypeConverters.convertToCatalyst is slow, 
so for batch conversion we should be using converter produced by 
createToCatalystConverter.  (was: This method 
CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we 
should be using converter produced by createToCatalystConverter.)

> reuse typeConvert when convert Seq[Row] to catalyst type
> 
>
> Key: SPARK-8381
> URL: https://issues.apache.org/jira/browse/SPARK-8381
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Lianhui Wang
>
> This method CatalystTypeConverters.convertToCatalyst is slow, so for batch 
> conversion we should be using converter produced by createToCatalystConverter.
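
A sketch of the reuse pattern the issue describes, assuming code that can see the 
private[sql] CatalystTypeConverters API (i.e. it lives under org.apache.spark.sql): 
build one converter for the schema up front and apply it to every Row, instead of 
calling convertToCatalyst per value.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.types.StructType

def convertRows(rows: Seq[Row], schema: StructType): Seq[Any] = {
  val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)
  rows.map(toCatalyst) // the per-type converter tree is built only once
}
{code}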



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8381:
---

Assignee: Apache Spark

> reuse typeConvert when convert Seq[Row] to catalyst type
> 
>
> Key: SPARK-8381
> URL: https://issues.apache.org/jira/browse/SPARK-8381
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Lianhui Wang
>Assignee: Apache Spark
>
> This method CatalystTypeConverters.convertToCatalyst is slow, and for batch 
> conversion we should be using converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8381:
---

Assignee: (was: Apache Spark)

> reuse typeConvert when convert Seq[Row] to catalyst type
> 
>
> Key: SPARK-8381
> URL: https://issues.apache.org/jira/browse/SPARK-8381
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Lianhui Wang
>
> This method CatalystTypeConverters.convertToCatalyst is slow, and for batch 
> conversion we should be using converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586330#comment-14586330
 ] 

Apache Spark commented on SPARK-8381:
-

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/6831

> reuse typeConvert when convert Seq[Row] to catalyst type
> 
>
> Key: SPARK-8381
> URL: https://issues.apache.org/jira/browse/SPARK-8381
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Lianhui Wang
>
> This method CatalystTypeConverters.convertToCatalyst is slow, and for batch 
> conversion we should be using converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type

2015-06-15 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-8381:

Summary: reuse typeConvert when convert Seq[Row] to catalyst type  (was: 
reuse-typeConvert when convert Seq[Row] to catalyst type)

> reuse typeConvert when convert Seq[Row] to catalyst type
> 
>
> Key: SPARK-8381
> URL: https://issues.apache.org/jira/browse/SPARK-8381
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Lianhui Wang
>
> This method CatalystTypeConverters.convertToCatalyst is slow, and for batch 
> conversion we should be using converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to catalyst type

2015-06-15 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-8381:

Summary: reuse-typeConvert when convert Seq[Row] to catalyst type  (was: 
reuse-typeConvert when convert Seq[Row] to CatalystType)

> reuse-typeConvert when convert Seq[Row] to catalyst type
> 
>
> Key: SPARK-8381
> URL: https://issues.apache.org/jira/browse/SPARK-8381
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Lianhui Wang
>
> This method CatalystTypeConverters.convertToCatalyst is slow, and for batch 
> conversion we should be using converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to CatalystType

2015-06-15 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-8381:
---

 Summary: reuse-typeConvert when convert Seq[Row] to CatalystType
 Key: SPARK-8381
 URL: https://issues.apache.org/jira/browse/SPARK-8381
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Lianhui Wang


This method CatalystTypeConverters.convertToCatalyst is slow, and for batch 
conversion we should be using converter produced by createToCatalystConverter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8322) EC2 script not fully updated for 1.4.0 release

2015-06-15 Thread Mark Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Smith closed SPARK-8322.
-

Thanks for making my first PR so painless guys.

> EC2 script not fully updated for 1.4.0 release
> --
>
> Key: SPARK-8322
> URL: https://issues.apache.org/jira/browse/SPARK-8322
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Mark Smith
>Assignee: Mark Smith
>  Labels: easyfix
> Fix For: 1.4.1, 1.5.0
>
>
> In the spark_ec2.py script, the "1.4.0" spark version hasn't been added to 
> the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to 
> break for the latest release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2015-06-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586280#comment-14586280
 ] 

Josh Rosen commented on SPARK-7721:
---

We now have the Jenkins HTML publisher plugin installed, so we can easily 
publish HTML reports from tools such as coverage.py 
(https://wiki.jenkins-ci.org/display/JENKINS/HTML+Publisher+Plugin).  I might 
give this a try on NewSparkPullRequestBuilder today. 

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> Would be great to have test coverage report for Python. Compared with Scala, 
> it is tricker to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8380) SparkR mis-counts

2015-06-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586277#comment-14586277
 ] 

Shivaram Venkataraman commented on SPARK-8380:
--

[~RPCMoritz] Couple of things that would be interesting to see 

1. Does the `sql` command in SparkR work correctly ?
2. Can you try the dataframe statements in Scala and see what results you get ?

cc [~rxin]
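
For reference, a sketch of the Scala DataFrame equivalent of the SparkR aggregation 
in the report (assuming the same Hive-backed table is available as a DataFrame named 
`df`), which could be compared against both the SparkR result and the raw SQL:

{code}
import org.apache.spark.sql.functions.desc

val counts = df.groupBy("col_name").count()   // adds a "count" column
counts.orderBy(desc("count")).show(6)         // top 6, analogous to head(arrange(...))
{code}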

> SparkR mis-counts
> -
>
> Key: SPARK-8380
> URL: https://issues.apache.org/jira/browse/SPARK-8380
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Rick Moritz
>
> On my dataset of ~9 Million rows x 30 columns, queried via Hive, I can 
> perform count operations on the entirety of the dataset and get the correct 
> value, as double checked against the same code in scala.
> When I start to add conditions or even do a simple partial ascending 
> histogram, I get discrepancies.
> In particular, there are missing values in SparkR, and massively so:
> A top 6 count of a certain feature in my dataset results in numbers an order of 
> magnitude smaller than I get via Scala.
> The following logic, which I consider equivalent, is the basis for this report:
> counts<-summarize(groupBy(df, df$col_name), count = n(tdf$col_name))
> head(arrange(counts, desc(counts$count)))
> versus:
> val table = sql("SELECT col_name, count(col_name) as value from df  group by 
> col_name order by value desc")
> The first, in particular, is taken directly from the SparkR programming 
> guide. Since summarize isn't documented from what I can see, I'd hope it does 
> what the programming guide indicates. In that case this would be a pretty 
> serious logic bug (no errors are thrown). Otherwise, there's the possibility 
> of a lack of documentation and badly worded example in the guide being behind 
> my misperception of SparkR's functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-06-15 Thread Daniel LaBar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586276#comment-14586276
 ] 

Daniel LaBar commented on SPARK-6220:
-

[~nchammas], I also need IAM support and [made a few changes to 
spark_ec2.py|https://github.com/dnlbrky/spark/commit/5d4a9c65728245dc501c2a7c479ca27b6f685bd8],
 including an {{--instance-profile-name}} option.  These modifications let me 
successfully create security groups and the master/slaves without specifying an 
access key and secret, but I'm still having issues getting Hadoop/Yarn setup so 
it may require further changes.  Please let me know if you have suggestions.

This would be my first time contributing to an Apache project and I'm new to 
Spark/Python, so please forgive my greenness... Should I create another JIRA 
specifically to add instance profile support, or can I reference this JIRA when 
submitting a pull request?

> Allow extended EC2 options to be passed through spark-ec2
> -
>
> Key: SPARK-6220
> URL: https://issues.apache.org/jira/browse/SPARK-6220
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are many EC2 options exposed by the boto library that spark-ec2 uses. 
> Over time, many of these EC2 options have been bubbled up here and there to 
> become spark-ec2 options.
> Examples:
> * spot prices
> * placement groups
> * VPC, subnet, and security group assignments
> It's likely that more and more EC2 options will trickle up like this to 
> become spark-ec2 options.
> While major options are well suited to this type of promotion, we should 
> probably allow users to pass through EC2 options they want to use through 
> spark-ec2 in some generic way.
> Let's add two options:
> * {{--ec2-instance-option}} -> 
> [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
> * {{--ec2-spot-instance-option}} -> 
> [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
> Each option can be specified multiple times and is simply passed directly to 
> the underlying boto call.
> For example:
> {code}
> spark-ec2 \
> ...
> --ec2-instance-option "instance_initiated_shutdown_behavior=terminate" \
> --ec2-instance-option "ebs_optimized=True"
> {code}
> I'm not sure about the exact syntax of the extended options, but something 
> like this will do the trick as long as it can be made to pass the options 
> correctly to boto in most cases.
> I followed the example of {{ssh}}, which supports multiple extended options 
> similarly.
> {code}
> ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-06-15 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586266#comment-14586266
 ] 

Igor Berman commented on SPARK-4879:


I'm experiencing this issue. Sometimes an RDD with 4 partitions is written with only 
3 part files, yet the _SUCCESS marker is there.

> Missing output partitions after job completes with speculative execution
> 
>
> Key: SPARK-4879
> URL: https://issues.apache.org/jira/browse/SPARK-4879
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.3.0
>
> Attachments: speculation.txt, speculation2.txt
>
>
> When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
> save output files may report that they have completed successfully even 
> though some output partitions written by speculative tasks may be missing.
> h3. Reproduction
> This symptom was reported to me by a Spark user and I've been doing my own 
> investigation to try to come up with an in-house reproduction.
> I'm still working on a reliable local reproduction for this issue, which is a 
> little tricky because Spark won't schedule speculated tasks on the same host 
> as the original task, so you need an actual (or containerized) multi-host 
> cluster to test speculation.  Here's a simple reproduction of some of the 
> symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
> spark.speculation=true}}:
> {code}
> // Rig a job such that all but one of the tasks complete instantly
> // and one task runs for 20 seconds on its first attempt and instantly
> // on its second attempt:
> val numTasks = 100
> sc.parallelize(1 to numTasks, 
> numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
>   if (ctx.partitionId == 0) {  // If this is the one task that should run 
> really slow
> if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
>  Thread.sleep(20 * 1000)
> }
>   }
>   iter
> }.map(x => (x, x)).saveAsTextFile("/test4")
> {code}
> When I run this, I end up with a job that completes quickly (due to 
> speculation) but reports failures from the speculated task:
> {code}
> [...]
> 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
> 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
> (100/100)
> 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
> :22) finished in 0.856 s
> 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
> :22, took 0.885438374 s
> 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
> for 70.1 in stage 3.0 because task 70 has already completed successfully
> scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
> stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
> java.io.IOException: Failed to save output of task: 
> attempt_201412110141_0003_m_49_413
> 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
> 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
> 
> org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> One interesting thing to note about this stack trace: if we look at 
> {{FileOutputCommitter.java:160}} 
> ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]),
>  this point in the execution seems to correspond to a case where a task 
> completes, attempts to commit its output, fails for some reason, then deletes 
> the destination file, tries again, and fails:
> {code}
>  if (fs.isFile(taskOutput)) {
> 152  Path finalOutputPath = getFinal

[jira] [Commented] (SPARK-8380) SparkR mis-counts

2015-06-15 Thread Rick Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586262#comment-14586262
 ] 

Rick Moritz commented on SPARK-8380:


I will attempt to reproduce this with an alternate dataset asap, but getting 
large volume datasets into this cluster is difficult.

> SparkR mis-counts
> -
>
> Key: SPARK-8380
> URL: https://issues.apache.org/jira/browse/SPARK-8380
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Rick Moritz
>
> On my dataset of ~9 million rows x 30 columns, queried via Hive, I can 
> perform count operations on the entirety of the dataset and get the correct 
> value, as double-checked against the same code in Scala.
> When I start to add conditions or even do a simple partial ascending 
> histogram, I get discrepancies.
> In particular, there are missing values in SparkR, and massively so:
> a top-6 count of a certain feature in my dataset yields numbers an order of 
> magnitude smaller than I get via Scala.
> The following logic, which I consider equivalent, is the basis for this report:
> counts<-summarize(groupBy(df, df$col_name), count = n(tdf$col_name))
> head(arrange(counts, desc(counts$count)))
> versus:
> val table = sql("SELECT col_name, count(col_name) as value from df  group by 
> col_name order by value desc")
> The first, in particular, is taken directly from the SparkR programming 
> guide. Since summarize isn't documented from what I can see, I'd hope it does 
> what the programming guide indicates. In that case this would be a pretty 
> serious logic bug (no errors are thrown). Otherwise, there's the possibility 
> that a lack of documentation and a badly worded example in the guide are 
> behind my misperception of SparkR's functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8380) SparkR mis-counts

2015-06-15 Thread Rick Moritz (JIRA)
Rick Moritz created SPARK-8380:
--

 Summary: SparkR mis-counts
 Key: SPARK-8380
 URL: https://issues.apache.org/jira/browse/SPARK-8380
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Rick Moritz


On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform 
count operations on the entirety of the dataset and get the correct value, as 
double-checked against the same code in Scala.
When I start to add conditions or even do a simple partial ascending histogram, 
I get discrepancies.

In particular, there are missing values in SparkR, and massively so:
a top-6 count of a certain feature in my dataset yields numbers an order of 
magnitude smaller than I get via Scala.

The following logic, which I consider equivalent, is the basis for this report:

counts<-summarize(groupBy(df, df$col_name), count = n(tdf$col_name))
head(arrange(counts, desc(counts$count)))

versus:

val table = sql("SELECT col_name, count(col_name) as value from df  group by 
col_name order by value desc")

The first, in particular, is taken directly from the SparkR programming guide. 
Since summarize isn't documented from what I can see, I'd hope it does what the 
programming guide indicates. In that case this would be a pretty serious logic 
bug (no errors are thrown). Otherwise, there's the possibility that a lack of 
documentation and a badly worded example in the guide are behind my 
misperception of SparkR's functionality.
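
For comparison, here is a minimal Scala cross-check of the same aggregation, assuming a 
spark-shell where sqlContext can see the same table under the name "df" (the table and 
column names are taken from the report; this only illustrates the comparison, not the 
SparkR code path):

{code}
import org.apache.spark.sql.functions.{count, desc}

// DataFrame API: group, count and sort, mirroring the SparkR summarize/arrange pipeline.
val df = sqlContext.table("df")
val viaApi = df.groupBy("col_name")
  .agg(count("col_name").as("value"))
  .orderBy(desc("value"))

// Plain SQL, as in the report; both should return identical top rows.
val viaSql = sqlContext.sql(
  "SELECT col_name, count(col_name) AS value FROM df GROUP BY col_name ORDER BY value DESC")

viaApi.show(6)
viaSql.show(6)
{code}

If the two outputs above agree with each other but not with SparkR, that points at the 
SparkR side rather than the underlying data.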



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!

2015-06-15 Thread Sebastian Walz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586248#comment-14586248
 ] 

Sebastian Walz commented on SPARK-8335:
---

Yeah, I am sure that it is really a scala.Double. I just looked it up again on 
GitHub, so the problem still exists on the current master branch. 

> DecisionTreeModel.predict() return type not convenient!
> ---
>
> Key: SPARK-8335
> URL: https://issues.apache.org/jira/browse/SPARK-8335
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Sebastian Walz
>Priority: Minor
>  Labels: easyfix, machine_learning
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method:
> def predict(features: JavaRDD[Vector]): JavaRDD[Double]
> The problem here is the generic type of the return type JavaRDD[Double], 
> because it's a scala.Double and I would expect a java.lang.Double (to be 
> convenient, e.g., with 
> org.apache.spark.mllib.classification.ClassificationModel).
> I wanted to extend DecisionTreeModel and use it only for binary 
> classification, and wanted to implement the trait 
> org.apache.spark.mllib.classification.ClassificationModel. But it's not 
> possible because ClassificationModel already defines the predict method, 
> just with a return type of JavaRDD[java.lang.Double]. 
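
Until the signature changes, one possible workaround (not from this ticket) is to box the 
predictions yourself; a minimal Scala sketch, assuming an already trained model and a 
JavaRDD[Vector] of features:

{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Box the scala.Double predictions into java.lang.Double so the result is
// usable where a JavaRDD[java.lang.Double] is expected.
def predictBoxed(model: DecisionTreeModel,
                 features: JavaRDD[Vector]): JavaRDD[java.lang.Double] =
  model.predict(features.rdd).map(java.lang.Double.valueOf(_)).toJavaRDD()
{code}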



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution

2015-06-15 Thread jeanlyn (JIRA)
jeanlyn created SPARK-8379:
--

 Summary: LeaseExpiredException when using dynamic partition with 
speculative execution
 Key: SPARK-8379
 URL: https://issues.apache.org/jira/browse/SPARK-8379
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.3.1, 1.3.0
Reporter: jeanlyn


When inserting into a table using dynamic partitions with *spark.speculation=true*, 
skewed data in some partitions can trigger speculative tasks, which then throw 
an exception like:
{code}
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 Lease mismatch on 
/tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo
 owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is 
accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57
{code}
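
Until this is fixed, a practical mitigation (just a sketch, not a fix from this ticket) is 
to turn speculation off for jobs that insert into dynamic partitions:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Speculative attempts race on the same dynamic-partition output file, so
// disabling speculation avoids the LeaseExpiredException at the cost of
// losing straggler mitigation for these jobs.
val conf = new SparkConf()
  .setAppName("dynamic-partition-insert")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
{code}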



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5081) Shuffle write increases

2015-06-15 Thread Roi Reshef (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roi Reshef updated SPARK-5081:
--
Comment: was deleted

(was: Hi Guys,
Was this issue already solved by any chance? I'm using Spark 1.3.1 to train an 
algorithm in an iterative fashion. Since implementing a ranking measure (which 
ultimately uses sortBy) I'm experiencing similar problems. It seems that my 
cache explodes after ~100 iterations and crashes the server with a "There is 
insufficient memory for the Java Runtime Environment to continue" message. Note 
that it isn't supposed to persist the sorted vectors nor to use them in the 
following iterations, so I wonder why memory consumption keeps growing with 
each iteration.)

> Shuffle write increases
> ---
>
> Key: SPARK-5081
> URL: https://issues.apache.org/jira/browse/SPARK-5081
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Kevin Jung
>Priority: Critical
> Attachments: Spark_Debug.pdf, diff.txt
>
>
> The size of the shuffle write shown in the Spark web UI is very different when I 
> execute the same Spark job with the same input data in Spark 1.1 and Spark 1.2. 
> At the sortBy stage, the shuffle write is 98.1MB in Spark 1.1 but 146.9MB 
> in Spark 1.2. 
> I set the spark.shuffle.manager option to hash because its default value 
> changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
> This can increase disk I/O overhead substantially as the input file gets bigger, 
> and it causes the jobs to take more time to complete. 
> In the case of about 100GB of input, for example, the shuffle write is 
> 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
> spark 1.1
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |9|saveAsTextFile| |1169.4KB| |
> |12|combineByKey| |1265.4KB|1275.0KB|
> |6|sortByKey| |1276.5KB| |
> |8|mapPartitions| |91.0MB|1383.1KB|
> |4|apply| |89.4MB| |
> |5|sortBy|155.6MB| |98.1MB|
> |3|sortBy|155.6MB| | |
> |1|collect| |2.1MB| |
> |2|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |
> spark 1.2
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |12|saveAsTextFile| |1170.2KB| |
> |11|combineByKey| |1264.5KB|1275.0KB|
> |8|sortByKey| |1273.6KB| |
> |7|mapPartitions| |134.5MB|1383.1KB|
> |5|zipWithIndex| |132.5MB| |
> |4|sortBy|155.6MB| |146.9MB|
> |3|sortBy|155.6MB| | |
> |2|collect| |2.0MB| |
> |1|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8350) R unit tests output should be logged to "unit-tests.log"

2015-06-15 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8350.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6807
[https://github.com/apache/spark/pull/6807]

> R unit tests output should be logged to "unit-tests.log"
> 
>
> Key: SPARK-8350
> URL: https://issues.apache.org/jira/browse/SPARK-8350
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.4.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 1.5.0
>
>
> Right now it's logged to "R-unit-tests.log". Jenkins currently only archives 
> files named "unit-tests.log", and this is what all other modules (e.g. SQL, 
> network, REPL) use.
> 1. We should be consistent
> 2. I don't want to reconfigure Jenkins to accept a different file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8378) Add Spark Flume Python API

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586144#comment-14586144
 ] 

Apache Spark commented on SPARK-8378:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/6830

> Add Spark Flume Python API
> --
>
> Key: SPARK-8378
> URL: https://issues.apache.org/jira/browse/SPARK-8378
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8378) Add Spark Flume Python API

2015-06-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-8378:
---

 Summary: Add Spark Flume Python API
 Key: SPARK-8378
 URL: https://issues.apache.org/jira/browse/SPARK-8378
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8378) Add Spark Flume Python API

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8378:
---

Assignee: Apache Spark

> Add Spark Flume Python API
> --
>
> Key: SPARK-8378
> URL: https://issues.apache.org/jira/browse/SPARK-8378
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8378) Add Spark Flume Python API

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8378:
---

Assignee: (was: Apache Spark)

> Add Spark Flume Python API
> --
>
> Key: SPARK-8378
> URL: https://issues.apache.org/jira/browse/SPARK-8378
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4644) Implement skewed join

2015-06-15 Thread Nathan McCarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586076#comment-14586076
 ] 

Nathan McCarthy commented on SPARK-4644:


Something like this to make working with skewed data in Spark easier would be 
very helpful.

> Implement skewed join
> -
>
> Key: SPARK-4644
> URL: https://issues.apache.org/jira/browse/SPARK-4644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
> Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have 
> several books which are liked by most of the users. Running ALS on such 
> skewed data will raise an OutOfMemory error if some book has too many users 
> to fit into memory. To solve this, we propose a skewed join 
> implementation.
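
Until a built-in skewed join exists, a common manual workaround is key salting. The sketch 
below is only an illustration of that idea (it is not the design from the attached doc): it 
scatters the skewed side across random sub-keys and replicates the smaller side once per salt.

{code}
import scala.reflect.ClassTag
import scala.util.Random

import org.apache.spark.rdd.RDD

// Join a skewed pair RDD against a smaller one by salting the keys.
def saltedJoin[K: ClassTag, V: ClassTag, W: ClassTag](
    skewed: RDD[(K, V)],
    small: RDD[(K, W)],
    numSalts: Int): RDD[(K, (V, W))] = {
  // Scatter each skewed record across numSalts sub-keys.
  val salted = skewed.map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }
  // Replicate the small side so every sub-key finds its match.
  val replicated = small.flatMap { case (k, w) => (0 until numSalts).map(s => ((k, s), w)) }
  salted.join(replicated).map { case ((k, _), vw) => (k, vw) }
}
{code}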



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8267) string function: trim

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8267:
---

Assignee: Apache Spark  (was: Cheng Hao)

> string function: trim
> -
>
> Key: SPARK-8267
> URL: https://issues.apache.org/jira/browse/SPARK-8267
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> trim(string A): string
> Returns the string resulting from trimming spaces from both ends of A. For 
> example, trim(' foobar ') results in 'foobar'
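
For reference, the same semantics can already be exercised through the Hive UDFs today (a 
sketch assuming a spark-shell whose sqlContext is a HiveContext); the native functions 
tracked by these sub-tasks should behave the same way:

{code}
// trim/ltrim/rtrim as described above.
sqlContext.sql("SELECT trim(' foobar '), ltrim(' foobar '), rtrim(' foobar ')").show()
// expected: 'foobar', 'foobar ', ' foobar'
{code}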



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8253) string function: ltrim

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8253:
---

Assignee: Apache Spark  (was: Cheng Hao)

> string function: ltrim
> --
>
> Key: SPARK-8253
> URL: https://issues.apache.org/jira/browse/SPARK-8253
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> ltrim(string A): string
> Returns the string resulting from trimming spaces from the beginning (left-hand 
> side) of A. For example, ltrim(' foobar ') results in 'foobar '.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8260) string function: rtrim

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585934#comment-14585934
 ] 

Apache Spark commented on SPARK-8260:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6762

> string function: rtrim
> --
>
> Key: SPARK-8260
> URL: https://issues.apache.org/jira/browse/SPARK-8260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> rtrim(string A): string
> Returns the string resulting from trimming spaces from the end (right-hand 
> side) of A. For example, rtrim(' foobar ') results in ' foobar'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8260) string function: rtrim

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8260:
---

Assignee: Cheng Hao  (was: Apache Spark)

> string function: rtrim
> --
>
> Key: SPARK-8260
> URL: https://issues.apache.org/jira/browse/SPARK-8260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> rtrim(string A): string
> Returns the string resulting from trimming spaces from the end (right-hand 
> side) of A. For example, rtrim(' foobar ') results in ' foobar'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8253) string function: ltrim

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8253:
---

Assignee: Cheng Hao  (was: Apache Spark)

> string function: ltrim
> --
>
> Key: SPARK-8253
> URL: https://issues.apache.org/jira/browse/SPARK-8253
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> ltrim(string A): string
> Returns the string resulting from trimming spaces from the beginning (left-hand 
> side) of A. For example, ltrim(' foobar ') results in 'foobar '.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8267) string function: trim

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585935#comment-14585935
 ] 

Apache Spark commented on SPARK-8267:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6762

> string function: trim
> -
>
> Key: SPARK-8267
> URL: https://issues.apache.org/jira/browse/SPARK-8267
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> trim(string A): string
> Returns the string resulting from trimming spaces from both ends of A. For 
> example, trim(' foobar ') results in 'foobar'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8253) string function: ltrim

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585933#comment-14585933
 ] 

Apache Spark commented on SPARK-8253:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6762

> string function: ltrim
> --
>
> Key: SPARK-8253
> URL: https://issues.apache.org/jira/browse/SPARK-8253
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> ltrim(string A): string
> Returns the string resulting from trimming spaces from the beginning (left-hand 
> side) of A. For example, ltrim(' foobar ') results in 'foobar '.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8267) string function: trim

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8267:
---

Assignee: Cheng Hao  (was: Apache Spark)

> string function: trim
> -
>
> Key: SPARK-8267
> URL: https://issues.apache.org/jira/browse/SPARK-8267
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Cheng Hao
>
> trim(string A): string
> Returns the string resulting from trimming spaces from both ends of A. For 
> example, trim(' foobar ') results in 'foobar'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8260) string function: rtrim

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8260:
---

Assignee: Apache Spark  (was: Cheng Hao)

> string function: rtrim
> --
>
> Key: SPARK-8260
> URL: https://issues.apache.org/jira/browse/SPARK-8260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> rtrim(string A): string
> Returns the string resulting from trimming spaces from the end (right-hand 
> side) of A. For example, rtrim(' foobar ') results in ' foobar'.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8373) When an RDD has no partition, Python sum will throw "Can not reduce() empty RDD"

2015-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8373:
-
Priority: Minor  (was: Major)

> When an RDD has no partition, Python sum will throw "Can not reduce() empty 
> RDD"
> 
>
> Key: SPARK-8373
> URL: https://issues.apache.org/jira/browse/SPARK-8373
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Shixiong Zhu
>Priority: Minor
>
> The issue is that "sum" uses "reduce". Replacing it with "fold" will fix 
> it.
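
The Scala API shows the same contrast (a sketch assuming a spark-shell sc); the proposed 
PySpark fix follows the fold pattern:

{code}
val empty = sc.emptyRDD[Int]

// reduce has no zero element, so an empty RDD throws
// java.lang.UnsupportedOperationException: empty collection
// empty.reduce(_ + _)

// fold starts from a zero value and simply returns it for an empty RDD.
val total = empty.fold(0)(_ + _)   // total == 0
{code}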



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8373) When an RDD has no partition, Python sum will throw "Can not reduce() empty RDD"

2015-06-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585884#comment-14585884
 ] 

Sean Owen commented on SPARK-8373:
--

Really the same as SPARK-6878
https://github.com/apache/spark/commit/51b306b930cfe03ad21af72a3a6ef31e6e626235

> When an RDD has no partition, Python sum will throw "Can not reduce() empty 
> RDD"
> 
>
> Key: SPARK-8373
> URL: https://issues.apache.org/jira/browse/SPARK-8373
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Shixiong Zhu
>
> The issue is that "sum" uses "reduce". Replacing it with "fold" will fix 
> it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names

2015-06-15 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585883#comment-14585883
 ] 

Santiago M. Mola commented on SPARK-:
-

I opened SPARK-8377 to track the general case, since I have this problem with 
other data sources, not just JDBC.

> org.apache.spark.sql.jdbc.JDBCRDD  does not escape/quote column names
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment:  
>Reporter: John Ferguson
>Priority: Critical
>
> Is there a way to have JDBC DataFrames use quoted/escaped column names?  
> Right now, it looks like it "sees" the names correctly in the schema created 
> but does not escape them in the SQL it creates when they are not compliant:
> org.apache.spark.sql.jdbc.JDBCRDD
> 
> private val columnList: String = {
>   val sb = new StringBuilder()
>   columns.foreach(x => sb.append(",").append(x))
>   if (sb.length == 0) "1" else sb.substring(1)
> }
> If you see value in this, I would take a shot at adding the quoting 
> (escaping) of column names here.  If you don't, some drivers, like 
> PostgreSQL's, will simply fold all names to lower case when parsing the query.  
> As you can see in the TL;DR below, that means they won't match the schema I am given.
> TL;DR:
>  
> I am able to connect to a Postgres database in the shell (with driver 
> referenced):
>val jdbcDf = 
> sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500")
> In fact when I run:
>jdbcDf.registerTempTable("sp500")
>val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI 
> FROM sp500")
> and
>val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share")))
> The values come back as expected.  However, if I try:
>jdbcDf.show
> Or if I try
>
>val all = sqlContext.sql("SELECT * FROM sp500")
>all.show
> I get errors about column names not being found.  In fact the error mentions 
> the column names all lower-cased.  For now I will change my schema 
> to be more restrictive.  Right now it is, per a Stack Overflow poster, not 
> ANSI compliant: it relies on things that double-quoting allows in PostgreSQL, 
> MySQL and SQL Server.  BTW, our users are giving us tables like this because 
> various tools they already use support non-compliant names.  In fact, this is 
> mild compared to what we've had to support.
> Currently the schema in question uses mixed case, quoted names with special 
> characters and spaces:
> CREATE TABLE sp500
> (
> "Symbol" text,
> "Name" text,
> "Sector" text,
> "Price" double precision,
> "Dividend Yield" double precision,
> "Price/Earnings" double precision,
> "Earnings/Share" double precision,
> "Book Value" double precision,
> "52 week low" double precision,
> "52 week high" double precision,
> "Market Cap" double precision,
> "EBITDA" double precision,
> "Price/Sales" double precision,
> "Price/Book" double precision,
> "SEC Filings" text
> ) 
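
A minimal sketch of the kind of quoting being suggested (illustration only; the quote 
character is dialect-specific, so a real fix would probably take it from the JDBC driver 
or dialect):

{code}
// Build the SELECT column list with every identifier quoted, e.g.
// "Symbol","Dividend Yield","Price/Earnings" instead of Symbol,Dividend Yield,...
def quotedColumnList(columns: Seq[String], quote: String = "\""): String =
  if (columns.isEmpty) "1"
  else columns.map(c => quote + c.replace(quote, quote + quote) + quote).mkString(",")
{code}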



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8377) Identifiers caseness information should be available at any time

2015-06-15 Thread Santiago M. Mola (JIRA)
Santiago M. Mola created SPARK-8377:
---

 Summary: Identifiers caseness information should be available at 
any time
 Key: SPARK-8377
 URL: https://issues.apache.org/jira/browse/SPARK-8377
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Santiago M. Mola


Currently, we have the option of having a case-sensitive catalog or not. A 
case-insensitive catalog just lowercases all identifiers. However, when pushing 
down to a data source, we lose the information about whether an identifier 
should be case-insensitive or strictly lowercase.

Ideally, we would be able to distinguish a case-insensitive identifier from a 
case-sensitive one.
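
One possible shape for this (purely illustrative, not an agreed design) is to carry the 
caseness flag alongside the name instead of eagerly lowercasing:

{code}
// An identifier that remembers how it should be matched when pushed down.
case class Identifier(name: String, caseSensitive: Boolean) {
  def matches(other: String): Boolean =
    if (caseSensitive) name == other else name.equalsIgnoreCase(other)
}
{code}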



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API

2015-06-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8375.
--
Resolution: Invalid

@sam This is a discussion for the mailing list rather than a JIRA.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

You're looking at an API from 4 versions ago, too.
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

The inputs are scores and ground-truth labels. I agree with the problem of many 
distinct values, but this is addressed in the newer API.
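
For reference, a minimal sketch of the newer API (assuming a spark-shell sc and the 
constructor overload that takes a bin count, which down-samples the curve instead of 
using one threshold per distinct score):

{code}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// (score, ground-truth label) pairs; labels are 0.0 or 1.0.
val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.7, 1.0), (0.4, 0.0), (0.1, 0.0)))

// numBins caps the number of thresholds, addressing the "huge number of
// distinct scores" concern from the description.
val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 100)
metrics.roc().collect().foreach(println)
println(metrics.areaUnderROC())
{code}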

> BinaryClassificationMetrics in ML Lib has odd API
> -
>
> Key: SPARK-8375
> URL: https://issues.apache.org/jira/browse/SPARK-8375
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: sam
>
> According to 
> https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
> The constructor takes `RDD[(Double, Double)]`, which does not make sense; it 
> should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`.
> In scikit I believe they use the number of unique scores to determine the 
> number of thresholds and the ROC.  I assume this is what 
> BinaryClassificationMetrics is doing since it makes no mention of buckets.  
> In a Big Data context this does not make sense as the number of unique scores 
> may be huge.  
> Rather, the user should be able to either specify the number of buckets or the 
> number of data points in each bucket, e.g. `def roc(numPtsPerBucket: Int)`.
> Finally, it would then be good if either the ROC output type was changed or 
> another method was added that returned confusion matrices, so that the hard 
> integer values can be obtained.  E.g.
> ```
> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>   // bunch of methods for each of the things in the table here 
> https://en.wikipedia.org/wiki/Receiver_operating_characteristic
> }
> ...
> def confusions(numPtsPerBucket: Int): RDD[Confusion]
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8376:
---

Assignee: (was: Apache Spark)

> Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing 
> in the docs
> 
>
> Key: SPARK-8376
> URL: https://issues.apache.org/jira/browse/SPARK-8376
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Commons Lang 3 was added as one of the dependencies of Spark Flume Sink in 
> https://github.com/apache/spark/pull/5703. However, the docs have not been 
> updated yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs

2015-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8376:
---

Assignee: Apache Spark

> Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing 
> in the docs
> 
>
> Key: SPARK-8376
> URL: https://issues.apache.org/jira/browse/SPARK-8376
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> Commons Lang 3 was added as one of the dependencies of Spark Flume Sink in 
> https://github.com/apache/spark/pull/5703. However, the docs have not been 
> updated yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs

2015-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585833#comment-14585833
 ] 

Apache Spark commented on SPARK-8376:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/6829

> Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing 
> in the docs
> 
>
> Key: SPARK-8376
> URL: https://issues.apache.org/jira/browse/SPARK-8376
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Commons Lang 3 was added as one of the dependencies of Spark Flume Sink in 
> https://github.com/apache/spark/pull/5703. However, the docs have not been 
> updated yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs

2015-06-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-8376:
---

 Summary: Commons Lang 3 is one of the required JAR of Spark Flume 
Sink but is missing in the docs
 Key: SPARK-8376
 URL: https://issues.apache.org/jira/browse/SPARK-8376
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Shixiong Zhu
Priority: Minor


Commons Lang 3 was added as one of the dependencies of Spark Flume Sink in 
https://github.com/apache/spark/pull/5703. However, the docs have not been 
updated yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API

2015-06-15 Thread sam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-8375:
---
Description: 
According to 
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

The constructor takes `RDD[(Double, Double)]`, which does not make sense; it 
should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`.

In scikit I believe they use the number of unique scores to determine the 
number of thresholds and the ROC.  I assume this is what 
BinaryClassificationMetrics is doing since it makes no mention of buckets.  In 
a Big Data context this does not make sense as the number of unique scores may 
be huge.  

Rather, the user should be able to either specify the number of buckets or the 
number of data points in each bucket, e.g. `def roc(numPtsPerBucket: Int)`.

Finally, it would then be good if either the ROC output type was changed or 
another method was added that returned confusion matrices, so that the hard 
integer values can be obtained.  E.g.

```
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
  // bunch of methods for each of the things in the table here 
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
}

...
def confusions(numPtsPerBucket: Int): RDD[Confusion]
```




  was:
According to 
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

The constructor takes `RDD[(Double, Double)]` which does not make sense it 
should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`.

In scikit I believe they use the number of unique scores to determine the 
number of thresholds and the ROC.  I assume this is what 
BinaryClassificationMetrics is doing since it makes no mention of buckets.  In 
a Big Data context this does not make as the number of unique scores may be 
huge.  

Rather user should be able to either specify the number of buckets, or the 
number of data points in each bucket.  E.g. `def roc(numPtsPerBucket: Int)`

Finally it would then be good if either the ROC output type was changed or 
another method was added that returned confusion matricies, so that the hard 
integer values can be obtained.  E.g.

```
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
  // bunch of methods for each of the things in the table here 
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
}

...
def confusions(numPtsPerBucket: Int): RDD[Confusion]
```





> BinaryClassificationMetrics in ML Lib has odd API
> -
>
> Key: SPARK-8375
> URL: https://issues.apache.org/jira/browse/SPARK-8375
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: sam
>
> According to 
> https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
> The constructor takes `RDD[(Double, Double)]`, which does not make sense; it 
> should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`.
> In scikit I believe they use the number of unique scores to determine the 
> number of thresholds and the ROC.  I assume this is what 
> BinaryClassificationMetrics is doing since it makes no mention of buckets.  
> In a Big Data context this does not make sense as the number of unique scores 
> may be huge.  
> Rather, the user should be able to either specify the number of buckets or the 
> number of data points in each bucket, e.g. `def roc(numPtsPerBucket: Int)`.
> Finally, it would then be good if either the ROC output type was changed or 
> another method was added that returned confusion matrices, so that the hard 
> integer values can be obtained.  E.g.
> ```
> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>   // bunch of methods for each of the things in the table here 
> https://en.wikipedia.org/wiki/Receiver_operating_characteristic
> }
> ...
> def confusions(numPtsPerBucket: Int): RDD[Confusion]
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API

2015-06-15 Thread sam (JIRA)
sam created SPARK-8375:
--

 Summary: BinaryClassificationMetrics in ML Lib has odd API
 Key: SPARK-8375
 URL: https://issues.apache.org/jira/browse/SPARK-8375
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: sam


According to 
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

The constructor takes `RDD[(Double, Double)]` which does not make sense it 
should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`.

In scikit I believe they use the number of unique scores to determine the 
number of thresholds and the ROC.  I assume this is what 
BinaryClassificationMetrics is doing since it makes no mention of buckets.  In 
a Big Data context this does not make as the number of unique scores may be 
huge.  

Rather user should be able to either specify the number of buckets, or the 
number of data points in each bucket.  E.g. `def roc(numPtsPerBucket: Int)`

Finally it would then be good if either the ROC output type was changed or 
another method was added that returned confusion matricies, so that the hard 
integer values can be obtained.  E.g.

```
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
  // bunch of methods for each of the things in the table here 
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
}

...
def confusions(numPtsPerBucket: Int): RDD[Confusion]
```






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2898) Failed to connect to daemon

2015-06-15 Thread Peter Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585809#comment-14585809
 ] 

Peter Taylor commented on SPARK-2898:
-

FYI 

java.io.IOException: Cannot run program "python": error=316, Unknown error: 316

I have seen this error occur on Mac because lib/jspawnhelper is missing 
execute permissions in your JRE.

> Failed to connect to daemon
> ---
>
> Key: SPARK-2898
> URL: https://issues.apache.org/jira/browse/SPARK-2898
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.1.0
>
>
> There is a deadlock  in handle_sigchld() because of logging
> 
> Java options: -Dspark.storage.memoryFraction=0.66 
> -Dspark.serializer=org.apache.spark.serializer.JavaSerializer 
> -Dspark.executor.memory=3g -Dspark.locality.wait=6000
> Options: SchedulerThroughputTest --num-tasks=1 --num-trials=4 
> --inter-trial-wait=1
> 
> 14/08/06 22:09:41 WARN JettyUtils: Failed to create UI on port 4040. Trying 
> again on port 4041. - Failure(java.net.BindException: Address already in use)
> worker 50114 crashed abruptly with exit status 1
> 14/08/06 22:10:37 ERROR Executor: Exception in task 1476.0 in stage 1.0 (TID 
> 11476)
> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:150)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>   at 
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:101)
>   ... 10 more
> 14/08/06 22:10:37 WARN PythonWorkerFactory: Failed to open socket to Python 
> daemon:
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:579)
>   at java.net.Socket.connect(Socket.java:528)
>   at java.net.Socket.<init>(Socket.java:425)
>   at java.net.Socket.<init>(Socket.java:241)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:68)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/08/06 22:10:37 ERROR Executor: Exception in task 1478.0 in stage 1.0 (TID 
> 11478)
> java.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:69)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.li

[jira] [Created] (SPARK-8374) Job frequently hangs after YARN preemption

2015-06-15 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-8374:


 Summary: Job frequently hangs after YARN preemption
 Key: SPARK-8374
 URL: https://issues.apache.org/jira/browse/SPARK-8374
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
Reporter: Shay Rojansky
Priority: Critical


After upgrading to Spark 1.4.0, preempted jobs very frequently fail to 
reacquire executors and therefore hang. To reproduce:

1. I run Spark job A that acquires all grid resources
2. I run Spark job B in a higher-priority queue that acquires all grid 
resources. Job A is fully preempted.
3. Kill job B, releasing all resources
4. Job A should at this point reacquire all grid resources, but occasionally 
doesn't. Repeating the preemption scenario makes the bad behavior occur within 
a few attempts.

(see logs at bottom).

Note issue SPARK-7451 that was supposed to fix some Spark YARN preemption 
issues, maybe the work there is related to the new issues.

The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've 
downgraded to 1.3.1 just because of this issue).

Logs
--
When job B (the preemptor) first acquires an application master, the following 
is logged by job A (the preemptee):

{noformat}
ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc 
client disassociated
INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
WARN ReliableDeliverySupervisor: Association with remote system 
[akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is 
now gated for [5000] ms. Reason is: [Disassociated].
WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, 
g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
INFO DAGScheduler: Executor lost: 447 (epoch 0)
INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from 
BlockManagerMaster.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, 
g023.grid.eaglerd.local, 41406)
INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
{noformat}

(It's strange for errors/warnings to be logged for preemption)

Later, when job B's AM starts requesting its resources, I get lots of the 
following in job A:

{noformat}
ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc 
client disassociated
INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, 
g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
WARN ReliableDeliverySupervisor: Association with remote system 
[akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is 
now gated for [5000] ms. Reason is: [Disassociated].
{noformat}

Finally, when I kill job B, job A emits lots of the following:

{noformat}
INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
{noformat}

And finally after some time:

{noformat}
WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 
ms exceeds timeout 12 ms
ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat 
timed out after 165964 ms
{noformat}

At this point the job never requests/acquires more resources and hangs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


