[jira] [Assigned] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6941:
---

Assignee: Yijie Shen  (was: Apache Spark)

 Provide a better error message to explain that tables created from RDDs are 
 immutable
 -

 Key: SPARK-6941
 URL: https://issues.apache.org/jira/browse/SPARK-6941
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yijie Shen
Priority: Blocker

 We should explicitly let users know that tables created from RDDs are 
 immutable and that new rows cannot be inserted into them. We can add a better 
 error message and also explain it in the programming guide.
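
For context, a minimal sketch of the scenario this ticket targets (hypothetical table and column names, assuming a SQLContext in the shell): the INSERT below is expected to fail, and the ask is that it fail with a message that explicitly says the table is RDD-backed and immutable rather than a generic analysis error.
{code}
// Hypothetical repro: a temp table backed by an RDD/DataFrame is not insertable.
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("key", "value")
df.registerTempTable("rdd_table")
// Expected to fail; this ticket asks for an error that explicitly says
// tables created from RDDs are immutable, plus a note in the programming guide.
sqlContext.sql("INSERT INTO TABLE rdd_table SELECT 3, 'c'")
{code}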






[jira] [Assigned] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6941:
---

Assignee: Apache Spark  (was: Yijie Shen)

 Provide a better error message to explain that tables created from RDDs are 
 immutable
 -

 Key: SPARK-6941
 URL: https://issues.apache.org/jira/browse/SPARK-6941
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Apache Spark
Priority: Blocker

 We should explicitly let users know that tables created from RDDs are 
 immutable and that new rows cannot be inserted into them. We can add a better 
 error message and also explain it in the programming guide.






[jira] [Commented] (SPARK-8951) support CJK characters in collect()

2015-07-10 Thread Jaehong Choi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621834#comment-14621834
 ] 

Jaehong Choi commented on SPARK-8951:
-

Thanks for your advice.
You're right. This is about supporting Unicode indeed. I'll open a PR for this 
issue. 
I didn't think about the null terminator much. I am going to figure it out as 
well. 

 support CJK characters in collect()
 ---

 Key: SPARK-8951
 URL: https://issues.apache.org/jira/browse/SPARK-8951
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Jaehong Choi
Priority: Minor
 Attachments: SerDe.scala.diff


 Spark gives an error message and does not show the output when a field of the 
 result DataFrame contains CJK characters.
 I found out that SerDe in the R API only supports ASCII strings right now, as 
 noted in a comment in the source code.
 So I modified SerDe.scala a little to support CJK, as in the attached file.
 I did not care about efficiency; I just wanted to see if it works.
 {noformat}
 people.json
 {"name":"가나"}
 {"name":"테스트123", "age":30}
 {"name":"Justin", "age":19}

 df <- read.df(sqlContext, "./people.json", "json")
 head(df)
 Error in rawToChar(string) : embedded nul in string : '\0 \x98'
 {noformat}
 {code:title=core/src/main/scala/org/apache/spark/api/r/SerDe.scala}
   // NOTE: Only works for ASCII right now
   def writeString(out: DataOutputStream, value: String): Unit = {
     val len = value.length
     out.writeInt(len + 1) // For the \0
     out.writeBytes(value)
     out.writeByte(0)
   }
 {code}
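
A minimal sketch of the kind of change the attachment presumably makes (not the merged fix): encode the string as UTF-8 and length-prefix the byte count, so multi-byte CJK characters survive the round trip to R. The assumption here is that the R side reads a length-prefixed, NUL-terminated UTF-8 byte sequence.
{code}
import java.io.DataOutputStream
import java.nio.charset.StandardCharsets

// Sketch only: write the UTF-8 byte length (plus the trailing \0 the R side
// expects) instead of the character count, then the raw UTF-8 bytes.
def writeString(out: DataOutputStream, value: String): Unit = {
  val utf8 = value.getBytes(StandardCharsets.UTF_8)
  out.writeInt(utf8.length + 1) // +1 for the \0
  out.write(utf8, 0, utf8.length)
  out.writeByte(0)
}
{code}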






[jira] [Commented] (SPARK-8923) Add @since tags to mllib.fpm

2015-07-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621821#comment-14621821
 ] 

Apache Spark commented on SPARK-8923:
-

User 'rahulpalamuttam' has created a pull request for this issue:
https://github.com/apache/spark/pull/7341

 Add @since tags to mllib.fpm
 

 Key: SPARK-8923
 URL: https://issues.apache.org/jira/browse/SPARK-8923
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
   Original Estimate: 0.5h
  Remaining Estimate: 0.5h








[jira] [Assigned] (SPARK-8923) Add @since tags to mllib.fpm

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8923:
---

Assignee: (was: Apache Spark)

 Add @since tags to mllib.fpm
 

 Key: SPARK-8923
 URL: https://issues.apache.org/jira/browse/SPARK-8923
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
   Original Estimate: 0.5h
  Remaining Estimate: 0.5h








[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)

2015-07-10 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621832#comment-14621832
 ] 

Shay Rojansky commented on SPARK-7736:
--

Neelesh, I'm not sure I understood exactly what you're saying... I agree with 
Esben that, at the end of the day, if a Spark application fails (by throwing an 
exception) and does so on all YARN application attempts, then the YARN status 
of that application should definitely be FAILED...

 Exception not failing Python applications (in yarn cluster mode)
 

 Key: SPARK-7736
 URL: https://issues.apache.org/jira/browse/SPARK-7736
 Project: Spark
  Issue Type: Bug
  Components: YARN
 Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04
Reporter: Shay Rojansky

 It seems that exceptions thrown in Python Spark apps after the SparkContext 
 is instantiated don't cause the application to fail, at least in Yarn: the 
 application is marked as SUCCEEDED.
 Note that any exception thrown before the SparkContext is instantiated 
 correctly places the application in the FAILED state.






[jira] [Created] (SPARK-8973) Spark Executor usage Cpu 100+%

2015-07-10 Thread Xu Chen (JIRA)
Xu Chen created SPARK-8973:
--

 Summary: Spark Executor usage Cpu 100+% 
 Key: SPARK-8973
 URL: https://issues.apache.org/jira/browse/SPARK-8973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Xu Chen


Spark Executor usage Cpu 100+%  

Using the Spark SQL CLI to count a CACHE TABLE, the top command shows some 
Spark executor processes using 100%+ CPU. Checking one of them with jstack 
shows this thread:
{code:java}
   "Executor task launch worker-1" daemon prio=10 tid=0x7fc9983eb000 nid=0x2f3 runnable [0x7fc9893f9000]
   java.lang.Thread.State: RUNNABLE
at scala.collection.mutable.HashMap.update(HashMap.scala:80)
at 
org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.gatherCompressibilityStats(compressionSchemes.scala:233)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.gatherCompressibilityStats(CompressibleColumnBuilder.scala:72)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:80)
at 
org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
at 
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
at 
org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1187)
at 
org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply$mcV$sp(DiskStore.scala:81)
at 
org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
at 
org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:82)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:788)
- locked 0x0007a9471e30 (a org.apache.spark.storage.BlockInfo)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
 
{code}










[jira] [Commented] (SPARK-6941) Provide a better error message to explain that tables created from RDDs are immutable

2015-07-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621827#comment-14621827
 ] 

Apache Spark commented on SPARK-6941:
-

User 'yijieshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7342

 Provide a better error message to explain that tables created from RDDs are 
 immutable
 -

 Key: SPARK-6941
 URL: https://issues.apache.org/jira/browse/SPARK-6941
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yijie Shen
Priority: Blocker

 We should explicitly let users know that tables created from RDDs are 
 immutable and that new rows cannot be inserted into them. We can add a better 
 error message and also explain it in the programming guide.






[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-07-10 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621830#comment-14621830
 ] 

zhengruifeng commented on SPARK-7008:
-

Yes, L-BFGS provides a faster convergence rate.

 An implementation of Factorization Machine (LibFM)
 --

 Key: SPARK-7008
 URL: https://issues.apache.org/jira/browse/SPARK-7008
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: zhengruifeng
  Labels: features
 Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
 QQ20150421-2.png


 An implementation of Factorization Machines based on Scala and Spark MLlib.
 FM is a machine learning algorithm for multi-linear regression and is widely 
 used for recommendation.
 FM has performed well in recent years' recommendation competitions.
 Ref:
 http://libfm.org/
 http://doi.acm.org/10.1145/2168752.2168771
 http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf
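
For readers unfamiliar with FM, a small illustrative sketch of the degree-2 model from the Rendle paper linked above (not the attached implementation), using the O(k*n) reformulation of the pairwise term; all names below are hypothetical.
{code}
// y(x) = w0 + sum_i w_i*x_i + 1/2 * sum_f [ (sum_i v(i)(f)*x_i)^2 - sum_i (v(i)(f)*x_i)^2 ]
def fmPredict(x: Array[Double], w0: Double, w: Array[Double], v: Array[Array[Double]], k: Int): Double = {
  val linear = x.indices.map(i => w(i) * x(i)).sum
  var pairwise = 0.0
  for (f <- 0 until k) {
    var sum = 0.0
    var sumSq = 0.0
    for (i <- x.indices) {
      val t = v(i)(f) * x(i)
      sum += t
      sumSq += t * t
    }
    pairwise += sum * sum - sumSq
  }
  w0 + linear + 0.5 * pairwise
}
{code}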






[jira] [Comment Edited] (SPARK-7751) Add @since to stable and experimental methods in MLlib

2015-07-10 Thread Patrick Baier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620486#comment-14620486
 ] 

Patrick Baier edited comment on SPARK-7751 at 7/10/15 7:19 AM:
---

I built this short bash script to search for the version in which a method was 
introduced:
{code:borderStyle=solid}
#$1=source file to search
#$2=string to search for
versions=$(git tag)
for v in $versions
do
    echo "Checking version $v"
    versionedFile=$(git show "$v:$1")
    matches=$(echo "$versionedFile" | grep -c "$2")
    if [ "$matches" -gt 0 ]
    then
        echo "Introduced in version $v"
        exit 0
    fi
done
echo "search string $2 not found!"
{code}

Note: You must be in the spark home directory to run it.
Example usage:
$1=mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
$2="override protected def createModel"



was (Author: pbaier):
I built this short batch script here to search for the version of methods:
{code:borderStyle=solid}
#$1=sourceFile to search
#$2=string to search for
versions=$(git tag)
for v in $versions
do
echo Checking version $v
versionedFile=$(git show $v:$1)
matches=$(echo $versionedFile | grep -c $2)
if [ $matches -gt 0 ]
then
echo Introduced in version $v
exit 0 
fi
done
echo search string $2 not found!
{code}

Note: You must be in the spark home directory to run it.
Example usage:
$1=mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
$2=override protected def createModel


 Add @since to stable and experimental methods in MLlib
 --

 Key: SPARK-7751
 URL: https://issues.apache.org/jira/browse/SPARK-7751
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
  Labels: starter

 This is useful to check whether a feature exists in some version of Spark. 
 This is an umbrella JIRA to track the progress. We want to have @since tag 
 for both stable (those without any Experimental/DeveloperApi/AlphaComponent 
 annotations) and experimental methods in MLlib:
 * an example PR for Scala: https://github.com/apache/spark/pull/6101
 * an example PR for Python: https://github.com/apache/spark/pull/6295
 We need to dig into the git history to figure out the Spark version in which 
 a method was first introduced. Take `NaiveBayes.setModelType` as an example: 
 we can grep for `def setModelType` at different version tags.
 {code}
 meng@xm:~/src/spark
 $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
 meng@xm:~/src/spark
 $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
   def setModelType(modelType: String): NaiveBayes = {
 {code}
 If there are better ways, please let us know.
 We cannot add all @since tags in a single PR, which is hard to review. So we 
 made some subtasks for each package, for example 
 `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
 and the `spark.ml` package.






[jira] [Resolved] (SPARK-6051) Add an option for DirectKafkaInputDStream to commit the offsets into ZK

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6051.
--
Resolution: Won't Fix

 Add an option for DirectKafkaInputDStream to commit the offsets into ZK
 ---

 Key: SPARK-6051
 URL: https://issues.apache.org/jira/browse/SPARK-6051
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao

 Currently in DirectKafkaInputDStream, offsets are managed by Spark Streaming 
 itself without ZK or Kafka involved, which makes several third-party offset 
 monitoring tools unable to monitor the status of the Kafka consumer. So this 
 adds an option to commit the offsets to ZK when each job finishes; the commit 
 is done asynchronously, so the main processing flow is not blocked. It has 
 already been tested with the KafkaOffsetMonitor tool.
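
A rough sketch of what such an option amounts to at the application level today, assuming directKafkaStream was created with KafkaUtils.createDirectStream and with zkCommit standing in for whatever ZooKeeper client the application uses (both names are placeholders, not Spark API):
{code}
import org.apache.spark.streaming.kafka.HasOffsetRanges
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Placeholder for an app-specific ZooKeeper write, so that tools like
// KafkaOffsetMonitor can see the consumer's progress.
def zkCommit(topic: String, partition: Int, offset: Long): Unit = {
  /* app-specific ZooKeeper write goes here */
}

directKafkaStream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  Future { // asynchronous: the main processing flow is not blocked
    ranges.foreach(r => zkCommit(r.topic, r.partition, r.untilOffset))
  }
}
{code}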






[jira] [Resolved] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3219.
--
Resolution: Won't Fix

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns
  Labels: clustering

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.
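
To make the "pluggable distance function" idea concrete, a hypothetical sketch (not the author's fork) of a Bregman divergence interface, with squared Euclidean distance recovered from phi(x) = ||x||^2:
{code}
// d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
trait BregmanDivergence {
  def phi(x: Array[Double]): Double
  def gradPhi(x: Array[Double]): Array[Double]
  def divergence(x: Array[Double], y: Array[Double]): Double = {
    val g = gradPhi(y)
    phi(x) - phi(y) - x.indices.map(i => g(i) * (x(i) - y(i))).sum
  }
}

// With phi(x) = ||x||^2 this reduces to the squared Euclidean distance,
// i.e. the divergence k-means already minimizes.
object SquaredEuclidean extends BregmanDivergence {
  def phi(x: Array[Double]): Double = x.map(v => v * v).sum
  def gradPhi(x: Array[Double]): Array[Double] = x.map(_ * 2.0)
}
{code}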






[jira] [Assigned] (SPARK-8972) Incorrect result for rollup

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8972:
---

Assignee: Apache Spark

 Incorrect result for rollup
 ---

 Key: SPARK-8972
 URL: https://issues.apache.org/jira/browse/SPARK-8972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark
Priority: Critical

 {code:java}
 import sqlContext.implicits._
 case class KeyValue(key: Int, value: String)
 val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
 df.registerTempTable("foo")
 sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key % 100 with rollup").show(100)
 // output
 +---+---+------------+
 |cnt|_c1|GROUPING__ID|
 +---+---+------------+
 |  1|  4|           0|
 |  1|  4|           1|
 |  1|  5|           0|
 |  1|  5|           1|
 |  1|  1|           0|
 |  1|  1|           1|
 |  1|  2|           0|
 |  1|  2|           1|
 |  1|  3|           0|
 |  1|  3|           1|
 +---+---+------------+
 {code}
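
As a rough cross-check of what a correct rollup should look like: with 5 distinct keys there should be one detail row per key (cnt = 1) plus a single grand-total row (cnt = 5), rather than the duplicated per-key rows above. A hypothetical sketch via the DataFrame API, assuming the rollup method is available in this build:
{code}
// Hypothetical cross-check via the DataFrame API on the same data.
val df2 = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF()
df2.rollup(df2("key") % 100).count().show()
{code}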






[jira] [Resolved] (SPARK-8973) Spark Executor usage Cpu 100+%

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8973.
--
  Resolution: Not A Problem
Target Version/s:   (was: 1.4.0)

Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Just having a busy executor is not a problem. You'd have to state a clearer 
problem.

 Spark Executor usage Cpu 100+% 
 ---

 Key: SPARK-8973
 URL: https://issues.apache.org/jira/browse/SPARK-8973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Xu Chen

 Spark Executor usage Cpu 100+%  
 Using the Spark SQL CLI to count a CACHE TABLE, the top command shows some 
 Spark executor processes using 100%+ CPU. Checking one of them with jstack 
 shows this thread:
 {code:java}
   "Executor task launch worker-1" daemon prio=10 tid=0x7fc9983eb000 nid=0x2f3 runnable [0x7fc9893f9000]
java.lang.Thread.State: RUNNABLE
   at scala.collection.mutable.HashMap.update(HashMap.scala:80)
   at 
 org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.gatherCompressibilityStats(compressionSchemes.scala:233)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.gatherCompressibilityStats(CompressibleColumnBuilder.scala:72)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:80)
   at 
 org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:148)
   at 
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:124)
   at scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
   at 
 org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
   at 
 org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1187)
   at 
 org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply$mcV$sp(DiskStore.scala:81)
   at 
 org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
   at 
 org.apache.spark.storage.DiskStore$$anonfun$putIterator$1.apply(DiskStore.scala:81)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
   at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:82)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:788)
   - locked 0x0007a9471e30 (a org.apache.spark.storage.BlockInfo)
   at 
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
  
 {code}






[jira] [Commented] (SPARK-8972) Incorrect result for rollup

2015-07-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621929#comment-14621929
 ] 

Apache Spark commented on SPARK-8972:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/7343

 Incorrect result for rollup
 ---

 Key: SPARK-8972
 URL: https://issues.apache.org/jira/browse/SPARK-8972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Critical

 {code:java}
 import sqlContext.implicits._
 case class KeyValue(key: Int, value: String)
 val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
 df.registerTempTable("foo")
 sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key % 100 with rollup").show(100)
 // output
 +---+---+------------+
 |cnt|_c1|GROUPING__ID|
 +---+---+------------+
 |  1|  4|           0|
 |  1|  4|           1|
 |  1|  5|           0|
 |  1|  5|           1|
 |  1|  1|           0|
 |  1|  1|           1|
 |  1|  2|           0|
 |  1|  2|           1|
 |  1|  3|           0|
 |  1|  3|           1|
 +---+---+------------+
 {code}






[jira] [Updated] (SPARK-8968) dynamic partitioning in spark sql performance issue due to the high GC overhead

2015-07-10 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang updated SPARK-8968:

Summary: dynamic partitioning in spark sql performance issue due to the 
high GC overhead  (was: shuffled by the partition columns when dynamic 
partitioning to optimize the memory overhead)

 dynamic partitioning in spark sql performance issue due to the high GC 
 overhead
 ---

 Key: SPARK-8968
 URL: https://issues.apache.org/jira/browse/SPARK-8968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Fei Wang

 Dynamic partitioning currently shows poor performance for big data due to 
 GC/memory overhead. This is because each task opens a writer per partition, 
 which produces many small files and high GC pressure. We could shuffle the 
 data by the partition columns so that each partition has only one file, which 
 would also reduce the GC overhead.
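
One way to picture the proposal at the query level (a hypothetical example with made-up table and column names): shuffle the rows by the dynamic partition column first, so each task writes a single file per partition instead of every task opening a writer for every partition it sees.
{code}
// Hypothetical illustration: DISTRIBUTE BY repartitions rows by dt before
// the dynamic-partition insert writes files.
sqlContext.sql("""
  INSERT OVERWRITE TABLE target PARTITION (dt)
  SELECT col1, col2, dt FROM source
  DISTRIBUTE BY dt
""")
{code}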






[jira] [Commented] (SPARK-7018) Refactor dev/run-tests-jenkins into Python

2015-07-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621885#comment-14621885
 ] 

Josh Rosen commented on SPARK-7018:
---

Hey [~boyork], any updates on this refactoring?  If you're interested, I'm 
available to chat to discuss whether we should break this up into a series of 
smaller incremental subtasks (e.g. leaving the `dev/tests` or some of the 
linting integration scripts as bash for now).  The Jenkins script has become 
moderately complicated so we may need to think about whether we need to do any 
re-architecting as part of this refactoring.

 Refactor dev/run-tests-jenkins into Python
 --

 Key: SPARK-7018
 URL: https://issues.apache.org/jira/browse/SPARK-7018
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Project Infra
Reporter: Brennon York

 This issue is to specifically track the progress of the 
 {{dev/run-tests-jenkins}} script into Python.






[jira] [Commented] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-07-10 Thread Ma Xiaoyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621900#comment-14621900
 ] 

Ma Xiaoyu commented on SPARK-6882:
--

Can you try adding it to the classpath in spark-env.sh and make sure it stays 
before the other jars?

 Spark ThriftServer2 Kerberos failed encountering 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 

 Key: SPARK-6882
 URL: https://issues.apache.org/jira/browse/SPARK-6882
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.3.0, 1.4.0
 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled
 * Apache Hive 0.13.1
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
Reporter: Andrew Lee

 When Kerberos is enabled, I get the following exceptions. 
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 {code}
 I tried it in
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
 with
 * Apache Hive 0.13.1
 * Apache Hadoop 2.4.1
 Build command
 {code}
 mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver 
 -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests 
 install
 {code}
 When starting Spark ThriftServer in {{yarn-client}} mode, the command to 
 start thriftserver looks like this
 {code}
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf 
 hive.server2.thrift.bind.host=$(hostname) --master yarn-client
 {code}
 {{hostname}} points to the current hostname of the machine I'm using.
 Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
 at 
 org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 I'm wondering if this is due to the same problem described in HIVE-8154 and 
 HIVE-7620, caused by an older code base for the Spark ThriftServer?
 Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to 
 run against a Kerberos cluster (Apache 2.4.1).
 My hive-site.xml in spark/conf looks like the following.
 The Kerberos keytab and TGT are configured correctly; I'm able to connect to 
 the metastore, but the subsequent steps fail due to the exception.
 {code}
 <property>
   <name>hive.semantic.analyzer.factory.impl</name>
   <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
 </property>
 <property>
   <name>hive.metastore.execute.setugi</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.stats.autogather</name>
   <value>false</value>
 </property>
 <property>
   <name>hive.session.history.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.querylog.location</name>
   <value>/tmp/home/hive/log/${user.name}</value>
 </property>
 <property>
   <name>hive.exec.local.scratchdir</name>
   <value>/tmp/hive/scratch/${user.name}</value>
 </property>
 <property>
   <name>hive.metastore.uris</name>
   <value>thrift://somehostname:9083</value>
 </property>
 <!-- HIVE SERVER 2 -->
 <property>
   <name>hive.server2.authentication</name>
   <value>KERBEROS</value>
 </property>
 <property>
   <name>hive.server2.authentication.kerberos.principal</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.server2.authentication.kerberos.keytab</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.server2.thrift.sasl.qop</name>
   <value>auth</value>
   <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
 </property>
 <property>
   <name>hive.server2.enable.impersonation</name>
   <description>Enable user impersonation for HiveServer2</description>
   <value>true</value>
 </property>
 <!-- HIVE METASTORE -->
 <property>
   <name>hive.metastore.sasl.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.metastore.kerberos.keytab.file</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.metastore.kerberos.principal</name>
   

[jira] [Commented] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-07-10 Thread Ma Xiaoyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621899#comment-14621899
 ] 

Ma Xiaoyu commented on SPARK-6882:
--

Can you try adding it to the classpath in spark-env.sh and make sure it stays 
before the other jars?

 Spark ThriftServer2 Kerberos failed encountering 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 

 Key: SPARK-6882
 URL: https://issues.apache.org/jira/browse/SPARK-6882
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.3.0, 1.4.0
 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled
 * Apache Hive 0.13.1
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
Reporter: Andrew Lee

 When Kerberos is enabled, I get the following exceptions. 
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 {code}
 I tried it in
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
 with
 * Apache Hive 0.13.1
 * Apache Hadoop 2.4.1
 Build command
 {code}
 mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver 
 -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests 
 install
 {code}
 When starting Spark ThriftServer in {{yarn-client}} mode, the command to 
 start thriftserver looks like this
 {code}
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf 
 hive.server2.thrift.bind.host=$(hostname) --master yarn-client
 {code}
 {{hostname}} points to the current hostname of the machine I'm using.
 Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
 at 
 org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 I'm wondering if this is due to the same problem described in HIVE-8154 and 
 HIVE-7620, caused by an older code base for the Spark ThriftServer?
 Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to 
 run against a Kerberos cluster (Apache 2.4.1).
 My hive-site.xml in spark/conf looks like the following.
 The Kerberos keytab and TGT are configured correctly; I'm able to connect to 
 the metastore, but the subsequent steps fail due to the exception.
 {code}
 <property>
   <name>hive.semantic.analyzer.factory.impl</name>
   <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
 </property>
 <property>
   <name>hive.metastore.execute.setugi</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.stats.autogather</name>
   <value>false</value>
 </property>
 <property>
   <name>hive.session.history.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.querylog.location</name>
   <value>/tmp/home/hive/log/${user.name}</value>
 </property>
 <property>
   <name>hive.exec.local.scratchdir</name>
   <value>/tmp/hive/scratch/${user.name}</value>
 </property>
 <property>
   <name>hive.metastore.uris</name>
   <value>thrift://somehostname:9083</value>
 </property>
 <!-- HIVE SERVER 2 -->
 <property>
   <name>hive.server2.authentication</name>
   <value>KERBEROS</value>
 </property>
 <property>
   <name>hive.server2.authentication.kerberos.principal</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.server2.authentication.kerberos.keytab</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.server2.thrift.sasl.qop</name>
   <value>auth</value>
   <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
 </property>
 <property>
   <name>hive.server2.enable.impersonation</name>
   <description>Enable user impersonation for HiveServer2</description>
   <value>true</value>
 </property>
 <!-- HIVE METASTORE -->
 <property>
   <name>hive.metastore.sasl.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.metastore.kerberos.keytab.file</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.metastore.kerberos.principal</name>
   

[jira] [Commented] (SPARK-8968) shuffled by the partition columns when dynamic partitioning to optimize the memory overhead

2015-07-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621923#comment-14621923
 ] 

Sean Owen commented on SPARK-8968:
--

[~scwf] can you reword this? I can't make out what you're describing in the 
title or description.

 shuffled by the partition columns when dynamic partitioning to optimize the 
 memory overhead
 --

 Key: SPARK-8968
 URL: https://issues.apache.org/jira/browse/SPARK-8968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Fei Wang

 Dynamic partitioning currently shows poor performance for big data due to 
 GC/memory overhead. This is because each task opens a writer per partition, 
 which produces many small files and high GC pressure. We could shuffle the 
 data by the partition columns so that each partition has only one file, which 
 would also reduce the GC overhead.






[jira] [Assigned] (SPARK-8972) Incorrect result for rollup

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8972:
---

Assignee: (was: Apache Spark)

 Incorrect result for rollup
 ---

 Key: SPARK-8972
 URL: https://issues.apache.org/jira/browse/SPARK-8972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Critical

 {code:java}
 import sqlContext.implicits._
 case class KeyValue(key: Int, value: String)
 val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF
 df.registerTempTable("foo")
 sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key % 100 with rollup").show(100)
 // output
 +---+---+------------+
 |cnt|_c1|GROUPING__ID|
 +---+---+------------+
 |  1|  4|           0|
 |  1|  4|           1|
 |  1|  5|           0|
 |  1|  5|           1|
 |  1|  1|           0|
 |  1|  1|           1|
 |  1|  2|           0|
 |  1|  2|           1|
 |  1|  3|           0|
 |  1|  3|           1|
 +---+---+------------+
 {code}






[jira] [Updated] (SPARK-8839) Thrift Server will throw a `java.util.NoSuchElementException: key not found` exception when many clients connect to it

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8839:
-
Assignee: SaintBacchus

 Thrift Server will throw a `java.util.NoSuchElementException: key not found` 
 exception when many clients connect to it
 -

 Key: SPARK-8839
 URL: https://issues.apache.org/jira/browse/SPARK-8839
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: SaintBacchus
Assignee: SaintBacchus
 Fix For: 1.5.0


 If there are about 150+ JDBC clients connecting to the Thrift Server, some 
 clients will throw an exception such as:
 {code:title=Exception message|borderStyle=solid}
 java.sql.SQLException: java.util.NoSuchElementException: key not found: 
 90d93e56-7f6d-45bf-b340-e3ee09dd60fc
  at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:155)
 {code}






[jira] [Commented] (SPARK-8968) dynamic partitioning in spark sql performance issue due to the high GC overhead

2015-07-10 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621935#comment-14621935
 ] 

Fei Wang commented on SPARK-8968:
-

changed, how about this?

 dynamic partitioning in spark sql performance issue due to the high GC 
 overhead
 ---

 Key: SPARK-8968
 URL: https://issues.apache.org/jira/browse/SPARK-8968
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Fei Wang

 Dynamic partitioning currently shows poor performance for big data due to 
 GC/memory overhead. This is because each task opens a writer per partition, 
 which produces many small files and high GC pressure. We could shuffle the 
 data by the partition columns so that each partition has only one file, which 
 would also reduce the GC overhead.






[jira] [Updated] (SPARK-8940) Don't overwrite given schema if it is not null for createDataFrame in SparkR

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8940:
-
Assignee: Liang-Chi Hsieh

 Don't overwrite given schema if it is not null for createDataFrame in SparkR
 

 Key: SPARK-8940
 URL: https://issues.apache.org/jira/browse/SPARK-8940
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
 Fix For: 1.5.0


 The given schema parameter will be overwritten in createDataFrame now. If it 
 is not null, we shouldn't overwrite it.






[jira] [Updated] (SPARK-8830) levenshtein directly on top of UTF8String

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8830:
-
Assignee: Tarek Auel

 levenshtein directly on top of UTF8String
 -

 Key: SPARK-8830
 URL: https://issues.apache.org/jira/browse/SPARK-8830
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Tarek Auel
 Fix For: 1.5.0


 We currently rely on commons-lang's levenshtein implementation. Ideally, we 
 should have our own implementation to:
 1. Reduce external dependency
 2. Work directly against UTF8String so we don't need to convert to/from 
 java.lang.String back and forth.
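
For reference, a generic two-row Levenshtein sketch over plain Strings (illustration only; the point of this ticket is to run the same dynamic program directly over UTF8String bytes/code points so no conversion is needed):
{code}
def levenshtein(a: String, b: String): Int = {
  // Classic edit-distance DP kept to two rows of size |b| + 1.
  var prev = Array.tabulate(b.length + 1)(identity)
  var curr = new Array[Int](b.length + 1)
  for (i <- 1 to a.length) {
    curr(0) = i
    for (j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
    }
    val tmp = prev; prev = curr; curr = tmp
  }
  prev(b.length)
}
{code}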






[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2015-07-10 Thread Walter Petersen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622041#comment-14622041
 ] 

Walter Petersen commented on SPARK-3155:


Hi all,

I'm new here. Please tell me:
- Is the proposed implementation based on a well-known research paper? If so, 
which one?
- Is this issue still relevant? Is someone currently implementing the feature? 

Thanks

 Support DecisionTree pruning
 

 Key: SPARK-3155
 URL: https://issues.apache.org/jira/browse/SPARK-3155
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 Improvement: accuracy, computation
 Summary: Pruning is a common method for preventing overfitting with decision 
 trees.  A smart implementation can prune the tree during training in order to 
 avoid training parts of the tree which would be pruned eventually anyways.  
 DecisionTree does not currently support pruning.
 Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
 with zero or more branches removed.
 A naive implementation prunes as follows:
 (1) Train a depth K tree using a training set.
 (2) Compute the optimal prediction at each node (including internal nodes) 
 based on the training set.
 (3) Take a held-out validation set, and use the tree to make predictions for 
 each validation example.  This allows one to compute the validation error 
 made at each node in the tree (based on the predictions computed in step (2).)
 (4) For each pair of leafs with the same parent, compare the total error on 
 the validation set made by the leafs’ predictions with the error made by the 
 parent’s predictions.  Remove the leafs if the parent has lower error.
 A smarter implementation prunes during training, computing the error on the 
 validation set made by each node as it is trained.  Whenever two children 
 increase the validation error, they are pruned, and no more training is 
 required on that branch.
 It is common to use about 1/3 of the data for pruning.  Note that pruning is 
 important when using a tree directly for prediction.  It is less important 
 when combining trees via ensemble methods.
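
A minimal, hypothetical sketch of the naive post-pruning described in steps (1)-(4) above (not an MLlib API): prune a pair of sibling leaves whenever the parent's prediction does at least as well on the validation set as the two leaves combined.
{code}
case class Node(prediction: Double,
                var left: Option[Node] = None,
                var right: Option[Node] = None) {
  def isLeaf: Boolean = left.isEmpty && right.isEmpty
}

// validationError(n) is assumed to return the total error that n's prediction
// incurs on the validation examples routed to n (steps 2-3 above).
def prune(node: Node, validationError: Node => Double): Unit = {
  node.left.foreach(prune(_, validationError))
  node.right.foreach(prune(_, validationError))
  (node.left, node.right) match {
    case (Some(l), Some(r)) if l.isLeaf && r.isLeaf =>
      // Step 4: drop the leaves if the parent alone has no higher error.
      if (validationError(node) <= validationError(l) + validationError(r)) {
        node.left = None
        node.right = None
      }
    case _ => // keep internal structure with non-leaf children
  }
}
{code}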






[jira] [Created] (SPARK-8974) The spark-dynamic-executor-allocation may be not supported

2015-07-10 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-8974:


 Summary: The spark-dynamic-executor-allocation may be not supported
 Key: SPARK-8974
 URL: https://issues.apache.org/jira/browse/SPARK-8974
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: KaiXinXIaoLei
 Fix For: 1.4.1


In yarn-client mode with spark.dynamicAllocation.enabled set to true, when the 
ApplicationMaster is dead or disconnected and tasks are submitted before a new 
ApplicationMaster starts, the spark-dynamic-executor-allocation thread throws 
an exception, so the dynamicAllocation feature is not supported.






[jira] [Updated] (SPARK-8974) The spark-dynamic-executor-allocation may be not supported

2015-07-10 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-8974:
-
Description: In yarn-client mode with spark.dynamicAllocation.enabled set to 
true, when the ApplicationMaster is dead or disconnected and tasks are 
submitted before a new ApplicationMaster starts, the 
spark-dynamic-executor-allocation thread throws an exception. When the 
ApplicationMaster is running and no tasks are running, the number of executors 
is not zero. So the dynamicAllocation feature is not supported.  (was: In 
yarn-client mode with spark.dynamicAllocation.enabled set to true, when the 
ApplicationMaster is dead or disconnected and tasks are submitted before a new 
ApplicationMaster starts, the spark-dynamic-executor-allocation thread throws 
an exception, so the dynamicAllocation feature is not supported.)

 The spark-dynamic-executor-allocation may be not supported
 --

 Key: SPARK-8974
 URL: https://issues.apache.org/jira/browse/SPARK-8974
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: KaiXinXIaoLei
 Fix For: 1.4.1


 In yarn-client mode with spark.dynamicAllocation.enabled set to true, when 
 the ApplicationMaster is dead or disconnected and tasks are submitted before a 
 new ApplicationMaster starts, the spark-dynamic-executor-allocation thread 
 throws an exception. When the ApplicationMaster is running and no tasks are 
 running, the number of executors is not zero. So the dynamicAllocation feature 
 is not supported.






[jira] [Commented] (SPARK-8843) DStream transform function receives null instead of RDD

2015-07-10 Thread Vincent Debergue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622071#comment-14622071
 ] 

Vincent Debergue commented on SPARK-8843:
-

Fine for me, I'll use the emptyRDD instead. Thanks for looking into that.

 DStream transform function receives null instead of RDD
 ---

 Key: SPARK-8843
 URL: https://issues.apache.org/jira/browse/SPARK-8843
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Vincent Debergue

 When using the {{transform}} function on a {{DStream}} with empty values, we 
 can get a {{NullPointerException}} 
 You can reproduce the issue with this piece of code in the spark-shell:
 {code}
 import org.apache.spark.streaming.dstream.InputDStream
 import org.apache.spark.streaming._
 import scala.reflect.ClassTag
 class EmptyDStream[T: scala.reflect.ClassTag](ssc: 
 org.apache.spark.streaming.StreamingContext) extends InputDStream[T](ssc) {
   override def compute(t: org.apache.spark.streaming.Time) = None
   override def start() = {}
   override def stop() = {}
 }
 val ssc = new StreamingContext(sc, Seconds(2))
 val in = new EmptyDStream[Int](ssc)
 val out = in.transform { rdd =>
   rdd.map(_ + 1) // rdd is in fact null here
 }
 out.print()
 ssc.start() // NullPointerException
 {code}
 This bug is very likely to come from the usage of {{orNull}} on the scala 
 {{Option}}: 
 https://github.com/apache/spark/blob/branch-1.4/streaming/src/main/scala/org/apache/spark/streaming/dstream/TransformedDStream.scala#L40
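
A minimal sketch of the workaround mentioned in the comment above, guarding against the null batch RDD by substituting an empty RDD (illustrative only; the underlying issue of passing null into transform is what this ticket tracks):
{code}
val out = in.transform { rdd =>
  // Work around the null RDD handed to transform on empty batches.
  val safe = if (rdd == null) ssc.sparkContext.emptyRDD[Int] else rdd
  safe.map(_ + 1)
}
{code}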






[jira] [Updated] (SPARK-8820) Add a configuration to set the checkpoint directory for convenience.

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8820:
-
Affects Version/s: (was: 1.5.0)
 Priority: Minor  (was: Major)

 Add a configuration to set the checkpoint directory for convenience.
 

 Key: SPARK-8820
 URL: https://issues.apache.org/jira/browse/SPARK-8820
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: SaintBacchus
Priority: Minor

 Add a configuration named *spark.streaming.checkpointDir*  to set the 
 checkpoint directory.
  It will be overwritten if the user also calls *StreamingContext#checkpoint*.
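
A hypothetical sketch of the proposed behaviour (spark.streaming.checkpointDir is the key proposed in this ticket, not an existing config; sparkConf is assumed to be the application's SparkConf): the configured value supplies a default checkpoint directory, and an explicit call still wins.
{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(10))
// Pick up the default from configuration if present...
sparkConf.getOption("spark.streaming.checkpointDir").foreach(ssc.checkpoint)
// ...but an explicit call made by the user overrides it.
ssc.checkpoint("/explicit/checkpoint/dir")
{code}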






[jira] [Resolved] (SPARK-7977) Disallow println

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7977.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7093
[https://github.com/apache/spark/pull/7093]

 Disallow println
 

 Key: SPARK-7977
 URL: https://issues.apache.org/jira/browse/SPARK-7977
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Reynold Xin
  Labels: starter
 Fix For: 1.5.0


 Very often we see pull requests that add println calls left over from 
 debugging, but the author forgot to remove them before code review.
 We can use the regex checker to disallow println. For legitimate uses of 
 println, we can then disable the rule where they occur.
 Add the following to the scalastyle-config.xml file:
 {code}
   <check customId="println" level="error" class="org.scalastyle.scalariform.TokenChecker" enabled="true">
     <parameters><parameter name="regex">^println$</parameter></parameters>
     <customMessage><![CDATA[Are you sure you want to println? If yes, wrap the code block with
       // scalastyle:off println
       println(...)
       // scalastyle:on println]]></customMessage>
   </check>
 {code}






[jira] [Commented] (SPARK-8960) Style cleanup of spark_ec2.py

2015-07-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622147#comment-14622147
 ] 

Sean Owen commented on SPARK-8960:
--

[~shivaram] [~nchammas] are you generally in favor of this idea, or not so sure 
about it? I hadn't heard an objection to it, and it may free everyone up to 
work on the code more rapidly. Do you want to push on the separate-repo idea? 
I'd support that.

 Style cleanup of spark_ec2.py
 -

 Key: SPARK-8960
 URL: https://issues.apache.org/jira/browse/SPARK-8960
 Project: Spark
  Issue Type: Task
  Components: EC2
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Trivial

 The spark_ec2.py script could use some cleanup I think. There are simple 
 style issues like mixing single and double quotes, but also some rather 
 un-Pythonic constructs (e.g. 
 https://github.com/apache/spark/pull/6336#commitcomment-12088624 that sparked 
 this JIRA). Whenever I read it, I always find something that is too minor for 
 a pull request/JIRA, but I'd fix it if it was my code. Perhaps we can address 
 such issues in this JIRA.
 The intention is not to introduce any behavioral changes. It's hard to verify 
 this without testing, so perhaps we should also add some kind of test.






[jira] [Resolved] (SPARK-8815) illegal java package names in jar

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8815.
--
Resolution: Not A Problem

 illegal java package names in jar
 -

 Key: SPARK-8815
 URL: https://issues.apache.org/jira/browse/SPARK-8815
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Sam Halliday
Priority: Minor

 In ENSIME we were unable to index the spark jars and we investigated 
 further... you have classes that look like this:
 org.spark-project.guava.annotations.VisibleForTesting
 Hyphens are not legal in package names according to the Java Language Specification, so 
 I'm amazed that this can actually be read at runtime... certainly no compiler 
 I know of would allow it.
 What I suspect is happening is that you're using a build plugin that 
 internalises some of your dependencies and it is using your groupId but not 
 validating it... and then blindly using that name in the ASM manipulation.
 You might want to report this upstream with your build plugin.
 For your next release, I recommend using an explicit name that is not your 
 groupId. i.e. convert hyphens to underscores as Gosling recommends.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8825) Spark build fails

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8825.
--
Resolution: Duplicate

 Spark build fails
 -

 Key: SPARK-8825
 URL: https://issues.apache.org/jira/browse/SPARK-8825
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
 Environment: Linux, Ubuntu 14.04
Reporter: Nicholas Brown

 Building spark (mvn install -DskipTests=true) is failing in the Spark Project 
 Core module.  The following error is being given:
 [ERROR]
  while compiling: 
 /home/nick/spark-1.4.0/core/src/main/scala/org/apache/spark/util/random/package.scala
 during phase: jvm
  library version: version 2.10.4
 compiler version: version 2.10.4
   reconstructed args: -deprecation -bootclasspath 
 /opt/jdk1.8.0_25/jre/lib/resources.jar:/opt/jdk1.8.0_25/jre/lib/rt.jar:/opt/jdk1.8.0_25/jre/lib/sunrsasign.jar:/opt/jdk1.8.0_25/jre/lib/jsse.jar:/opt/jdk1.8.0_25/jre/lib/jce.jar:/opt/jdk1.8.0_25/jre/lib/charsets.jar:/opt/jdk1.8.0_25/jre/lib/jfr.jar:/opt/jdk1.8.0_25/jre/classes:/home/nick/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
  -feature -classpath 
 

[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-10 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1466#comment-1466
 ] 

Daniel Darabos commented on SPARK-4879:
---

I wonder if this issue is serious enough to note in the documentation. What do 
you think about adding a big fat warning for speculative execution until it is 
fixed? Enabling speculative execution may lead to missing output files? Or 
perhaps add a verification pass that checks if all the outputs are present and 
raises an exception if not.

Silently dropping output files is a horrible bug. We've been debugging a 
somewhat mythological data corruption issue for about a month, and now we 
realize that this issue (SPARK-4879) is a very plausible explanation. We have 
never been able to reproduce it, but we have a log file, and it shows a 
speculative task for a {{saveAsNewAPIHadoopFile}} stage.
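For what it's worth, a minimal sketch of the kind of verification pass suggested above, assuming a text-file output directory with one part-* file per partition (the method name and parameters are illustrative, not an existing Spark API):
{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Hypothetical check to run right after rdd.saveAsTextFile(outputDir):
// count the part-* files and fail loudly if any partition's output is missing.
def verifyOutput(sc: SparkContext, outputDir: String, expectedParts: Int): Unit = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val partFiles = fs.listStatus(new Path(outputDir))
    .count(_.getPath.getName.startsWith("part-"))
  if (partFiles != expectedParts) {
    throw new java.io.IOException(
      s"Expected $expectedParts output partitions in $outputDir but found $partFiles")
  }
}
{code}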

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0

 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run really slow
     if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
       Thread.sleep(20 * 1000)
     }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 <console>:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 <console>:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
 org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}
 One interesting thing to note about this stack trace: if we look at 
 

[jira] [Updated] (SPARK-8974) The spark-dynamic-executor-allocation may be not supported

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8974:
-
Target Version/s:   (was: 1.4.0)
   Fix Version/s: (was: 1.4.1)

[~KaiXinXIaoLei] I'd ask again that you read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  Your 
JIRA doesn't make sense: it can't be Fixed for 1.4.1 already, since there is no 
change here. It can't target version 1.4.0, which has already been released.

 The spark-dynamic-executor-allocation may be not supported
 --

 Key: SPARK-8974
 URL: https://issues.apache.org/jira/browse/SPARK-8974
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: KaiXinXIaoLei

 In yarn-client mode with the config option spark.dynamicAllocation.enabled set 
 to true, if tasks are submitted while the ApplicationMaster is dead or 
 disconnected, before a new ApplicationMaster starts, the 
 spark-dynamic-executor-allocation thread will throw an exception. Then, when the 
 ApplicationMaster is running and no tasks are running, the number of 
 executors is not zero, so the dynamic allocation feature does not work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-10 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1466#comment-1466
 ] 

Daniel Darabos edited comment on SPARK-4879 at 7/10/15 12:25 PM:
-

I wonder if this issue is serious enough to note in the documentation. What do 
you think about adding a big fat warning for speculative execution until it is 
fixed? Enabling speculative execution may lead to missing output files? Or 
perhaps add a verification pass that checks if all the outputs are present and 
raises an exception if not.

Silently dropping output files is a horrible bug. We've been debugging a 
somewhat mythological data corruption issue for about a month, and now we 
realize that this issue (SPARK-4879) is a very plausible explanation. We have 
never been able to reproduce it, but we have a log file, and it shows a 
speculative task for a {{saveAsNewAPIHadoopFile}} stage.


was (Author: darabos):
I wonder if this issue is serious enough to note in the documentation. What do 
you think about adding a big fat warning for speculative execution until it is 
fixed? Enabling speculative execution may lead to missing output files? Or 
perhaps add verification pass that checks if all the outputs are present and 
raises an exception if not.

Silently dropping output files is a horrible bug. We've been debugging a 
somewhat mythological data corruption issue for about a month, and now we 
realize that this issue (SPARK-4879) is a very plausible explanation. We have 
never been able to reproduce it, but we have a log file, and it shows a 
speculative task for a {{saveAsNewAPIHadoopFile}} stage.

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0

 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run really slow
     if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
       Thread.sleep(20 * 1000)
     }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 <console>:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 <console>:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
 org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 

[jira] [Assigned] (SPARK-8995) Cast date strings with date, date and time and just time information to DateType and TimestampType

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8995:
---

Assignee: (was: Apache Spark)

 Cast date strings with date, date and time and just time information to 
 DateType and TimestampType
 --

 Key: SPARK-8995
 URL: https://issues.apache.org/jira/browse/SPARK-8995
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Tarek Auel

 Tests of https://github.com/apache/spark/pull/6981 fail, because we cannot 
 cast strings like '13:18:08' to a valid date and extract the hours later. 
 It's not possible to parse strings that contain both date and time 
 information, like '2015-03-18 12:25:49', to a date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7735) Raise Exception on non-zero exit from pyspark pipe commands

2015-07-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-7735:
--
Assignee: (was: Davies Liu)

 Raise Exception on non-zero exit from pyspark pipe commands
 ---

 Key: SPARK-7735
 URL: https://issues.apache.org/jira/browse/SPARK-7735
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0, 1.3.1
Reporter: Scott Taylor
Priority: Minor
  Labels: newbie, patch
 Fix For: 1.5.0


 In pyspark errors are ignored when using the rdd.pipe function. This is 
 different to the scala behaviour where abnormal exit of the piped command is 
 raised. I have submitted a pull request on github which I believe will bring 
 the pyspark behaviour closer to the scala behaviour.
 A simple case of where this bug may be problematic is using a network bash 
 utility to perform computations on an rdd. Currently, network errors will be 
 ignored and blank results returned when it would be more desirable to raise 
 an exception so that spark can retry the failed task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8995) Cast date strings with date, date and time and just time information to DateType and TimestampType

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8995:
---

Assignee: Apache Spark

 Cast date strings with date, date and time and just time information to 
 DateType and TimestampType
 --

 Key: SPARK-8995
 URL: https://issues.apache.org/jira/browse/SPARK-8995
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Tarek Auel
Assignee: Apache Spark

 Tests of https://github.com/apache/spark/pull/6981 fail, because we cannot 
 cast strings like '13:18:08' to a valid date and extract the hours later. 
 It's not possible to parse strings that contain both date and time 
 information, like '2015-03-18 12:25:49', to a date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8995) Cast date strings with date, date and time and just time information to DateType and TimestampType

2015-07-10 Thread Tarek Auel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarek Auel updated SPARK-8995:
--
Description: Tests of https://github.com/apache/spark/pull/6981 fail, 
because we can not cast strings like '13:18:08' to a valid date and extract the 
hours later. It's not possible to parse strings that contains date and time 
information to date, like '2015-03-18 12:25:49'  (was: Tests of 
https://github.com/apache/spark/pull/6981 fails, because we can not cast 
strings like '13:18:08' to a valid date and extract the hours later. It's not 
possible to parse strings that contains date and time information to date, like 
'2015-03-18 12:25:49')

 Cast date strings with date, date and time and just time information to 
 DateType and TimestampType
 --

 Key: SPARK-8995
 URL: https://issues.apache.org/jira/browse/SPARK-8995
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Tarek Auel

 Tests of https://github.com/apache/spark/pull/6981 fail, because we cannot 
 cast strings like '13:18:08' to a valid date and extract the hours later. 
 It's not possible to parse strings that contain both date and time 
 information, like '2015-03-18 12:25:49', to a date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan

2015-07-10 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8998:


 Summary: Collect enough frequent prefixes before projection in 
PrefixSpan
 Key: SPARK-8998
 URL: https://issues.apache.org/jira/browse/SPARK-8998
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Zhang JiaJin


The implementation in SPARK-6487 might have scalability issues when the number 
of frequent items is very small. In this case, we can generate candidate sets 
of higher orders using Apriori-like algorithms and count them, until we collect 
enough prefixes.
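
As a rough illustration of the Apriori-style step described above (names and types are assumptions, not the MLlib implementation), higher-order candidates can be formed by extending each collected frequent prefix with every frequent item and then counting them against the data:
{code}
// Illustrative sketch only: extend each frequent k-prefix by one frequent item
// to form the (k+1)-candidates that would then be counted.
def nextCandidates(freqPrefixes: Set[List[Int]], freqItems: Set[Int]): Set[List[Int]] =
  for {
    prefix <- freqPrefixes
    item <- freqItems
  } yield prefix :+ item
{code}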



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8974) The spark-dynamic-executor-allocation may be not supported

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8974:
---

Assignee: (was: Apache Spark)

 The spark-dynamic-executor-allocation may be not supported
 --

 Key: SPARK-8974
 URL: https://issues.apache.org/jira/browse/SPARK-8974
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: KaiXinXIaoLei
 Fix For: 1.5.0


 In yarn-client mode with the config option spark.dynamicAllocation.enabled set 
 to true, if tasks are submitted while the ApplicationMaster is dead or 
 disconnected, before a new ApplicationMaster starts, the 
 spark-dynamic-executor-allocation thread will throw an exception. Then, when the 
 ApplicationMaster is running and no tasks are running, the number of 
 executors is not zero, so the dynamic allocation feature does not work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8598) Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs

2015-07-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8598.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6994
[https://github.com/apache/spark/pull/6994]

 Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs
 ---

 Key: SPARK-8598
 URL: https://issues.apache.org/jira/browse/SPARK-8598
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Jose Cambronero
Assignee: Jose Cambronero
Priority: Minor
 Fix For: 1.5.0


 We have implemented a 1-sample, two-sided version of the Kolmogorov Smirnov 
 test, which tests the null hypothesis that the sample comes from a given 
 continuous distribution. We provide various functions to access the 
 functionality: namely, a function that takes an RDD[Double] of the data and a 
 lambda to calculate the CDF, a function that takes an RDD[Double] and an 
 Iterator[(Double, Double, Double)] => Iterator[Double] which uses mapPartitions 
 to provide an optimized way to perform the calculation when the CDF 
 calculation requires a non-serializable object (e.g. the apache math commons 
 real distributions), and finally a function that takes an RDD[Double] and a 
 String name of the theoretical distribution to be used. The appropriate 
 result class has been added, as well as tests to the HypothesisTestSuite
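 For reference, usage along the lines described above would look roughly like this (a sketch against the 1.5.0 API as described; treat the exact signatures as assumptions):
 {code}
 import org.apache.spark.mllib.stat.Statistics
 import org.apache.spark.rdd.RDD

 val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))
 // Variant that takes a lambda for the CDF (here, the CDF of Uniform(0, 1)):
 val resultWithCdf = Statistics.kolmogorovSmirnovTest(data, (x: Double) => x)
 // Variant that takes the name of a theoretical distribution and its parameters:
 val resultWithName = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
 println(resultWithCdf)
 {code}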



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8997) Improve LocalPrefixSpan performance

2015-07-10 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8997:


 Summary: Improve LocalPrefixSpan performance
 Key: SPARK-8997
 URL: https://issues.apache.org/jira/browse/SPARK-8997
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Feynman Liang


We can improve the performance by:

1. run should output an Iterator instead of an Array
2. The local count shouldn't use groupBy, which creates too many arrays. We can use 
PrimitiveKeyOpenHashMap instead (see the sketch after this list).
3. We can use a list to avoid materializing frequent sequences
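
A minimal sketch of the counting change in point 2 (illustrative only; a plain mutable.HashMap stands in for Spark's internal PrimitiveKeyOpenHashMap, and the method name is hypothetical):
{code}
import scala.collection.mutable

// Count item frequencies without groupBy and return an Iterator, not an Array.
def countItems(sequences: Iterator[Array[Int]], minCount: Long): Iterator[(Int, Long)] = {
  val counts = mutable.HashMap.empty[Int, Long]
  sequences.foreach { seq =>
    seq.distinct.foreach { item =>
      counts(item) = counts.getOrElse(item, 0L) + 1L
    }
  }
  counts.iterator.filter(_._2 >= minCount)
}
{code}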



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib

2015-07-10 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623218#comment-14623218
 ] 

Xiangrui Meng commented on SPARK-6487:
--

Please check linked JIRAs for follow-up work.

 Add sequential pattern mining algorithm to Spark MLlib
 --

 Key: SPARK-6487
 URL: https://issues.apache.org/jira/browse/SPARK-6487
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Zhang JiaJin
Assignee: Zhang JiaJin
Priority: Critical
 Fix For: 1.5.0


 [~mengxr] [~zhangyouhua]
 Sequential pattern mining is an important branch of pattern mining. In 
 our past work, we used sequence mining (mainly the PrefixSpan 
 algorithm) to find telecommunication signaling sequence patterns and achieved 
 good results. But once the data gets too large, the running time is too long and 
 cannot even meet the service requirements. We are ready to implement the 
 PrefixSpan algorithm in Spark and apply it to our subsequent work. 
 The related Paper: 
 PrefixSpan: 
 Pei, Jian, et al. Mining sequential patterns by pattern-growth: The 
 prefixspan approach. Knowledge and Data Engineering, IEEE Transactions on 
 16.11 (2004): 1424-1440.
 Parallel Algorithm: 
 Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed 
 sequential patterns. Proceedings of the eleventh ACM SIGKDD international 
 conference on Knowledge discovery in data mining. ACM, 2005.
 Distributed Algorithm: 
 Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan 
 algorithm based on MapReduce. Information Technology in Medicine and 
 Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012.
 Pattern mining and sequential mining Knowledge: 
 Han, Jiawei, et al. Frequent pattern mining: current status and future 
 directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8999) Support non-temporal sequence in PrefixSpan

2015-07-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8999:
-
Description: In SPARK-6487, we assume that all items are ordered. However, 
we should support non-temporal sequences in PrefixSpan. This should be done 
before 1.5 because it changes PrefixSpan APIs.  (was: In SPARK-6487, we assume 
that all items are ordered. However, we should support non-temporal sequences 
in PrefixSpan.)

 Support non-temporal sequence in PrefixSpan
 ---

 Key: SPARK-8999
 URL: https://issues.apache.org/jira/browse/SPARK-8999
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Priority: Critical

 In SPARK-6487, we assume that all items are ordered. However, we should 
 support non-temporal sequences in PrefixSpan. This should be done before 1.5 
 because it changes PrefixSpan APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8999) Support non-temporal sequence in PrefixSpan

2015-07-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8999:
-
Description: In SPARK-6487, we assume that all items are ordered. However, 
we should support non-temporal sequences in PrefixSpan.

 Support non-temporal sequence in PrefixSpan
 ---

 Key: SPARK-8999
 URL: https://issues.apache.org/jira/browse/SPARK-8999
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Priority: Critical

 In SPARK-6487, we assume that all items are ordered. However, we should 
 support non-temporal sequences in PrefixSpan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8999) Support non-temporal sequence in PrefixSpan

2015-07-10 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8999:


 Summary: Support non-temporal sequence in PrefixSpan
 Key: SPARK-8999
 URL: https://issues.apache.org/jira/browse/SPARK-8999
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8962) Disallow Class.forName

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8962:
---

Assignee: (was: Apache Spark)

 Disallow Class.forName
 --

 Key: SPARK-8962
 URL: https://issues.apache.org/jira/browse/SPARK-8962
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Josh Rosen

 We should add a regex rule to Scalastyle which prohibits the use of 
 {{Class.forName}}.  We should not use Class.forName directly because this 
 will load classes from the system's default class loader rather than the 
 appropriate context class loader.  Instead, we should be calling 
 Utils.classForName.
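 A rule along the lines of the println check in SPARK-7977 could look roughly like this (a sketch; the customId, regex, and message wording are assumptions):
 {code}
 <check customId="classforname" level="error"
        class="org.scalastyle.file.RegexChecker" enabled="true">
   <parameters><parameter name="regex">Class\.forName</parameter></parameters>
   <customMessage><![CDATA[Are you sure you want Class.forName? In most cases
     Utils.classForName should be used instead; otherwise wrap the call with
     // scalastyle:off classforname ... // scalastyle:on classforname]]></customMessage>
 </check>
 {code}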



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8962) Disallow Class.forName

2015-07-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623067#comment-14623067
 ] 

Apache Spark commented on SPARK-8962:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7350

 Disallow Class.forName
 --

 Key: SPARK-8962
 URL: https://issues.apache.org/jira/browse/SPARK-8962
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Josh Rosen

 We should add a regex rule to Scalastyle which prohibits the use of 
 {{Class.forName}}.  We should not use Class.forName directly because this 
 will load classes from the system's default class loader rather than the 
 appropriate context class loader.  Instead, we should be calling 
 Utils.classForName.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8834) Throttle DStreams dynamically through back-pressure information

2015-07-10 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

François Garillot closed SPARK-8834.

Resolution: Duplicate

 Throttle DStreams dynamically through back-pressure information
 ---

 Key: SPARK-8834
 URL: https://issues.apache.org/jira/browse/SPARK-8834
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: François Garillot

 This aims to have Spark Streaming be more resilient to high-throughput 
 situations through back-pressure signaling and dynamic throttling. 
 The design doc can be found here:
 https://issues.apache.org/jira/browse/SPARK-8834
 An (outdated) [PoC 
 implementation|https://github.com/typesafehub/spark/pull/13] exists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib

2015-07-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6487.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7258
[https://github.com/apache/spark/pull/7258]

 Add sequential pattern mining algorithm to Spark MLlib
 --

 Key: SPARK-6487
 URL: https://issues.apache.org/jira/browse/SPARK-6487
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Zhang JiaJin
Assignee: Zhang JiaJin
Priority: Critical
 Fix For: 1.5.0


 [~mengxr] [~zhangyouhua]
 Sequential pattern mining is an important branch of pattern mining. In 
 our past work, we used sequence mining (mainly the PrefixSpan 
 algorithm) to find telecommunication signaling sequence patterns and achieved 
 good results. But once the data gets too large, the running time is too long and 
 cannot even meet the service requirements. We are ready to implement the 
 PrefixSpan algorithm in Spark and apply it to our subsequent work. 
 The related Paper: 
 PrefixSpan: 
 Pei, Jian, et al. Mining sequential patterns by pattern-growth: The 
 prefixspan approach. Knowledge and Data Engineering, IEEE Transactions on 
 16.11 (2004): 1424-1440.
 Parallel Algorithm: 
 Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed 
 sequential patterns. Proceedings of the eleventh ACM SIGKDD international 
 conference on Knowledge discovery in data mining. ACM, 2005.
 Distributed Algorithm: 
 Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan 
 algorithm based on MapReduce. Information Technology in Medicine and 
 Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012.
 Pattern mining and sequential mining Knowledge: 
 Han, Jiawei, et al. Frequent pattern mining: current status and future 
 directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8994) Tiny cleanups to Params, Pipeline

2015-07-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8994.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7349
[https://github.com/apache/spark/pull/7349]

 Tiny cleanups to Params, Pipeline
 -

 Key: SPARK-8994
 URL: https://issues.apache.org/jira/browse/SPARK-8994
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Trivial
 Fix For: 1.5.0


 Small cleanups per remaining comments in 
 [https://github.com/apache/spark/pull/5820] which resolved [SPARK-5956]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch

2015-07-10 Thread Jesper Lundgren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623226#comment-14623226
 ] 

Jesper Lundgren edited comment on SPARK-8941 at 7/11/15 4:31 AM:
-

Maybe it is better to close this issue and open a new one for the API change 
and the documentation issues.

I'll try to review some of the issues we had with the stand alone cluster and 
see if I should create JIRA tickets for some of them.

ex, when using supervised mode in HA cluster, there is not a well documented 
procedure to force stop and disable restart of a driver (in case the driver 
exits with the wrong exit code). I know of the kill command
bin/spark-class org.apache.spark.deploy.Client kill
But in my experience it does not always work.




was (Author: koudelka):
Maybe it is better to close this issue and open a new one for the API change 
and the documentation issues.

I'll probably try to review some of the issues we had with the stand alone 
cluster and see if I should create JIRA tickets for some of them.

ex, when using supervised mode in HA cluster, there is not a well documented 
procedure to force stop and disable restart of a driver (in case the driver 
exits with the wrong exit code). I know of the kill command
bin/spark-class org.apache.spark.deploy.Client kill
But in my experience it does not always work.



 Standalone cluster worker does not accept multiple masters on launch
 

 Key: SPARK-8941
 URL: https://issues.apache.org/jira/browse/SPARK-8941
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Documentation
Affects Versions: 1.4.0, 1.4.1
Reporter: Jesper Lundgren
Priority: Critical

 Before 1.4 it was possible to launch a worker node using a comma separated 
 list of master nodes. 
 ex:
 sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078
 starting org.apache.spark.deploy.worker.Worker, logging to 
 /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out
 failed to launch org.apache.spark.deploy.worker.Worker:
  Default is conf/spark-defaults.conf.
   15/07/09 12:33:06 INFO Utils: Shutdown hook called
 Spark 1.2 and 1.3.1 accept multiple masters in this format.
 update: start-slave.sh only expects master lists in 1.4 (no instance number)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch

2015-07-10 Thread Jesper Lundgren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623226#comment-14623226
 ] 

Jesper Lundgren edited comment on SPARK-8941 at 7/11/15 4:31 AM:
-

Maybe it is better to close this issue and open a new one for the API change 
and the documentation issues.

I'll try to review some of the issues we have had with the stand alone cluster 
to see if I should create JIRA tickets for some of them.

ex, when using supervised mode in HA cluster, there is not a well documented 
procedure to force stop and disable restart of a driver (in case the driver 
exits with the wrong exit code). I know of the kill command
bin/spark-class org.apache.spark.deploy.Client kill
But in my experience it does not always work.




was (Author: koudelka):
Maybe it is better to close this issue and open a new one for the API change 
and the documentation issues.

I'll try to review some of the issues we had with the stand alone cluster and 
see if I should create JIRA tickets for some of them.

ex, when using supervised mode in HA cluster, there is not a well documented 
procedure to force stop and disable restart of a driver (in case the driver 
exits with the wrong exit code). I know of the kill command
bin/spark-class org.apache.spark.deploy.Client kill
But in my experience it does not always work.



 Standalone cluster worker does not accept multiple masters on launch
 

 Key: SPARK-8941
 URL: https://issues.apache.org/jira/browse/SPARK-8941
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Documentation
Affects Versions: 1.4.0, 1.4.1
Reporter: Jesper Lundgren
Priority: Critical

 Before 1.4 it was possible to launch a worker node using a comma separated 
 list of master nodes. 
 ex:
 sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078
 starting org.apache.spark.deploy.worker.Worker, logging to 
 /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out
 failed to launch org.apache.spark.deploy.worker.Worker:
  Default is conf/spark-defaults.conf.
   15/07/09 12:33:06 INFO Utils: Shutdown hook called
 Spark 1.2 and 1.3.1 accept multiple masters in this format.
 update: start-slave.sh only expects master lists in 1.4 (no instance number)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-07-10 Thread Andrew Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623230#comment-14623230
 ] 

Andrew Lee commented on SPARK-6882:
---

I don't think updating spark-env.sh {{SPARK_CLASSPATH}} will be a good idea 
since this conflicts with {{--driver-class-path}} in yarn-client mode.
But if this is the current workaround, I can specify it with a different 
directory with SPARK_CONF_DIR just to get it up and running.

Regarding Bin's approach, I believe you will need to enable 
{{spark.yarn.user.classpath.first}} according to SPARK-939, but I think it 
should be picking up the user JAR by default now, isn't it?

 Spark ThriftServer2 Kerberos failed encountering 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 

 Key: SPARK-6882
 URL: https://issues.apache.org/jira/browse/SPARK-6882
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.3.0, 1.4.0
 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled
 * Apache Hive 0.13.1
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
Reporter: Andrew Lee

 When Kerberos is enabled, I get the following exceptions. 
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 {code}
 I tried it in
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
 with
 * Apache Hive 0.13.1
 * Apache Hadoop 2.4.1
 Build command
 {code}
 mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver 
 -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests 
 install
 {code}
 When starting Spark ThriftServer in {{yarn-client}} mode, the command to 
 start thriftserver looks like this
 {code}
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf 
 hive.server2.thrift.bind.host=$(hostname) --master yarn-client
 {code}
 {{hostname}} points to the current hostname of the machine I'm using.
 Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
 at 
 org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 I'm wondering if this is due to the same problem described in HIVE-8154 and 
 HIVE-7620, because of an older code base for the Spark ThriftServer?
 Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to 
 run against a Kerberos cluster (Apache 2.4.1).
 My hive-site.xml looks like the following for spark/conf.
 The kerberos keytab and tgt are configured correctly, I'm able to connect to 
 metastore, but the subsequent steps failed due to the exception.
 {code}
 <property>
   <name>hive.semantic.analyzer.factory.impl</name>
   <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
 </property>
 <property>
   <name>hive.metastore.execute.setugi</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.stats.autogather</name>
   <value>false</value>
 </property>
 <property>
   <name>hive.session.history.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.querylog.location</name>
   <value>/tmp/home/hive/log/${user.name}</value>
 </property>
 <property>
   <name>hive.exec.local.scratchdir</name>
   <value>/tmp/hive/scratch/${user.name}</value>
 </property>
 <property>
   <name>hive.metastore.uris</name>
   <value>thrift://somehostname:9083</value>
 </property>
 <!-- HIVE SERVER 2 -->
 <property>
   <name>hive.server2.authentication</name>
   <value>KERBEROS</value>
 </property>
 <property>
   <name>hive.server2.authentication.kerberos.principal</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.server2.authentication.kerberos.keytab</name>
   <value>***</value>
 </property>
 <property>
   <name>hive.server2.thrift.sasl.qop</name>
   <value>auth</value>
   <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
 </property>
 <property>
   

[jira] [Comment Edited] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-07-10 Thread Andrew Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623230#comment-14623230
 ] 

Andrew Lee edited comment on SPARK-6882 at 7/11/15 4:35 AM:


I don't think updating spark-env.sh {{SPARK_CLASSPATH}} will be a good idea 
since this conflicts with {{--driver-class-path}} in yarn-client mode.
But if this is the current workaround, I can specify it with a different 
directory with SPARK_CONF_DIR just to get it up and running.

Regarding Bin's approach, I believe you will need to enable 
{{spark.yarn.user.classpath.first}} according to SPARK-939, but I think it 
should be picking up the user JAR by default now, isn't it?


was (Author: alee526):
I don't think updating spark-env.sh {{SPARK_CLASSPATH}} will be a good idea 
since this conflicts with {{--driver-class-path}} in yarn-client mode.
But if this is the current work around, I can specify it with a different 
directory with SPARK_CONF_DIR just to get it up and running.

Regarding Bin's approach, I believe you will need to enable 
{{spark.yarn.user.classpath.first}} according to SPARK-939, but I think it 
should be picking up user JAR y default now, isn't?

 Spark ThriftServer2 Kerberos failed encountering 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 

 Key: SPARK-6882
 URL: https://issues.apache.org/jira/browse/SPARK-6882
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1, 1.3.0, 1.4.0
 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled
 * Apache Hive 0.13.1
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
Reporter: Andrew Lee

 When Kerberos is enabled, I get the following exceptions. 
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 {code}
 I tried it in
 * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
 with
 * Apache Hive 0.13.1
 * Apache Hadoop 2.4.1
 Build command
 {code}
 mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver 
 -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests 
 install
 {code}
 When starting Spark ThriftServer in {{yarn-client}} mode, the command to 
 start thriftserver looks like this
 {code}
 ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf 
 hive.server2.thrift.bind.host=$(hostname) --master yarn-client
 {code}
 {{hostname}} points to the current hostname of the machine I'm using.
 Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
 {code}
 2015-03-13 18:26:05,363 ERROR 
 org.apache.hive.service.cli.thrift.ThriftCLIService 
 (ThriftBinaryCLIService.java:run(93)) - Error: 
 java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
 are: [auth-int, auth-conf, auth]
 at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
 at 
 org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
 at 
 org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 I'm wondering if this is due to the same problem described in HIVE-8154 and 
 HIVE-7620, because of an older code base for the Spark ThriftServer?
 Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to 
 run against a Kerberos cluster (Apache 2.4.1).
 My hive-site.xml looks like the following for spark/conf.
 The kerberos keytab and tgt are configured correctly, I'm able to connect to 
 metastore, but the subsequent steps failed due to the exception.
 {code}
 <property>
   <name>hive.semantic.analyzer.factory.impl</name>
   <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
 </property>
 <property>
   <name>hive.metastore.execute.setugi</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.stats.autogather</name>
   <value>false</value>
 </property>
 <property>
   <name>hive.session.history.enabled</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.querylog.location</name>
   <value>/tmp/home/hive/log/${user.name}</value>
 </property>
 <property>
   <name>hive.exec.local.scratchdir</name>
   <value>/tmp/hive/scratch/${user.name}</value>
 </property>
 <property>
   

[jira] [Created] (SPARK-9000) Support generic item type in PrefixSpan

2015-07-10 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9000:


 Summary: Support generic item type in PrefixSpan
 Key: SPARK-9000
 URL: https://issues.apache.org/jira/browse/SPARK-9000
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Priority: Critical


In SPARK-6487, we only support Int type. It requires users to encode other 
types into integer to use PrefixSpan. We should be able to do this inside 
PrefixSpan, similar to FPGrowth.
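
A rough sketch of the manual encoding this change would remove (illustrative names; not the MLlib API):
{code}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Map arbitrary item types to Ints (and keep the reverse mapping for decoding),
// which is what users currently have to do by hand before calling PrefixSpan.
def encodeItems[T: ClassTag](sequences: RDD[Array[T]]): (RDD[Array[Int]], Map[Int, T]) = {
  val itemToCode: Map[T, Int] =
    sequences.flatMap(_.distinct).distinct().collect().zipWithIndex.toMap
  val encoded = sequences.map(_.map(itemToCode))
  (encoded, itemToCode.map(_.swap))
}
{code}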



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9000) Support generic item type in PrefixSpan

2015-07-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9000:
-
Description: In SPARK-6487, we only support Int type. It requires users to 
encode other types into integer to use PrefixSpan. We should be able to do this 
inside PrefixSpan, similar to FPGrowth. This should be done before 1.5 since it 
changes APIs.  (was: In SPARK-6487, we only support Int type. It requires users 
to encode other types into integer to use PrefixSpan. We should be able to do 
this inside PrefixSpan, similar to FPGrowth.)

 Support generic item type in PrefixSpan
 ---

 Key: SPARK-9000
 URL: https://issues.apache.org/jira/browse/SPARK-9000
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Priority: Critical

 In SPARK-6487, we only support Int type. It requires users to encode other 
 types into integer to use PrefixSpan. We should be able to do this inside 
 PrefixSpan, similar to FPGrowth. This should be done before 1.5 since it 
 changes APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8835) Provide pluggable Congestion Strategies to deal with Streaming load

2015-07-10 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-8835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

François Garillot updated SPARK-8835:
-
Description: 
Second part of [SPARK-7398|https://issues.apache.org/jira/browse/SPARK-7398] 
(which has an over-arching, high-level design doc).

An (outdated) [PoC 
implementation|https://github.com/huitseeker/spark/tree/ReactiveStreamingBackPressureControl/]
 exists.

  was:
Second part of [SPARK-7398|https://issues.apache.org/jira/browse/SPARK-7398] 
(which has an over-arching, high-level design doc).

An (outdated) [PoC implementation|https://github.com/typesafehub/spark/pull/13] 
exists.


 Provide pluggable Congestion Strategies to deal with Streaming load
 ---

 Key: SPARK-8835
 URL: https://issues.apache.org/jira/browse/SPARK-8835
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: François Garillot

 Second part of [SPARK-7398|https://issues.apache.org/jira/browse/SPARK-7398] 
 (which has an over-arching, high-level design doc).
 An (outdated) [PoC 
 implementation|https://github.com/huitseeker/spark/tree/ReactiveStreamingBackPressureControl/]
  exists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8986) GaussianMixture should take smoothing param

2015-07-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623071#comment-14623071
 ] 

Joseph K. Bradley commented on SPARK-8986:
--

Thanks for looking around some!  I was not really thinking of anything fancy.  
I was hoping existing libraries would do something like add a small constant to 
the diagonal of the covariance matrix of each Gaussian.  If there is no 
standard to follow, we could just do that.

It'd be interesting to investigate fancier approaches in another JIRA.
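
Concretely, the simple approach would be something like this (a sketch using Breeze; the epsilon value and method name are illustrative):
{code}
import breeze.linalg.DenseMatrix

// Add a small constant to the diagonal of a Gaussian's covariance matrix so it
// stays well-conditioned even for degenerate data or bad initializations.
def smoothCovariance(cov: DenseMatrix[Double], eps: Double = 1e-6): DenseMatrix[Double] =
  cov + (DenseMatrix.eye[Double](cov.rows) * eps)
{code}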

 GaussianMixture should take smoothing param
 ---

 Key: SPARK-8986
 URL: https://issues.apache.org/jira/browse/SPARK-8986
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 144h
  Remaining Estimate: 144h

 Gaussian mixture models should take a smoothing parameter which makes the 
 algorithm robust against degenerate data or bad initializations.
 Whomever takes this JIRA should look at other libraries (sklearn, R packages, 
 Weka, etc.) to see how they do smoothing and what their API looks like.  
 Please summarize your findings here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8996) Add Python API for Kolmogorov-Smirnov Test

2015-07-10 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-8996:


 Summary: Add Python API for Kolmogorov-Smirnov Test
 Key: SPARK-8996
 URL: https://issues.apache.org/jira/browse/SPARK-8996
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng


Add Python API for the Kolmogorov-Smirnov test implemented in SPARK-8598. It 
should be similar to ChiSqTest in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch

2015-07-10 Thread Jesper Lundgren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623226#comment-14623226
 ] 

Jesper Lundgren commented on SPARK-8941:


Maybe it is better to close this issue and open a new one for the API change 
and the documentation issues.

I'll probably try to review some of the issues we had with the stand alone 
cluster and see if I should create JIRA tickets for some of them.

ex, when using supervised mode in HA cluster, there is not a well documented 
procedure to force stop and disable restart of a driver (in case the driver 
exits with the wrong exit code). I know of the kill command
bin/spark-class org.apache.spark.deploy.Client kill
But in my experience it does not always work.



 Standalone cluster worker does not accept multiple masters on launch
 

 Key: SPARK-8941
 URL: https://issues.apache.org/jira/browse/SPARK-8941
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Documentation
Affects Versions: 1.4.0, 1.4.1
Reporter: Jesper Lundgren
Priority: Critical

 Before 1.4 it was possible to launch a worker node using a comma separated 
 list of master nodes. 
 ex:
 sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078
 starting org.apache.spark.deploy.worker.Worker, logging to 
 /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out
 failed to launch org.apache.spark.deploy.worker.Worker:
  Default is conf/spark-defaults.conf.
   15/07/09 12:33:06 INFO Utils: Shutdown hook called
 Spark 1.2 and 1.3.1 accept multiple masters in this format.
 Update: in 1.4, start-slave.sh only expects a master list (no instance number).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8962) Disallow Class.forName

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8962:
---

Assignee: Apache Spark

 Disallow Class.forName
 --

 Key: SPARK-8962
 URL: https://issues.apache.org/jira/browse/SPARK-8962
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Josh Rosen
Assignee: Apache Spark

 We should add a regex rule to Scalastyle which prohibits the use of 
 {{Class.forName}}.  We should not use Class.forName directly because this 
 will load classes from the system's default class loader rather than the 
 appropriate context loader.  Instead, we should call Utils.classForName.
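 As a quick illustration of the pattern the rule would enforce (a sketch only: Utils is private[spark], so this applies inside Spark's own source tree, and the package and object names below are made up):
 {code}
 package org.apache.spark.demo  // any package under org.apache.spark can see the private[spark] Utils

 import org.apache.spark.util.Utils

 object ClassLoadingExample {
   // Discouraged: resolves against the default/system class loader, which may
   // not see classes added through user class paths.
   def viaClassForName(name: String): Class[_] = Class.forName(name)

   // Preferred: goes through Spark's Utils.classForName, which uses the
   // appropriate context class loader.
   def viaUtils(name: String): Class[_] = Utils.classForName(name)
 }
 {code}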



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6684) Add checkpointing to GradientBoostedTrees

2015-07-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623095#comment-14623095
 ] 

Joseph K. Bradley commented on SPARK-6684:
--

I have heard this may be an issue for some users who use many iterations.

 Add checkpointing to GradientBoostedTrees
 -

 Key: SPARK-6684
 URL: https://issues.apache.org/jira/browse/SPARK-6684
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 We should add checkpointing to GradientBoostedTrees since it maintains RDDs 
 with long lineages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8977) Define the RateEstimator interface, and implement the ReceiverRateController

2015-07-10 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-8977:


 Summary: Define the RateEstimator interface, and implement the 
ReceiverRateController
 Key: SPARK-8977
 URL: https://issues.apache.org/jira/browse/SPARK-8977
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Iulian Dragos
 Fix For: 1.5.0


Full [design 
doc|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing]

Implement a rate controller for receiver-based InputDStreams that estimates a 
maximum rate and sends it to each receiver supervisor.
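
A rough sketch of the shape such an interface and controller might take (every name and signature below is an assumption; the actual definitions are in the design doc and the eventual PR):

{code}
// Estimates a new maximum ingestion rate (records/sec) from batch statistics.
trait RateEstimator extends Serializable {
  def compute(
      time: Long,
      numElements: Long,
      processingDelayMs: Long,
      schedulingDelayMs: Long): Option[Double]
}

// A receiver-based rate controller: on batch completion, run the estimator
// and push any new bound to the receiver supervisors.
abstract class RateController(streamId: Int, estimator: RateEstimator) {
  protected def publish(newRate: Double): Unit  // e.g. via the ReceiverTracker

  def onBatchCompleted(time: Long, elems: Long, procMs: Long, schedMs: Long): Unit =
    estimator.compute(time, elems, procMs, schedMs).foreach(publish)
}
{code}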



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8013) Get JDBC server working with Scala 2.11

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8013.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6903
[https://github.com/apache/spark/pull/6903]

 Get JDBC server working with Scala 2.11
 ---

 Key: SPARK-8013
 URL: https://issues.apache.org/jira/browse/SPARK-8013
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Patrick Wendell
Assignee: Iulian Dragos
Priority: Critical
 Fix For: 1.5.0


 It's worth some investigation here, but I believe the simplest solution is to 
 see if we can get Scala to shade its use of JLine to avoid JLine conflicts 
 between Hive and the Spark repl.
 It's also possible that there is a simpler internal solution to the conflict 
 (I haven't looked at it in a long time). So doing some investigation of that 
 would be good. IIRC, there is use of Jline in our own repl code, in addition 
 to in Hive and also in the Scala 2.11 repl. Back when we created the 2.11 
 build I couldn't harmonize all the versions in a nice way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8980) Setup cluster with spark-ec2 scripts as non-root user

2015-07-10 Thread Mathieu DESPRIEE (JIRA)
Mathieu DESPRIEE created SPARK-8980:
---

 Summary: Setup cluster with spark-ec2 scripts as non-root user
 Key: SPARK-8980
 URL: https://issues.apache.org/jira/browse/SPARK-8980
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.4.0
Reporter: Mathieu DESPRIEE
Priority: Minor


The spark-ec2 scripts install everything as root, which is not a best practice.
Suggestion: use a sudoer instead (ec2-user, available in the AMI, is one).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6154) Support Kafka, JDBC in Scala 2.11

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6154.
--
   Resolution: Fixed
 Assignee: Iulian Dragos
Fix Version/s: 1.5.0

I think this is resolved by SPARK-8013, effectively? 
https://github.com/apache/spark/pull/6903

 Support Kafka, JDBC in Scala 2.11
 -

 Key: SPARK-6154
 URL: https://issues.apache.org/jira/browse/SPARK-6154
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.3.0
Reporter: Jianshi Huang
Assignee: Iulian Dragos
 Fix For: 1.5.0


 Building v1.3.0-rc2 with Scala 2.11 using the instructions in the documentation 
 failed when -Phive-thriftserver is enabled.
 [info] Compiling 9 Scala sources to 
 /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes...
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2
 5: object ConsoleReader is not a member of package jline
 [error] import jline.{ConsoleReader, History}
 [error]^
 [warn] Class jline.Completor not found - continuing with a stub.
 [warn] Class jline.ConsoleReader not found - continuing with a stub.
 [error] 
 /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1
 65: not found: type ConsoleReader
 [error] val reader = new ConsoleReader()
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8976) Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed)

2015-07-10 Thread Olivier Delalleau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622466#comment-14622466
 ] 

Olivier Delalleau commented on SPARK-8976:
--

NB: I fixed the issue by replacing line 149 of worker.py with:
sock_file = sock.makefile("rwb", 65536)
but I'm not sure it's a good fix (and I don't know if it's compatible with 
Python 2).

 Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed)
 

 Key: SPARK-8976
 URL: https://issues.apache.org/jira/browse/SPARK-8976
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Windows 7
Reporter: Olivier Delalleau

 See Github report: 
 https://github.com/apache/spark/pull/5173#issuecomment-113410652



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7944:
-
Assignee: Iulian Dragos

 Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
 

 Key: SPARK-7944
 URL: https://issues.apache.org/jira/browse/SPARK-7944
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.1, 1.4.0
 Environment: scala 2.11
Reporter: Alexander Nakos
Assignee: Iulian Dragos
Priority: Critical
 Fix For: 1.5.0

 Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt


 When I run the spark-shell with the --jars argument and supply a path to a 
 single jar file, none of the classes in the jar are available in the REPL.
 I have encountered this same behaviour in both 1.3.1 and 1.4.0_RC-03 builds 
 for scala 2.11. I have yet to do a 1.4.0 RC-03 build for scala 2.10, but the 
 contents of the jar are available in the 1.3.1_2.10 REPL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8980) Setup cluster with spark-ec2 scripts as non-root user

2015-07-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622482#comment-14622482
 ] 

Sean Owen commented on SPARK-8980:
--

Wasn't the conclusion from the thread that this isn't going to happen?

 Setup cluster with spark-ec2 scripts as non-root user
 -

 Key: SPARK-8980
 URL: https://issues.apache.org/jira/browse/SPARK-8980
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.4.0
Reporter: Mathieu DESPRIEE
Priority: Minor

 The spark-ec2 scripts install everything as root, which is not a best practice.
 Suggestion: use a sudoer instead (ec2-user, available in the AMI, is one).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7944) Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7944.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6903
[https://github.com/apache/spark/pull/6903]

 Spark-Shell 2.11 1.4.0-RC-03 does not add jars to class path
 

 Key: SPARK-7944
 URL: https://issues.apache.org/jira/browse/SPARK-7944
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.1, 1.4.0
 Environment: scala 2.11
Reporter: Alexander Nakos
Priority: Critical
 Fix For: 1.5.0

 Attachments: spark_shell_output.txt, spark_shell_output_2.10.txt


 When I run the spark-shell with the --jars argument and supply a path to a 
 single jar file, none of the classes in the jar are available in the REPL.
 I have encountered this same behaviour in both 1.3.1 and 1.4.0_RC-03 builds 
 for scala 2.11. I have yet to do a 1.4.0 RC-03 build for scala 2.10, but the 
 contents of the jar are available in the 1.3.1_2.10 REPL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-07-10 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622510#comment-14622510
 ] 

Iulian Dragos commented on SPARK-5281:
--

Thanks for pointing them out. Glad it wasn't too bad :)

 Registering table on RDD is giving MissingRequirementError
 --

 Key: SPARK-5281
 URL: https://issues.apache.org/jira/browse/SPARK-5281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.1
Reporter: sarsol
Assignee: Iulian Dragos
Priority: Critical
 Fix For: 1.4.0


 Application crashes on this line  {{rdd.registerTempTable(temp)}}  in 1.2 
 version when using sbt or Eclipse SCALA IDE
 Stacktrace:
 {code}
 Exception in thread main scala.reflect.internal.MissingRequirementError: 
 class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
 primordial classloader with boot classpath 
 [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
  Files\Java\jre7\lib\resources.jar;C:\Program 
 Files\Java\jre7\lib\rt.jar;C:\Program 
 Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
 Files\Java\jre7\lib\jsse.jar;C:\Program 
 Files\Java\jre7\lib\jce.jar;C:\Program 
 Files\Java\jre7\lib\charsets.jar;C:\Program 
 Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
   at 
 scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
   at 
 scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
   at 
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
   at 
 com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
   at 
 scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
   at scala.App$class.main(App.scala:71)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8974) The spark-dynamic-executor-allocation may be not supported

2015-07-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622101#comment-14622101
 ] 

Sean Owen commented on SPARK-8974:
--

Why do you say this means it's not supported? It sounds like it works, but are 
you saying there is a problem in error recovery? Executor allocation should 
fail in this case, but it should succeed if the AM recovers.

 The spark-dynamic-executor-allocation may be not supported
 --

 Key: SPARK-8974
 URL: https://issues.apache.org/jira/browse/SPARK-8974
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: KaiXinXIaoLei
 Fix For: 1.4.1


 In yarn-client mode with the config option spark.dynamicAllocation.enabled set 
 to true, when the ApplicationMaster is dead or disconnected and tasks are 
 submitted before a new ApplicationMaster starts, the 
 spark-dynamic-executor-allocation thread throws an exception. When the 
 ApplicationMaster is running again and no tasks are running, the number of 
 executors is not zero. So the dynamicAllocation feature is effectively not 
 supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7977) Disallow println

2015-07-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7977:
-
Assignee: Jon Alter

 Disallow println
 

 Key: SPARK-7977
 URL: https://issues.apache.org/jira/browse/SPARK-7977
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Jon Alter
  Labels: starter
 Fix For: 1.5.0


 Very often we see pull requests that added println from debugging, but the 
 author forgot to remove it before code review.
 We can use the regex checker to disallow println. For legitimate use of 
 println, we can then disable the rule where they are used.
 Add to scalastyle-config.xml file:
 {code}
   <check customId="println" level="error" class="org.scalastyle.scalariform.TokenChecker" enabled="true">
     <parameters><parameter name="regex">^println$</parameter></parameters>
     <customMessage><![CDATA[Are you sure you want to println? If yes, wrap the code block with
       // scalastyle:off println
       println(...)
       // scalastyle:on println]]></customMessage>
   </check>
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8815) illegal java package names in jar

2015-07-10 Thread Sam Halliday (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622211#comment-14622211
 ] 

Sam Halliday commented on SPARK-8815:
-

Interesting. BTW, I see you're at ScalaX in December. I'll see you there! I 
gave a talk last year about high performance mathematics (i.e. netlib-java), 
but this year I'll be talking about generic programming.

 illegal java package names in jar
 -

 Key: SPARK-8815
 URL: https://issues.apache.org/jira/browse/SPARK-8815
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Sam Halliday
Priority: Minor

 In ENSIME we were unable to index the spark jars and we investigated 
 further... you have classes that look like this:
 org.spark-project.guava.annotations.VisibleForTesting
 Hyphens are not legal in package names according to the Java language spec, so 
 I'm amazed that this can actually be read at runtime... certainly no compiler 
 I know would allow it.
 What I suspect is happening is that you're using a build plugin that 
 internalises some of your dependencies and it is using your groupId but not 
 validating it... and then blindly using that name in the ASM manipulation.
 You might want to report this upstream with your build plugin.
 For your next release, I recommend using an explicit name that is not your 
 groupId. i.e. convert hyphens to underscores as Gosling recommends.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8975) Implement a mechanism to send a new rate from the driver to the block generator

2015-07-10 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-8975:


 Summary: Implement a mechanism to send a new rate from the driver 
to the block generator
 Key: SPARK-8975
 URL: https://issues.apache.org/jira/browse/SPARK-8975
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Iulian Dragos


Full design doc 
[here|https://docs.google.com/document/d/1ls_g5fFmfbbSTIfQQpUxH56d0f3OksF567zwA00zK9E/edit?usp=sharing]

- Add a new message, {{RateUpdate(newRate: Long)}}, that ReceiverSupervisor 
handles in its endpoint.
- Add a new method to ReceiverTracker, 
{{def sendRateUpdate(streamId: Int, newRate: Long): Unit}}; this method sends an 
asynchronous RateUpdate message to the receiver supervisor corresponding to 
streamId.
- Update the rate in the corresponding block generator (see the sketch after this list).
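
A heavily simplified sketch of those three pieces (names not quoted above are assumptions, and the real RPC plumbing is omitted):

{code}
// New message handled by the receiver supervisor's endpoint.
case class RateUpdate(newRate: Long)

trait BlockGeneratorLike { def updateRate(newRate: Long): Unit }

class ReceiverSupervisorSketch(blockGenerator: BlockGeneratorLike) {
  // On RateUpdate, adjust the block generator's rate limit.
  def receive: PartialFunction[Any, Unit] = {
    case RateUpdate(newRate) => blockGenerator.updateRate(newRate)
  }
}

class ReceiverTrackerSketch(endpoints: Map[Int, Any => Unit]) {
  // Asynchronously push a new max rate to the supervisor handling streamId.
  def sendRateUpdate(streamId: Int, newRate: Long): Unit =
    endpoints.get(streamId).foreach(send => send(RateUpdate(newRate)))
}
{code}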



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2015-07-10 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622342#comment-14622342
 ] 

RJ Nowling commented on SPARK-3644:
---

[~joshrosen] Thanks for pointing to the new JIRA! :)

 REST API for Spark application info (jobs / stages / tasks / storage info)
 --

 Key: SPARK-3644
 URL: https://issues.apache.org/jira/browse/SPARK-3644
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Web UI
Reporter: Josh Rosen
Assignee: Imran Rashid
 Fix For: 1.4.0


 This JIRA is a forum to draft a design proposal for a REST interface for 
 accessing information about Spark applications, such as job / stage / task / 
 storage status.
 There have been a number of proposals to serve JSON representations of the 
 information displayed in Spark's web UI.  Given that we might redesign the 
 pages of the web UI (and possibly re-implement the UI as a client of a REST 
 API), the API endpoints and their responses should be independent of what we 
 choose to display on particular web UI pages / layouts.
 Let's start a discussion of what a good REST API would look like from 
 first-principles.  We can discuss what urls / endpoints expose access to 
 data, how our JSON responses will be formatted, how fields will be named, how 
 the API will be documented and tested, etc.
 Some links for inspiration:
 https://developer.github.com/v3/
 http://developer.netflix.com/docs/REST_API_Reference
 https://helloreverb.com/developers/swagger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8976) Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed)

2015-07-10 Thread Olivier Delalleau (JIRA)
Olivier Delalleau created SPARK-8976:


 Summary: Python 3 crash: ValueError: invalid mode 'a+' (only r, w, 
b allowed)
 Key: SPARK-8976
 URL: https://issues.apache.org/jira/browse/SPARK-8976
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Windows 7
Reporter: Olivier Delalleau


See Github report: 
https://github.com/apache/spark/pull/5173#issuecomment-113410652



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8982) Worker hostnames not showing in Master web ui when launched with start-slaves.sh

2015-07-10 Thread Ben Zimmer (JIRA)
Ben Zimmer created SPARK-8982:
-

 Summary: Worker hostnames not showing in Master web ui when 
launched with start-slaves.sh
 Key: SPARK-8982
 URL: https://issues.apache.org/jira/browse/SPARK-8982
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Ben Zimmer
Priority: Minor


If a --host argument is not provided to Worker, WorkerArguments uses 
Utils.localHostName to find the host name. SPARK-6440 changed the functionality 
of Utils.localHostName to retrieve the local IP address instead of host name.

Since start-slave.sh does not provide the --host argument, clusters started 
with start-slaves.sh now show IP addresses instead of hostnames in the Master 
web UI. This is inconvenient when starting and debugging small clusters.

A simple fix would be to find the local machine's hostname in start-slave.sh 
and pass it as the --host argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8982) Worker hostnames not showing in Master web ui when launched with start-slaves.sh

2015-07-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622688#comment-14622688
 ] 

Apache Spark commented on SPARK-8982:
-

User 'bdzimmer' has created a pull request for this issue:
https://github.com/apache/spark/pull/7345

 Worker hostnames not showing in Master web ui when launched with 
 start-slaves.sh
 

 Key: SPARK-8982
 URL: https://issues.apache.org/jira/browse/SPARK-8982
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Ben Zimmer
Priority: Minor

 If a --host argument is not provided to Worker, WorkerArguments uses 
 Utils.localHostName to find the host name. SPARK-6440 changed the 
 functionality of Utils.localHostName to retrieve the local IP address instead 
 of host name.
 Since start-slave.sh does not provide the --host argument, clusters started 
 with start-slaves.sh now show IP addresses instead of hostnames in the Master 
 web UI. This is inconvenient when starting and debugging small clusters.
 A simple fix would be to find the local machine's hostname in start-slave.sh 
 and pass it as the --host argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8982) Worker hostnames not showing in Master web ui when launched with start-slaves.sh

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8982:
---

Assignee: Apache Spark

 Worker hostnames not showing in Master web ui when launched with 
 start-slaves.sh
 

 Key: SPARK-8982
 URL: https://issues.apache.org/jira/browse/SPARK-8982
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Ben Zimmer
Assignee: Apache Spark
Priority: Minor

 If a --host argument is not provided to Worker, WorkerArguments uses 
 Utils.localHostName to find the host name. SPARK-6440 changed the 
 functionality of Utils.localHostName to retrieve the local IP address instead 
 of host name.
 Since start-slave.sh does not provide the --host argument, clusters started 
 with start-slaves.sh now show IP addresses instead of hostnames in the Master 
 web UI. This is inconvenient when starting and debugging small clusters.
 A simple fix would be to find the local machine's hostname in start-slave.sh 
 and pass it as the --host argument.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6447) Add quick-links to StagePage to jump to Accumulator/Task tables

2015-07-10 Thread Ryan Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Williams resolved SPARK-6447.
--
Resolution: Duplicate

 Add quick-links to StagePage to jump to Accumulator/Task tables
 ---

 Key: SPARK-6447
 URL: https://issues.apache.org/jira/browse/SPARK-6447
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Ryan Williams
Priority: Minor

 When there are many executors it is tedious to have to scroll down the page 
 to find the start of the Accumulators / Tasks tables.
 We should add links at the top of the page that jump to a URL fragment bound 
 to them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8986) GaussianMixture should take smoothing param

2015-07-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-8986:


 Summary: GaussianMixture should take smoothing param
 Key: SPARK-8986
 URL: https://issues.apache.org/jira/browse/SPARK-8986
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley


Gaussian mixture models should take a smoothing parameter which makes the 
algorithm robust against degenerate data or bad initializations.

Whoever takes this JIRA should look at other libraries (sklearn, R packages, 
Weka, etc.) to see how they do smoothing and what their API looks like.  Please 
summarize your findings here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet

2015-07-10 Thread Matt Massie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622704#comment-14622704
 ] 

Matt Massie commented on SPARK-7263:


[~rxin] What are your thoughts? I'd like to keep moving this forward.

 Add new shuffle manager which stores shuffle blocks in Parquet
 --

 Key: SPARK-7263
 URL: https://issues.apache.org/jira/browse/SPARK-7263
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Reporter: Matt Massie

 I have a working prototype of this feature that can be viewed at
 https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1
 Setting spark.shuffle.manager to parquet enables this shuffle manager.
 The dictionary support that Parquet provides appreciably reduces the amount of
 memory that objects use; however, once Parquet data is shuffled, all the
 dictionary information is lost and the column-oriented data is written to
 shuffle blocks in a record-oriented fashion. This shuffle manager addresses
 this issue by reading and writing all shuffle blocks in the Parquet format.
 If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a
 Parquet schema and used directly; otherwise, the Parquet schema is generated
 via reflection.
 Currently, the only non-Avro keys supported are primitive types. The reflection
 code can be improved (or replaced) to support complex records.
 The ParquetShufflePair class allows the shuffle key and value to be stored in
 Parquet blocks as a single record with a single schema.
 This commit adds the following new Spark configuration options (see the example
 after this description):
 spark.shuffle.parquet.compression - sets the Parquet compression codec
 spark.shuffle.parquet.blocksize - sets the Parquet block size
 spark.shuffle.parquet.pagesize - sets the Parquet page size
 spark.shuffle.parquet.enabledictionary - turns dictionary encoding on/off
 Parquet does not (and has no plans to) support a streaming API. Metadata
 sections are scattered through a Parquet file, making a streaming API
 difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch
 the entire contents of map outputs into temporary blocks before loading the
 data into the reducer.
 Interesting future asides:
 o There is no need to define a data serializer (although Spark requires it)
 o Parquet supports predicate pushdown and projection, which could be used
   between shuffle stages to improve performance in the future
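 A hedged example of how a job might enable the prototype (the option names are the ones listed above; the values are illustrative):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 val conf = new SparkConf()
   .setAppName("parquet-shuffle-demo")
   .set("spark.shuffle.manager", "parquet")
   .set("spark.shuffle.parquet.compression", "SNAPPY")            // illustrative value
   .set("spark.shuffle.parquet.blocksize", (64 * 1024 * 1024).toString)
   .set("spark.shuffle.parquet.pagesize", (1024 * 1024).toString)
   .set("spark.shuffle.parquet.enabledictionary", "true")

 val sc = new SparkContext(conf)
 // Any shuffle-producing job would now read and write its shuffle blocks as Parquet.
 sc.parallelize(1 to 1000).map(i => (i % 10, i)).reduceByKey(_ + _).count()
 {code}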



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8983) ML Tuning Cross-Validation Improvements

2015-07-10 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-8983:


 Summary: ML Tuning Cross-Validation Improvements
 Key: SPARK-8983
 URL: https://issues.apache.org/jira/browse/SPARK-8983
 Project: Spark
  Issue Type: Umbrella
  Components: ML
Reporter: Feynman Liang


This is an umbrella for grouping together various improvements to pipeline 
tuning features, centralizing developer communication, and encouraging code 
reuse and common interfaces.

We currently only support k-fold CV in {{CrossValidator}} while competing 
packages (e.g. [R caret|http://topepo.github.io/caret/splitting.html]) are much 
more feature rich, supporting balanced class labels, hold-out for time-series 
data, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-1301) Add UI elements to collapse Aggregated Metrics by Executor pane on stage page

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1301:
---

Assignee: (was: Apache Spark)

 Add UI elements to collapse Aggregated Metrics by Executor pane on stage 
 page
 ---

 Key: SPARK-1301
 URL: https://issues.apache.org/jira/browse/SPARK-1301
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Matei Zaharia
Priority: Minor
  Labels: Starter

 This table is useful but it takes up a lot of space on larger clusters, 
 hiding the more commonly accessed stage page. We could also move the table 
 below if collapsing it is difficult.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8069) Add support for cutoff to RandomForestClassifier

2015-07-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8069:
-
Description: 
Consider adding support for cutoffs similar to 
http://cran.r-project.org/web/packages/randomForest/randomForest.pdf 

(Joseph) I just wrote a [little design doc | 
https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing]
 for this.

  was:Consider adding support for cutoffs similar to 
http://cran.r-project.org/web/packages/randomForest/randomForest.pdf 


 Add support for cutoff to RandomForestClassifier
 

 Key: SPARK-8069
 URL: https://issues.apache.org/jira/browse/SPARK-8069
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: holdenk
Priority: Minor
   Original Estimate: 240h
  Remaining Estimate: 240h

 Consider adding support for cutoffs similar to 
 http://cran.r-project.org/web/packages/randomForest/randomForest.pdf 
 (Joseph) I just wrote a [little design doc | 
 https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing]
  for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8985) Create a test harness to improve Spark's combinatorial test coverage of non-default configuration

2015-07-10 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-8985:
-

 Summary: Create a test harness to improve Spark's combinatorial 
test coverage of non-default configuration
 Key: SPARK-8985
 URL: https://issues.apache.org/jira/browse/SPARK-8985
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Josh Rosen


Large numbers of Spark bugs could be caught by running a trivial set of 
end-to-end tests with a non-standard SparkConf configuration.

This ticket exists to assemble a list of such bugs and the configurations which 
would have caught them.  I think that we should build a separate Jenkins 
harness which runs end-to-end tests across a huge configuration matrix in order 
to detect these issues.  If the test configuration matrix grows to be too large 
to be tested daily, then we can explore combinatorial testing approaches to 
test fewer configurations while still achieving a high level of combinatorial 
coverage.

**Bugs listed in order of the test configurations which would have caught 
them:**

* spark.python.worker.reuse=false:
   ** SPARK-8976



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7210) Test matrix decompositions for speed vs. numerical stability for Gaussians

2015-07-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622737#comment-14622737
 ] 

Joseph K. Bradley commented on SPARK-7210:
--

More thoughts from Reza: We should consider degenerate cases, and to say we 
handle them correctly, we can compare with R as a reasonable gold standard.  
E.g., how does it handle normal PDFs when the covariance matrix is not full 
rank?

Relatedly, we should add a smoothing parameter to GaussianMixture.  That might 
actually be higher priority than this JIRA.  I'll make a JIRA for that and link 
it from the umbrella.
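
For reference, a small Breeze sketch (not MLlib code; the function names and tolerance are assumptions) of the log-determinant computed both ways: Cholesky is cheaper but requires a strictly positive-definite matrix, while SVD still gives a usable pseudo-log-determinant for rank-deficient covariances.

{code}
import breeze.linalg.{DenseMatrix, cholesky, diag, svd}

// Cheaper, but fails if cov is not strictly positive definite.
def logDetCholesky(cov: DenseMatrix[Double]): Double = {
  val L = cholesky(cov)
  2.0 * diag(L).toArray.map(math.log).sum
}

// Slower, but well defined even when cov is rank deficient: drop the
// near-zero singular values and sum the logs of the rest.
def logDetSvd(cov: DenseMatrix[Double], tol: Double = 1e-12): Double = {
  val s = svd(cov).singularValues
  s.toArray.filter(_ > tol).map(math.log).sum
}
{code}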

 Test matrix decompositions for speed vs. numerical stability for Gaussians
 --

 Key: SPARK-7210
 URL: https://issues.apache.org/jira/browse/SPARK-7210
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 We currently use SVD for inverting the Gaussian's covariance matrix and 
 computing the determinant.  SVD is numerically stable but slow.  We could 
 experiment with Cholesky, etc. to figure out a better option, or a better 
 option for certain settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8976) Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed)

2015-07-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622623#comment-14622623
 ] 

Josh Rosen commented on SPARK-8976:
---

I think that this problem is Windows-specific.  The code near line 149 of 
worker.py will typically not be executed on non-Windows machines as long as 
{{spark.python.worker.reuse=true}} (the default).

I think the right fix is adding a regression test which tries running simple 
PySpark jobs with {{spark.python.worker.reuse=false}}, then fixing the 
underlying bug by passing {{rwb}} instead of {{a+}}.

If we get a regression test working on Jenkins, then we'll be able to verify 
that the fix is safe for Python 2 and 3 because Jenkins tests both of those 
Python versions.

Would you like to submit a pull request for this?  I'd do it myself but I'm a 
bit swamped with other work right now.

 Python 3 crash: ValueError: invalid mode 'a+' (only r, w, b allowed)
 

 Key: SPARK-8976
 URL: https://issues.apache.org/jira/browse/SPARK-8976
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Windows 7
Reporter: Olivier Delalleau

 See Github report: 
 https://github.com/apache/spark/pull/5173#issuecomment-113410652



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC

2015-07-10 Thread Paweł Kopiczko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622666#comment-14622666
 ] 

Paweł Kopiczko commented on SPARK-8981:
---

Sure. I believe that when executor is spawned it has access to `appName` and 
`applicationId` properties of `SparkContext` instance. I'd like it to put these 
values in MDC 
https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String,
 java.lang.Object). Having those values in MDC and setting `%X{appName}` and 
`%X{applicationId}` in log4j's PatternLayout would allow filtering out specific 
application logs from a single file. Does that make sense?
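
For what it's worth, a minimal sketch of the idea (assuming log4j 1.2's MDC and that the values get registered wherever the executor has them available; the method name is made up):

{code}
import org.apache.log4j.MDC
import org.apache.spark.SparkContext

// Hypothetical hook run at startup: stash the application identity in the MDC
// so a PatternLayout can emit it via %X{appName} / %X{applicationId}.
def registerAppInMdc(sc: SparkContext): Unit = {
  MDC.put("appName", sc.appName)
  MDC.put("applicationId", sc.applicationId)
}

// Matching log4j.properties pattern (shown as a comment, since that file is not Scala):
// log4j.appender.file.layout.ConversionPattern=%d %p [%X{applicationId}/%X{appName}] %c: %m%n
{code}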

 Set applicationId and appName in log4j MDC
 --

 Key: SPARK-8981
 URL: https://issues.apache.org/jira/browse/SPARK-8981
 Project: Spark
  Issue Type: New Feature
Reporter: Paweł Kopiczko
Priority: Minor

 It would be nice to have, because it's good to have logs in one file when 
 using log agents (like Logentries) in standalone mode. It also allows 
 configuring a rolling file appender without a mess when multiple applications 
 are running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8981) Set applicationId and appName in log4j MDC

2015-07-10 Thread Paweł Kopiczko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622666#comment-14622666
 ] 

Paweł Kopiczko edited comment on SPARK-8981 at 7/10/15 6:00 PM:


Sure. I believe that when executor is spawned it has access to appName and 
applicationId properties of `SparkContext` instance. I'd like it to put these 
values in MDC 
https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String,
 java.lang.Object). Having those values in MDC and setting %X{appName} and 
%X{applicationId} in log4j's PatternLayout would allow filtering out specific 
application logs from a single file. Does that make sense?


was (Author: kopiczko):
Sure. I believe that when executor is spawned it has access to `appName` and 
`applicationId` properties of `SparkContext` instance. I'd like it to put these 
values in MDC 
https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String,
 java.lang.Object). Having those values in MDC and setting `%X{appName}` and 
`%X{applicationId}` in log4j's PatternLayout would allow filtering out specific 
application logs from a single file. Does that make sense?

 Set applicationId and appName in log4j MDC
 --

 Key: SPARK-8981
 URL: https://issues.apache.org/jira/browse/SPARK-8981
 Project: Spark
  Issue Type: New Feature
Reporter: Paweł Kopiczko
Priority: Minor

 It would be nice to have, because it's good to have logs in one file when 
 using log agents (like Logentries) in standalone mode. It also allows 
 configuring a rolling file appender without a mess when multiple applications 
 are running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8981) Set applicationId and appName in log4j MDC

2015-07-10 Thread Paweł Kopiczko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622666#comment-14622666
 ] 

Paweł Kopiczko edited comment on SPARK-8981 at 7/10/15 6:04 PM:


Sure. I believe that when executor is spawned it has access to {{appName}} and 
{{applicationId}} properties of `SparkContext` instance. I'd like it to put 
these values in MDC 
https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String,
 java.lang.Object). Having those values in MDC and setting %X\{appName\} and 
%X\{applicationId\} in log4j's PatternLayout would allow filtering out specific 
application logs from a single file. Does that make sense?


was (Author: kopiczko):
Sure. I believe that when executor is spawned it has access to appName and 
applicationId properties of `SparkContext` instance. I'd like it to put these 
values in MDC 
https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html#put(java.lang.String,
 java.lang.Object). Having those values in MDC and setting %X{appName} and 
%X{applicationId} in log4j's PatternLayout would allow filtering out specific 
application logs from a single file. Does that make sense?

 Set applicationId and appName in log4j MDC
 --

 Key: SPARK-8981
 URL: https://issues.apache.org/jira/browse/SPARK-8981
 Project: Spark
  Issue Type: New Feature
Reporter: Paweł Kopiczko
Priority: Minor

 It would be nice to have, because it's good to have logs in one file when 
 using log agents (like Logentries) in standalone mode. It also allows 
 configuring a rolling file appender without a mess when multiple applications 
 are running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-1301) Add UI elements to collapse Aggregated Metrics by Executor pane on stage page

2015-07-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1301:
---

Assignee: Apache Spark

 Add UI elements to collapse Aggregated Metrics by Executor pane on stage 
 page
 ---

 Key: SPARK-1301
 URL: https://issues.apache.org/jira/browse/SPARK-1301
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Matei Zaharia
Assignee: Apache Spark
Priority: Minor
  Labels: Starter

 This table is useful but it takes up a lot of space on larger clusters, 
 hiding the more commonly accessed stage page. We could also move the table 
 below if collapsing it is difficult.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8069) Add support for cutoff to RandomForestClassifier

2015-07-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8069:
-
Assignee: holdenk

 Add support for cutoff to RandomForestClassifier
 

 Key: SPARK-8069
 URL: https://issues.apache.org/jira/browse/SPARK-8069
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: holdenk
Assignee: holdenk
Priority: Minor
   Original Estimate: 240h
  Remaining Estimate: 240h

 Consider adding support for cutoffs similar to 
 http://cran.r-project.org/web/packages/randomForest/randomForest.pdf 
 (Joseph) I just wrote a [little design doc | 
 https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing]
  for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8984) Developer documentation for ML Pipelines

2015-07-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622745#comment-14622745
 ] 

Joseph K. Bradley commented on SPARK-8984:
--

Linked attribute doc JIRA since attributes will be important (and fairly 
complex) for developers.

 Developer documentation for ML Pipelines
 

 Key: SPARK-8984
 URL: https://issues.apache.org/jira/browse/SPARK-8984
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, ML
Reporter: Feynman Liang
Priority: Minor

 This issue will track work on developer-specific documentation for the ML 
 Pipelines API. The goal is to provide documentation for how to write custom 
 estimators/transformers, various concepts (e.g. Params, attributes) and the 
 rationale behind design decisions.
 We do not aim to duplicate the [ML programming 
 guide|http://spark.apache.org/docs/latest/ml-guide.html]. Rather, the target 
 audience is developers and contributors to ML pipelines.
 Documentation is currently read-only on [Google 
 Docs|https://docs.google.com/document/d/1rRc2o8AIH2Y4U8A7P3yopbT-fAXV1u2UO3wBAQ-vQYM/edit?usp=sharing].
  Please ask if you would like to contribute.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3155) Support DecisionTree pruning

2015-07-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622529#comment-14622529
 ] 

Joseph K. Bradley commented on SPARK-3155:
--

I don't know if there is a nice paper explaining the implementation, but I do 
know it's quite standard based on hearsay, so I suspect there are papers or 
docs explaining it.

The issue is still very relevant.

No one is implementing the feature as far as I know.  However, do be aware of 
[SPARK-7131], which has an open PR (to be merged soon, I hope).

 Support DecisionTree pruning
 

 Key: SPARK-3155
 URL: https://issues.apache.org/jira/browse/SPARK-3155
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 Improvement: accuracy, computation
 Summary: Pruning is a common method for preventing overfitting with decision 
 trees.  A smart implementation can prune the tree during training in order to 
 avoid training parts of the tree which would be pruned eventually anyways.  
 DecisionTree does not currently support pruning.
 Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
 with zero or more branches removed.
 A naive implementation prunes as follows:
 (1) Train a depth K tree using a training set.
 (2) Compute the optimal prediction at each node (including internal nodes) 
 based on the training set.
 (3) Take a held-out validation set, and use the tree to make predictions for 
 each validation example.  This allows one to compute the validation error 
 made at each node in the tree (based on the predictions computed in step (2).)
  (4) For each pair of leaves with the same parent, compare the total error on 
  the validation set made by the leaves’ predictions with the error made by the 
  parent’s predictions.  Remove the leaves if the parent has lower error.
 A smarter implementation prunes during training, computing the error on the 
 validation set made by each node as it is trained.  Whenever two children 
 increase the validation error, they are pruned, and no more training is 
 required on that branch.
 It is common to use about 1/3 of the data for pruning.  Note that pruning is 
 important when using a tree directly for prediction.  It is less important 
 when combining trees via ensemble methods.
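 A toy Scala sketch of the naive post-pruning in steps (1)-(4) above (assuming each node already carries its training-set prediction and the validation error that prediction incurs on the examples routed to it; all names are illustrative):
 {code}
 sealed trait Tree { def validationError: Double }
 case class Leaf(prediction: Double, validationError: Double) extends Tree
 case class Internal(prediction: Double, validationError: Double,
                     left: Tree, right: Tree) extends Tree

 object Pruning {
   // Step (4), applied bottom-up: if the parent's own prediction makes no more
   // validation error than its (already pruned) children combined, collapse
   // the subtree into a leaf.
   def prune(t: Tree): Tree = t match {
     case leaf: Leaf => leaf
     case Internal(pred, err, l, r) =>
       val (pl, pr) = (prune(l), prune(r))
       if (err <= pl.validationError + pr.validationError) Leaf(pred, err)
       else Internal(pred, err, pl, pr)
   }
 }
 {code}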



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


