[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491820#comment-14491820 ] Yijie Shen commented on SPARK-6859: --- I opened a JIRA ticket in Parquet: [PARQUET-251|https://issues.apache.org/jira/browse/PARQUET-251] Parquet File Binary column statistics error when reuse byte[] among rows Key: SPARK-6859 URL: https://issues.apache.org/jira/browse/SPARK-6859 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Yijie Shen Priority: Minor Suppose I create a dataRDD which extends RDD[Row], where each row is a GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused among rows, but its content changes each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row-group statistics (max/min) for the Binary column are wrong. Here is the reason: in Parquet, BinaryStatistic keeps max/min only as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed in from the row: max: Binary --(references)--> ByteArrayBackedBinary --(backed by)--> Array[Byte]. Therefore, each time Parquet updates the row group's statistics, max and min still refer to the same Array[Byte], which holds new content each time. When Parquet writes them to the file, the last row's content is saved as both max and min. It seems to be a Parquet bug, since it is Parquet's responsibility to update statistics correctly, but I am not quite sure. Should I report it as a bug in the Parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
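To make the aliasing concrete, here is a minimal, self-contained Scala sketch of the failure mode described above (plain Scala, no Spark/Parquet APIs; all names are illustrative):
{code}
object StatsAliasingDemo {
  class BinaryStats { var max: Array[Byte] = _ } // stands in for Parquet's BinaryStatistics

  def main(args: Array[String]): Unit = {
    val stats = new BinaryStats
    val buf = new Array[Byte](3) // one buffer reused across "rows"

    for (row <- Seq("aaa", "zzz", "mmm")) {
      row.getBytes("UTF-8").copyToArray(buf) // new content, same object
      if (stats.max == null || new String(buf) > new String(stats.max)) {
        stats.max = buf // BUG: stores the reference; a defensive copy (buf.clone()) would be correct
      }
    }
    println(new String(stats.max)) // prints "mmm" (the last row), not "zzz" (the true max)
  }
}
{code}
Because {{stats.max}} aliases the reused buffer, every later comparison is against the buffer's current content, so whatever the last row held is reported as the maximum.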
[jira] [Updated] (SPARK-6199) Support CTE
[ https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6199: --- Assignee: (was: Cheng Hao) Support CTE --- Key: SPARK-6199 URL: https://issues.apache.org/jira/browse/SPARK-6199 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Fix For: 1.4.0 Support CTE in SQLContext and HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
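For illustration, a CTE query through SQLContext could look like the following once this lands (the {{people}} table and its columns are hypothetical):
{code}
val df = sqlContext.sql(
  """WITH young AS (SELECT name, age FROM people WHERE age < 30)
    |SELECT name FROM young ORDER BY age""".stripMargin)
{code}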
[jira] [Created] (SPARK-6875) Add support for Joda-time types
Patrick Grandjean created SPARK-6875: Summary: Add support for Joda-time types Key: SPARK-6875 URL: https://issues.apache.org/jira/browse/SPARK-6875 Project: Spark Issue Type: Improvement Components: SQL Reporter: Patrick Grandjean The need comes from the following use case:
{code}
val objs: RDD[MyClass] = [...]
val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._
objs.saveAsParquetFile("parquet")
{code}
MyClass contains Joda-time fields. When saving to a Parquet file, an exception is thrown (a MatchError in ScalaReflection.scala). Spark SQL supports the java.sql date/time types; this request is to add support for Joda-time types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
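Until such support exists, one possible workaround is to convert Joda-time fields to the already-supported java.sql types before building the RDD that gets saved. A sketch under assumptions: MyClass's actual fields are unknown, so the names below are hypothetical.
{code}
import java.sql.Timestamp
import org.joda.time.DateTime

// Mirror of the hypothetical MyClass with its Joda-time field converted to
// java.sql.Timestamp, which ScalaReflection already understands.
case class MyClassSql(id: Int, created: Timestamp)

def toSqlFriendly(id: Int, created: DateTime): MyClassSql =
  MyClassSql(id, new Timestamp(created.getMillis))
{code}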
[jira] [Commented] (SPARK-6849) The constructor of GradientDescent should be public
[ https://issues.apache.org/jira/browse/SPARK-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491861#comment-14491861 ] Guoqiang Li commented on SPARK-6849: [~srowen] https://github.com/cloudml/zen The constructor of GradientDescent should be public --- Key: SPARK-6849 URL: https://issues.apache.org/jira/browse/SPARK-6849 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: Guoqiang Li Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6849) The constructor of GradientDescent should be public
[ https://issues.apache.org/jira/browse/SPARK-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491887#comment-14491887 ] Joseph K. Bradley commented on SPARK-6849: -- It would be great to open up the optimization APIs, but I think we should clean them up before making them public. (Alternatively, we could make them public but mark them all as Experimental.) I hope we can figure out what cleanups are needed here: [https://issues.apache.org/jira/browse/SPARK-5256] The constructor of GradientDescent should be public --- Key: SPARK-6849 URL: https://issues.apache.org/jira/browse/SPARK-6849 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.1 Reporter: Guoqiang Li Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6545) Minor changes for CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-6545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491914#comment-14491914 ] Cheng Hao commented on SPARK-6545: -- Thank you [~srowen], we should close this for now; I will reopen it when I have a more general idea for the update. Minor changes for CompactBuffer --- Key: SPARK-6545 URL: https://issues.apache.org/jira/browse/SPARK-6545 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor HashedRelation should always return a non-null CompactBuffer, which will be helpful for the further improvement of multi-way join -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6643) Python API for StandardScalerModel
[ https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6643: - Assignee: Kai Sasaki Python API for StandardScalerModel -- Key: SPARK-6643 URL: https://issues.apache.org/jira/browse/SPARK-6643 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor Labels: mllib, python Fix For: 1.4.0 This is the sub-task of SPARK-6254. Wrap missing method for {{StandardScalerModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6643) Python API for StandardScalerModel
[ https://issues.apache.org/jira/browse/SPARK-6643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6643. -- Resolution: Fixed Issue resolved by pull request 5310 [https://github.com/apache/spark/pull/5310] Python API for StandardScalerModel -- Key: SPARK-6643 URL: https://issues.apache.org/jira/browse/SPARK-6643 Project: Spark Issue Type: Task Components: MLlib Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Labels: mllib, python Fix For: 1.4.0 This is the sub-task of SPARK-6254. Wrap missing method for {{StandardScalerModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-765) Test suite should run Spark example programs
[ https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491941#comment-14491941 ] Yu Ishikawa commented on SPARK-765: --- [~joshrosen] sorry, one more thing. Are we allowed to add test suites for spark.examples? We are discussing deprecating the static train() methods in Scala/Java on SPARK-6682. I think it is a good time to add test suites to spark.examples. Test suite should run Spark example programs Key: SPARK-765 URL: https://issues.apache.org/jira/browse/SPARK-765 Project: Spark Issue Type: New Feature Components: Examples Reporter: Josh Rosen The Spark test suite should also run each of the Spark example programs (the PySpark suite should do the same). This should be done through a shell script or other mechanism to simulate the environment setup used by end users that run those scripts. This would prevent problems like SPARK-764 from making it into releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6765) Turn scalastyle on for test code
[ https://issues.apache.org/jira/browse/SPARK-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491765#comment-14491765 ] Apache Spark commented on SPARK-6765: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5484 Turn scalastyle on for test code Key: SPARK-6765 URL: https://issues.apache.org/jira/browse/SPARK-6765 Project: Spark Issue Type: Improvement Components: Project Infra, Tests Reporter: Reynold Xin Assignee: Reynold Xin We should turn scalastyle on for test code. Test code should be as important as main code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491864#comment-14491864 ] Yi Zhou edited comment on SPARK-5791 at 4/13/15 2:57 AM: - We changed the file format from ORC to Parquet and tested based on the latest Spark code (1.4.0-SNAPSHOT). Got the result below: Spark SQL (2m28s) vs. Hive (3m12s) was (Author: jameszhouyi): We changed file format from ORC to Parquet. Got the result like below: Spark SQL(2m28s) vs. Hive (3m12s) [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables are joined -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491864#comment-14491864 ] Yi Zhou commented on SPARK-5791: We changed the file format from ORC to Parquet. Got the result below: Spark SQL (2m28s) vs. Hive (3m12s) [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Attachments: Physcial_Plan_Hive.txt, Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt Spark SQL shows poor performance when multiple tables are joined -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set
[ https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491873#comment-14491873 ] Jack Hu commented on SPARK-6847: Hi, [~sowen] I tested more cases: # only change the {{newlist.headOption.orElse(oldstate)}} to {{Some("a")}}, the issue still exists # only change the streaming batch interval to {{2 seconds}}, keep the {{newlist.headOption.orElse(oldstate)}} and checkpoint interval 10 seconds, the issue does not exist. So this issue may be related to the checkpoint interval and batch interval. Stack overflow on updateStateByKey which followed by a dstream with checkpoint set -- Key: SPARK-6847 URL: https://issues.apache.org/jira/browse/SPARK-6847 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Jack Hu Labels: StackOverflowError, Streaming The issue happens with the following sample code: it uses {{updateStateByKey}} followed by a {{map}} with a checkpoint interval of 10 seconds
{code}
val sparkConf = new SparkConf().setAppName("test")
val streamingContext = new StreamingContext(sparkConf, Seconds(10))
streamingContext.checkpoint("checkpoint")
val source = streamingContext.socketTextStream("localhost", )
val updatedResult = source.map((1, _)).updateStateByKey(
  (newlist: Seq[String], oldstate: Option[String]) => newlist.headOption.orElse(oldstate))
updatedResult.map(_._2)
  .checkpoint(Seconds(10))
  .foreachRDD((rdd, t) => {
    println("Deep: " + rdd.toDebugString.split("\n").length)
    println(t.toString() + ": " + rdd.collect.length)
  })
streamingContext.start()
streamingContext.awaitTermination()
{code}
From the output, we can see that the dependency chain keeps growing over time, the {{updateStateByKey}} state never gets checkpointed, and finally the stack overflow happens. Note: * The RDD in {{updatedResult.map(_._2)}} does get checkpointed in this case, but not the {{updateStateByKey}} state * If the {{checkpoint(Seconds(10))}} on the map result ( {{updatedResult.map(_._2)}} ) is removed, the stack overflow will not happen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set
[ https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491873#comment-14491873 ] Jack Hu edited comment on SPARK-6847 at 4/13/15 3:34 AM: - Hi, [~sowen] I tested more cases: # only change the {{newlist.headOption.orElse(oldstate)}} to {{Some("a")}}, the issue still exists # only change the streaming batch interval to {{2 seconds}}, keep the {{newlist.headOption.orElse(oldstate)}} and checkpoint interval 10 seconds, the issue does not exist. So this issue may be related to the checkpoint interval and batch interval. was (Author: jhu): Hi, [~sowen] I tested more cases: # only change the {{newlist.headOption.orElse(oldstate)}} to {{Some(a)}}, the issue still exists # only change the streaming batch interval to {{2 seconds}}, keep the {{newlist.headOption.orElse(oldstate)}} and checkpoint interval 10 seconds, the issue does not exist. So this issue may related to the checkpoint interval and batch interval. Stack overflow on updateStateByKey which followed by a dstream with checkpoint set -- Key: SPARK-6847 URL: https://issues.apache.org/jira/browse/SPARK-6847 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Jack Hu Labels: StackOverflowError, Streaming The issue happens with the following sample code: it uses {{updateStateByKey}} followed by a {{map}} with a checkpoint interval of 10 seconds
{code}
val sparkConf = new SparkConf().setAppName("test")
val streamingContext = new StreamingContext(sparkConf, Seconds(10))
streamingContext.checkpoint("checkpoint")
val source = streamingContext.socketTextStream("localhost", )
val updatedResult = source.map((1, _)).updateStateByKey(
  (newlist: Seq[String], oldstate: Option[String]) => newlist.headOption.orElse(oldstate))
updatedResult.map(_._2)
  .checkpoint(Seconds(10))
  .foreachRDD((rdd, t) => {
    println("Deep: " + rdd.toDebugString.split("\n").length)
    println(t.toString() + ": " + rdd.collect.length)
  })
streamingContext.start()
streamingContext.awaitTermination()
{code}
From the output, we can see that the dependency chain keeps growing over time, the {{updateStateByKey}} state never gets checkpointed, and finally the stack overflow happens. Note: * The RDD in {{updatedResult.map(_._2)}} does get checkpointed in this case, but not the {{updateStateByKey}} state * If the {{checkpoint(Seconds(10))}} on the map result ( {{updatedResult.map(_._2)}} ) is removed, the stack overflow will not happen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6151) schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size
[ https://issues.apache.org/jira/browse/SPARK-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491881#comment-14491881 ] Littlestar commented on SPARK-6151: --- The HDFS block size is set once when you first install Hadoop, but the block size can also be specified per file when the file is created: {{FSDataOutputStream org.apache.hadoop.fs.FileSystem.create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize) throws IOException}} schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size --- Key: SPARK-6151 URL: https://issues.apache.org/jira/browse/SPARK-6151 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Littlestar Priority: Trivial How can a schemaRDD written with saveAsParquetFile control the HDFS block size? Maybe a Configuration option is needed. Related questions by others: http://apache-spark-user-list.1001560.n3.nabble.com/HDFS-block-size-for-parquet-output-tt21183.html http://qnalist.com/questions/5054892/spark-sql-parquet-and-impala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
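One possible workaround, sketched under the assumption that the Parquet writer picks up the block size from the job's Hadoop configuration ({{dfs.blocksize}} is the standard HDFS client-side property; the output path is hypothetical):
{code}
// Raise the HDFS block size for files written by this job to 256 MB.
sc.hadoopConfiguration.setLong("dfs.blocksize", 256L * 1024 * 1024)
schemaRDD.saveAsParquetFile("/tmp/out.parquet")
{code}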
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491905#comment-14491905 ] Yu Ishikawa commented on SPARK-6682: [~josephkb] sounds great. As you're suggesting, we should gradually tackle each algorithm one by one. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6562. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Reynold Xin DataFrame.na.replace value support -- Key: SPARK-6562 URL: https://issues.apache.org/jira/browse/SPARK-6562 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.4.0 Support replacing a set of values with another set of values (i.e. map join), similar to Pandas' replace. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
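For reference, the resolved Scala API looks roughly like this (column names and values below are illustrative):
{code}
// Replace values in one column, or apply the same mapping across several columns.
val df2 = df.na.replace("height", Map(0.0 -> Double.NaN))
val df3 = df.na.replace(Seq("firstname", "lastname"), Map("Alice" -> "A"))
{code}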
[jira] [Updated] (SPARK-6858) Register Java HashMap for SparkSqlSerializer
[ https://issues.apache.org/jira/browse/SPARK-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6858: --- Assignee: Liang-Chi Hsieh Register Java HashMap for SparkSqlSerializer Key: SPARK-6858 URL: https://issues.apache.org/jira/browse/SPARK-6858 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Trivial Fix For: 1.4.0 Since the Kryo serializer is now used for {{GeneralHashedRelation}} whether or not Kryo is enabled, it is better to register Java {{HashMap}} in {{SparkSqlSerializer}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
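The gist of the change, as a minimal sketch (the surrounding registrator plumbing is assumed, not shown in this issue):
{code}
import com.esotericsoftware.kryo.Kryo

// In SparkSqlSerializer's Kryo setup: register java.util.HashMap explicitly so
// Kryo doesn't fall back to slower generic serialization for it.
def registerSqlClasses(kryo: Kryo): Unit = {
  kryo.register(classOf[java.util.HashMap[_, _]])
}
{code}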
[jira] [Resolved] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4760. Resolution: Not A Problem ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files -- Key: SPARK-4760 URL: https://issues.apache.org/jira/browse/SPARK-4760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Priority: Critical Fix For: 1.3.0 In an older Spark version built around Oct. 12, I was able to use ANALYZE TABLE table COMPUTE STATISTICS noscan to get the estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, and this is crucial to me.) In the more recent Spark builds, it fails to estimate the table size unless I remove noscan. Here are the statistics I got using DESC EXTENDED: old: parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} new: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} And I've tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf and the result is unaffected (in both versions). Looks like the Parquet support in the new Hive (0.13.1) is broken? Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6611) Add support for INTEGER as synonym of INT to DDLParser
[ https://issues.apache.org/jira/browse/SPARK-6611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6611: --- Assignee: Santiago M. Mola Add support for INTEGER as synonym of INT to DDLParser -- Key: SPARK-6611 URL: https://issues.apache.org/jira/browse/SPARK-6611 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Santiago M. Mola Priority: Minor Fix For: 1.4.0 Add support for INTEGER as synonym of INT to DDLParser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
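Once this is in, both spellings should parse in data source DDL; an illustrative statement (the table, data source, and path below are hypothetical):
{code}
sqlContext.sql(
  """CREATE TEMPORARY TABLE people (id INTEGER, age INT)
    |USING org.apache.spark.sql.json
    |OPTIONS (path 'people.json')""".stripMargin)
{code}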
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491834#comment-14491834 ] Kannan Rajah commented on SPARK-1529: - Thanks. FYI, I have pushed a few more commits to my repo to handle all the TODOs and bug fixes, so you can track this branch for all the changes: https://github.com/rkannan82/spark/commits/dfs_shuffle Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1227) Diagnostics for ClassificationRegression
[ https://issues.apache.org/jira/browse/SPARK-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491895#comment-14491895 ] Joseph K. Bradley commented on SPARK-1227: -- I agree it will be nice to provide loss classes. Even though *Metrics classes exist already, loss classes might be nice as we provide more functionality for diagnosis during learning (e.g., for early stopping, model selection, etc.). Added link to related JIRA on optimization APIs. Diagnostics for ClassificationRegression - Key: SPARK-1227 URL: https://issues.apache.org/jira/browse/SPARK-1227 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Martin Jaggi Assignee: Martin Jaggi Currently, the attained objective function is not computed (for efficiency reasons, as one evaluation requires one full pass through the data). For diagnostics and comparing different algorithms, we should however provide this as a separate function (one MR). Doing this requires the loss and regularizer functions themselves, not only their gradients (which are currently in the Gradient class). How about adding the new function directly on the corresponding models in classification/* and regression/* ? Any thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
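As a sketch of what such loss classes could look like (the names here are illustrative, not an agreed-upon API):
{code}
trait Loss extends Serializable {
  def loss(prediction: Double, label: Double): Double     // objective value, for diagnostics
  def gradient(prediction: Double, label: Double): Double // what the Gradient class already covers
}

object SquaredLoss extends Loss {
  def loss(p: Double, y: Double): Double = 0.5 * (p - y) * (p - y)
  def gradient(p: Double, y: Double): Double = p - y
}
{code}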
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491919#comment-14491919 ] Michael Kuhlen commented on SPARK-3727: --- Hello! I've implemented predictWithProbabilities() methods for DecisionTreeModel and treeEnsembleModels in scala. These methods return both the most likely class as well as the probabilities of each of the classes. As in scikit-learn, the probabilities are defined as the mean predicted class probabilities of the trees in the forest\[, where the\] class probability of a single tree is the fraction of samples of the same class in a leaf. ([sklearn.ensemble.RandomForestClassifier.predict_proba|http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba]) My approach was to modify the Predict class to hold the class probabilities for all classes (as opposed to just of the majority class), and I utilize these probabilities to determine the means over all trees. I believe this should work for GBTrees as well, since I'm taking care to weight the probabilities by the weight of each tree (=1.0 for RandomForest). Here's a [link to my fork|https://github.com/apache/spark/compare/master...mqk:master] showing my modifications. I would be happy to issue a pull request for these changes, if that would be of interest to the community. Although I haven't done so yet, I believe it should be straightforward to extend this to also calculate the variance of estimates for regression algorithms, as suggested in this ticket. Best, Mike DecisionTree, RandomForest: More prediction functionality - Key: SPARK-3727 URL: https://issues.apache.org/jira/browse/SPARK-3727 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful. For classification: estimated probability of each possible label For regression: variance of estimate RandomForest could also create aggregate predictions in multiple ways: * Predict mean or median value for regression. * Compute variance of estimates (across all trees) for both classification and regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
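The averaging described in the comment above can be sketched as follows (illustrative only, not the code in the linked fork): each tree contributes a per-class probability vector derived from leaf class fractions, and the ensemble prediction is their weighted mean (weights are 1.0 for RandomForest).
{code}
def averageProbabilities(
    treeProbs: Seq[Array[Double]], // per-tree class probabilities
    treeWeights: Seq[Double]): (Int, Array[Double]) = {
  val numClasses = treeProbs.head.length
  val avg = new Array[Double](numClasses)
  for ((probs, w) <- treeProbs.zip(treeWeights); c <- 0 until numClasses)
    avg(c) += w * probs(c)
  val totalWeight = treeWeights.sum
  for (c <- 0 until numClasses) avg(c) /= totalWeight
  (avg.indices.maxBy(avg(_)), avg) // (most likely class, class probabilities)
}
{code}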
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491863#comment-14491863 ] Patrick Wendell commented on SPARK-1529: Hey Kannan, We originally considered doing something like you are proposing, where we would change our filesystem interactions to all use a Hadoop FileSystem class and then we'd use Hadoop's LocalFileSystem. However, there were two issues: 1. We used POSIX APIs that are not present in Hadoop. For instance, we use memory mapping in various places, FileChannel in the BlockObjectWriter, etc. 2. Using LocalFileSystem has substantial performance overhead compared with our current code. So we'd have to write our own implementation of a local filesystem. For this reason, we decided that our current shuffle machinery was fundamentally not usable for non-POSIX environments. So we decided that instead, we'd let people customize shuffle behavior at a higher level, and we implemented the pluggable shuffle components. So you can create a shuffle manager that is specifically optimized for a particular environment (e.g. MapR). Did you consider implementing a MapR shuffle using that mechanism instead? You'd have to operate at a higher level, where you reason about shuffle records, etc. But you'd have a lot of flexibility to optimize within that. Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4081) Categorical feature indexing
[ https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4081. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 3000 [https://github.com/apache/spark/pull/3000] Categorical feature indexing Key: SPARK-4081 URL: https://issues.apache.org/jira/browse/SPARK-4081 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.4.0 DecisionTree and RandomForest require that categorical features and labels be indexed 0, 1, 2, .... There is currently no code to aid with indexing a dataset. This is a proposal for a helper class for computing indices (and also deciding which features to treat as categorical). Proposed functionality: * This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter. * This can also map categorical feature values to 0-based indices. Usage:
{code}
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
val datasetIndexer = new DatasetIndexer(maxCategories)
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
val categoricalFeaturesInfo: Map[Double, Int] = datasetIndexer.getCategoricalFeatureIndexes()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6869: - Component/s: PySpark Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6870: - Component/s: YARN Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Components: YARN Reporter: Weizhong Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
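A minimal sketch of the fix described above (the shape of the monitor thread is assumed, not taken from the PR):
{code}
val monitorThread = new Thread("YARN application state monitor") {
  override def run(): Unit = {
    try {
      while (!Thread.currentThread().isInterrupted) {
        // ... poll the YARN application state ...
        Thread.sleep(1000)
      }
    } catch {
      // Expected when the thread is interrupted on shutdown: exit quietly
      // instead of letting the stack trace reach the log.
      case _: InterruptedException =>
    }
  }
}
{code}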
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491869#comment-14491869 ] Kannan Rajah commented on SPARK-1529: - [~pwendell] The default code path still uses the FileChannel and memory mapping techniques. I just provided an abstraction called FileSystem.scala (not Hadoop's FileSystem.java). LocalFileSystem.scala delegates the call to the existing Spark code path that uses FileChannel. I am using Hadoop's RawLocalFileSystem class just to get an InputStream and OutputStream, and this internally also uses FileChannel. Please see RawLocalFileSystem.LocalFSFileInputStream; it is just a wrapper on java.io.FileInputStream. Going back to why I considered this approach: it will allow us to reuse all the logic currently used by the SortShuffle code path. Otherwise, we would have to reimplement pretty much everything Spark already does to do the shuffle on HDFS. We are in the process of running some performance tests to understand the impact of the change. One of the main things we will be verifying is whether any performance degradation has been introduced in the default code path, and fixing it if there is any. Is this acceptable? Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6765) Turn scalastyle on for test code
[ https://issues.apache.org/jira/browse/SPARK-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491893#comment-14491893 ] Apache Spark commented on SPARK-6765: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5486 Turn scalastyle on for test code Key: SPARK-6765 URL: https://issues.apache.org/jira/browse/SPARK-6765 Project: Spark Issue Type: Improvement Components: Project Infra, Tests Reporter: Reynold Xin Assignee: Reynold Xin We should turn scalastyle on for test code. Test code should be as important as main code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491891#comment-14491891 ] Joseph K. Bradley commented on SPARK-5256: -- Added link to [SPARK-1227], which discusses ML diagnostics and brings up the question of what loss functions should be provided as Loss classes rather than via the ClassificationMetrics and RegressionMetrics classes. Improving MLlib optimization APIs - Key: SPARK-5256 URL: https://issues.apache.org/jira/browse/SPARK-5256 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491906#comment-14491906 ] Yu Ishikawa commented on SPARK-6682: [~avulanov] thank you for your answer. And I understand SPARK-5256 blocks this issue. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-4760: ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files -- Key: SPARK-4760 URL: https://issues.apache.org/jira/browse/SPARK-4760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Priority: Critical Fix For: 1.3.0 In an older Spark version built around Oct. 12, I was able to use ANALYZE TABLE table COMPUTE STATISTICS noscan to get the estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, and this is crucial to me.) In the more recent Spark builds, it fails to estimate the table size unless I remove noscan. Here are the statistics I got using DESC EXTENDED: old: parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166} new: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1} And I've tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf and the result is unaffected (in both versions). Looks like the Parquet support in the new Hive (0.13.1) is broken? Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6179) Support SHOW PRINCIPALS role_name;
[ https://issues.apache.org/jira/browse/SPARK-6179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6179: --- Assignee: Zhongshuai Pei Support SHOW PRINCIPALS role_name; Key: SPARK-6179 URL: https://issues.apache.org/jira/browse/SPARK-6179 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Zhongshuai Pei Assignee: Zhongshuai Pei Fix For: 1.4.0 SHOW PRINCIPALS role_name; Lists all roles and users who belong to this role. Only the admin role has privilege for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6199) Support CTE
[ https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6199: --- Assignee: Cheng Hao Support CTE --- Key: SPARK-6199 URL: https://issues.apache.org/jira/browse/SPARK-6199 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Assignee: Cheng Hao Fix For: 1.4.0 Support CTE in SQLContext and HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491877#comment-14491877 ] Ilya Ganelin commented on SPARK-6703: - Patrick - I can look into this. Thank you. Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc. where there is a shared SparkContext. It would be nice to provide a rendez-vous point so that applications can learn whether an existing SparkContext already exists before creating one. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyways, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
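A minimal sketch of the proposed rendez-vous point (names follow the description above; this is not a merged implementation):
{code}
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextRegistry {
  private val active = new AtomicReference[SparkContext]()

  def getOrCreate(conf: SparkConf): SparkContext = synchronized {
    Option(active.get()).getOrElse {
      val sc = new SparkContext(conf)
      active.set(sc)
      sc
    }
  }

  // Setter so an outer framework/server can share one context with
  // multiple downstream applications.
  def setActive(sc: SparkContext): Unit = active.set(sc)
}
{code}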
[jira] [Updated] (SPARK-6865) Decide on semantics for string identifiers in DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6865: --- Summary: Decide on semantics for string identifiers in DataFrame API (was: Decide on semantics for string identifiers in DataSource API) Decide on semantics for string identifiers in DataFrame API --- Key: SPARK-6865 URL: https://issues.apache.org/jira/browse/SPARK-6865 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Blocker There are two options:
- Quoted identifiers: meaning that the strings are treated as though they were in backticks in SQL. Any weird characters (spaces, etc.) are considered part of the identifier. Kind of weird given that `*` is already a special identifier explicitly allowed by the API.
- Unquoted parsed identifiers: would allow users to specify things like tableAlias.*. However, this would also require explicit use of `backticks` for identifiers with weird characters in them.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6876) DataFrame.na.replace value support for Python
Reynold Xin created SPARK-6876: -- Summary: DataFrame.na.replace value support for Python Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6863) Formatted list broken on Hive compatibility section of SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6863: --- Assignee: Santiago M. Mola Formatted list broken on Hive compatibility section of SQL programming guide Key: SPARK-6863 URL: https://issues.apache.org/jira/browse/SPARK-6863 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Santiago M. Mola Priority: Trivial Fix For: 1.3.1, 1.4.0 Formatted list broken on Hive compatibility section of SQL programming guide. It does not appear as a list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491857#comment-14491857 ] Guoqiang Li commented on SPARK-3937: Get data:
{code:none}
wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kdda.bz2
{code}
Get code and build it:
{code:none}
git clone https://github.com/cloudml/zen.git
mvn -DskipTests clean package
{code}
spark-defaults.conf:
{code:none}
spark.yarn.dist.archives hdfs://ns1:8020/input/lbs/recommend/toona/spark/conf
spark.yarn.user.classpath.first true
spark.cleaner.referenceTracking.blocking true
spark.cleaner.referenceTracking.cleanCheckpoints true
spark.cleaner.referenceTracking.blocking.shuffle true
spark.yarn.historyServer.address 10dian71:18080
spark.executor.cores 2
spark.yarn.executor.memoryOverhead 1
spark.yarn.driver.memoryOverhead 1
spark.executor.instances 36
spark.rdd.compress true
spark.executor.memory 4g
spark.akka.frameSize 20
spark.akka.askTimeout 120
spark.akka.timeout 120
spark.default.parallelism 72
spark.locality.wait 1
spark.core.connection.ack.wait.timeout 360
spark.storage.memoryFraction 0.1
spark.broadcast.factory org.apache.spark.broadcast.TorrentBroadcastFactory
spark.driver.maxResultSize 4000
#spark.shuffle.blockTransferService nio
#spark.akka.heartbeat.interval 100
#spark.kryoserializer.buffer.max.mb 128
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.apache.spark.graphx.GraphKryoRegistrator
#spark.kryo.registrator com.github.cloudml.zen.ml.clustering.LDAKryoRegistrator
{code}
Reproduce:
{code:none}
./bin/spark-shell --master yarn-client --driver-memory 8g --jars /opt/spark/classes/zen-assembly.jar
{code}
{code:none}
import com.github.cloudml.zen.ml.regression.LogisticRegression
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.regression.LabeledPoint

val dataSet = MLUtils.loadLibSVMFile(sc, "/input/lbs/recommend/kdda/*").repartition(72).cache()
val numIterations = 150
val stepSize = 0.1
val l1 = 0.0
val epsilon = 1e-6
val useAdaGrad = false
LogisticRegression.trainMIS(dataSet, numIterations, stepSize, l1, epsilon, useAdaGrad)
{code}
Unsafe memory access inside of Snappy library - Key: SPARK-3937 URL: https://issues.apache.org/jira/browse/SPARK-3937 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.3.0 Reporter: Patrick Wendell This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't have much information about this other than the stack trace. However, it was concerning enough that I figured I should post it.
{code}
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
{code}
[jira] [Commented] (SPARK-6823) Add a model.matrix like capability to DataFrames (modelDataFrame)
[ https://issues.apache.org/jira/browse/SPARK-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491885#comment-14491885 ] Joseph K. Bradley commented on SPARK-6823: -- This sounds like it would be covered by the OneHotEncoder + VectorAssembler feature transformers: * [https://issues.apache.org/jira/browse/SPARK-5888] * [https://issues.apache.org/jira/browse/SPARK-5885] Do you think these belong within DataFrame (and that this JIRA should be for SQL instead of ML)? Add a model.matrix like capability to DataFrames (modelDataFrame) - Key: SPARK-6823 URL: https://issues.apache.org/jira/browse/SPARK-6823 Project: Spark Issue Type: New Feature Components: ML, SparkR Reporter: Shivaram Venkataraman Currently Mllib modeling tools work only with double data. However, data tables in practice often have a set of categorical fields (factors in R), that need to be converted to a set of 0/1 indicator variables (making the data actually used in a modeling algorithm completely numeric). In R, this is handled in modeling functions using the model.matrix function. Similar functionality needs to be available within Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
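If those two transformers cover it, the flow could look like this (a sketch assuming the 1.4-era spark.ml feature API; column names are illustrative):
{code}
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}

val encoder = new OneHotEncoder()
  .setInputCol("countryIndex") // indexed categorical column
  .setOutputCol("countryVec")  // 0/1 indicator vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "countryVec"))
  .setOutputCol("features")    // fully numeric feature vector
{code}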
[jira] [Resolved] (SPARK-5885) Add VectorAssembler
[ https://issues.apache.org/jira/browse/SPARK-5885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5885. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5196 [https://github.com/apache/spark/pull/5196] Add VectorAssembler --- Key: SPARK-5885 URL: https://issues.apache.org/jira/browse/SPARK-5885 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 `VectorAssembler` takes a list of columns (of type double/int/vector) and merges them into a single vector column.
{code}
val va = new VectorAssembler()
  .setInputCols("userFeatures", "dayOfWeek", "timeOfDay")
  .setOutputCol("features")
{code}
In the first version, it should be okay if it doesn't handle ML attributes (SPARK-4588). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5886) Add LabelIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5886. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4735 [https://github.com/apache/spark/pull/4735] Add LabelIndexer Key: SPARK-5886 URL: https://issues.apache.org/jira/browse/SPARK-5886 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 `LabelIndexer` takes a column of labels (raw categories) and outputs an integer column with labels indexed by their frequency.
{code}
val li = new LabelIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}
In the output column, we should store the label-to-index map as an ML attribute. The index should be ordered by frequency, where the most frequent label gets index 0, to enhance sparsity. We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dean Chen updated SPARK-6868: - Comment: was deleted (was: https://github.com/apache/spark/pull/5477) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
Dean Chen created SPARK-6868: Summary: Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1, 1.2.0, 1.1.1, 1.1.0 Reporter: Dean Chen The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
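A sketch of the proposed scheme selection, mirroring the JobBlock logic linked above. This is illustrative only, not the actual patch; the helper name is invented, though YarnConfiguration.useHttps is a real Hadoop API:
{code}
// Illustrative only: choose the link prefix from the YARN http policy.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.conf.YarnConfiguration

def logUriScheme(conf: Configuration): String =
  if (YarnConfiguration.useHttps(conf)) "https://" else "http://"
{code}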
[jira] [Assigned] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6869: --- Assignee: Apache Spark Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Reporter: Weizhong Assignee: Apache Spark Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
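A minimal sketch of the idea, with illustrative names (the real change lives in the YARN executor launch path, and the map shown here stands in for the executor's launch environment):
{code}
// Illustrative: forward the driver-side PYTHONPATH (e.g. set in spark-env.sh)
// into the executor's launch environment, so the executor's Python worker can
// import pyspark from the local file system instead of from the assembly jar.
val executorEnv = collection.mutable.Map[String, String]()
sys.env.get("PYTHONPATH").foreach(p => executorEnv("PYTHONPATH") = p)
{code}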
[jira] [Commented] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491373#comment-14491373 ] Apache Spark commented on SPARK-6869: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/5478 Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6869: --- Assignee: (was: Apache Spark) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
Weizhong created SPARK-6870: --- Summary: Catch InterruptedException when yarn application state monitor thread been interrupted Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
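A sketch of the intended handling, assuming a monitor thread shaped like the one added in PR #5305 (the body of the loop is a placeholder):
{code}
// Illustrative: exit quietly when the monitor thread is interrupted,
// instead of letting the InterruptedException stack trace hit the logs.
val monitorThread = new Thread("yarn-state-monitor") {
  override def run(): Unit = {
    try {
      while (!Thread.currentThread().isInterrupted) {
        // ... poll the YARN application state here ...
        Thread.sleep(1000)
      }
    } catch {
      case _: InterruptedException => // expected on shutdown; exit quietly
    }
  }
}
{code}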
[jira] [Updated] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dean Chen updated SPARK-6868: - Attachment: Screen Shot 2015-04-11 at 11.49.21 PM.png Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6868: --- Assignee: Apache Spark Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Assignee: Apache Spark Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491354#comment-14491354 ] Dean Chen commented on SPARK-6868: -- https://github.com/apache/spark/pull/5477 Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6868: --- Assignee: (was: Apache Spark) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491355#comment-14491355 ] Apache Spark commented on SPARK-6868: - User 'deanchen' has created a pull request for this issue: https://github.com/apache/spark/pull/5477 Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY
[ https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dean Chen updated SPARK-6868: - Component/s: (was: Spark Core) YARN Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY --- Key: SPARK-6868 URL: https://issues.apache.org/jira/browse/SPARK-6868 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0 Reporter: Dean Chen Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png The stdout and stderr log links on the executor page use the http:// prefix even if the node manager does not support http but only https (via yarn.http.policy=HTTPS_ONLY). Unfortunately, the unencrypted http link in that case does not return a 404 but a binary file containing random binary chars. This causes a lot of confusion for the end user, since it seems like the log file exists and is just filled with garbage (see attached screenshot). The fix is to prefix container log links with https:// instead of http:// when yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic, as seen here: https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6869) Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node
Weizhong created SPARK-6869: --- Summary: Pass PYTHONPATH to executor, so that executor can read pyspark file from local file system on executor node Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on YARN cannot work when the assembly jar is packaged by JDK 1.7+, so pass the PYTHONPATH (set in spark-env.sh) to the executor so that the executor's Python process can read pyspark files from the local file system rather than from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6870: --- Assignee: Apache Spark Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Reporter: Weizhong Assignee: Apache Spark Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491380#comment-14491380 ] Apache Spark commented on SPARK-6870: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/5479 Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted
[ https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6870: --- Assignee: (was: Apache Spark) Catch InterruptedException when yarn application state monitor thread been interrupted -- Key: SPARK-6870 URL: https://issues.apache.org/jira/browse/SPARK-6870 Project: Spark Issue Type: Improvement Reporter: Weizhong Priority: Minor In PR #5305 we interrupt the monitor thread but forget to catch the InterruptedException; the stack trace is then printed in the log, so we need to catch it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6866) Cleanup duplicated dependency in pom.xml
[ https://issues.apache.org/jira/browse/SPARK-6866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6866: - Due Date: (was: 15/Apr/15) Priority: Trivial (was: Minor) Assignee: Guancheng Chen Cleanup duplicated dependency in pom.xml Key: SPARK-6866 URL: https://issues.apache.org/jira/browse/SPARK-6866 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Guancheng Chen Assignee: Guancheng Chen Priority: Trivial Labels: build, maven Fix For: 1.4.0 It turns out launcher/pom.xml has a duplicated scalatest dependency. We should remove it from this child pom.xml since it already inherits the dependency from the parent pom.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6866) Cleanup duplicated dependency in pom.xml
[ https://issues.apache.org/jira/browse/SPARK-6866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6866. -- Resolution: Fixed Issue resolved by pull request 5476 [https://github.com/apache/spark/pull/5476] Cleanup duplicated dependency in pom.xml Key: SPARK-6866 URL: https://issues.apache.org/jira/browse/SPARK-6866 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Guancheng Chen Priority: Minor Labels: build, maven Fix For: 1.4.0 It turns out launcher/pom.xml has a duplicated scalatest dependency. We should remove it from this child pom.xml since it already inherits the dependency from the parent pom.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-761) Print a nicer error message when incompatible Spark binaries try to talk
[ https://issues.apache.org/jira/browse/SPARK-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491420#comment-14491420 ] Harsh Gupta commented on SPARK-761: --- [~aash] How do I do a compatibility check on the API over which they talk? Can you give a bit more specific detail on how to proceed? I can do it as a starter task to understand the core of Spark's functioning, and that will get me going. Print a nicer error message when incompatible Spark binaries try to talk Key: SPARK-761 URL: https://issues.apache.org/jira/browse/SPARK-761 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matei Zaharia Priority: Minor Labels: starter As a starter task, it would be good to audit the current behavior for different client-server pairs with respect to how exceptions occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6842) mvn -DskipTests clean package fails
[ https://issues.apache.org/jira/browse/SPARK-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sree Vaddi closed SPARK-6842. - Build successful. mvn -DskipTests clean package fails --- Key: SPARK-6842 URL: https://issues.apache.org/jira/browse/SPARK-6842 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: CentOS v7 Oracle JDK 8 w/ Unlimited Strength Crypto jars Reporter: Sree Vaddi Priority: Blocker Attachments: mvn.clean.package.log Fork on github $ git clone https://github.com/userid/spark.git $ cd spark $ mvn -DskipTests clean package ... ... wait 39 minutes === My diagnosis: By default, I am on the 'master' branch. Usually, 'master' branches are highly volatile. Maybe I should try 'branch-1.3'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6871) WITH clause in CTE can not following another WITH clause
Liang-Chi Hsieh created SPARK-6871: -- Summary: WITH clause in CTE can not following another WITH clause Key: SPARK-6871 URL: https://issues.apache.org/jira/browse/SPARK-6871 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh For example, this sql query WITH q1 AS (SELECT * FROM testData) WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2 should not be successfully parsed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
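For contrast, the standard way to chain CTEs is a single WITH clause with comma-separated definitions. A sketch, assuming a SQLContext or HiveContext named sqlContext and the same testData table:
{code}
// Valid: one WITH clause, multiple comma-separated CTE definitions.
val df = sqlContext.sql("""
  WITH q1 AS (SELECT * FROM testData),
       q2 AS (SELECT * FROM q1)
  SELECT * FROM q2
""")
{code}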
[jira] [Commented] (SPARK-6871) WITH clause in CTE can not following another WITH clause
[ https://issues.apache.org/jira/browse/SPARK-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491401#comment-14491401 ] Apache Spark commented on SPARK-6871: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5480 WITH clause in CTE can not following another WITH clause Key: SPARK-6871 URL: https://issues.apache.org/jira/browse/SPARK-6871 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh For example, this sql query WITH q1 AS (SELECT * FROM testData) WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2 should not be successfully parsed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6871) WITH clause in CTE can not following another WITH clause
[ https://issues.apache.org/jira/browse/SPARK-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6871: --- Assignee: (was: Apache Spark) WITH clause in CTE can not following another WITH clause Key: SPARK-6871 URL: https://issues.apache.org/jira/browse/SPARK-6871 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh For example, this sql query WITH q1 AS (SELECT * FROM testData) WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2 should not be successfully parsed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6545) Minor changes for CompactBuffer
[ https://issues.apache.org/jira/browse/SPARK-6545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6545. -- Resolution: Won't Fix I think this is WontFix given https://github.com/apache/spark/pull/5199 but reopen if I misunderstood. Minor changes for CompactBuffer --- Key: SPARK-6545 URL: https://issues.apache.org/jira/browse/SPARK-6545 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor HashedRelation should always return a non-null CompactBuffer, which will be helpful for the further improvement of multi-way join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1303) Added discretization capability to MLlib.
[ https://issues.apache.org/jira/browse/SPARK-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1303. -- Resolution: Won't Fix Sounds like this should start outside MLlib: https://github.com/apache/spark/pull/216 Added discretization capability to MLlib. - Key: SPARK-1303 URL: https://issues.apache.org/jira/browse/SPARK-1303 Project: Spark Issue Type: New Feature Components: MLlib Reporter: LIDIAgroup Some time ago, we discussed with Ameet Talwalkar the possibility of including both Feature Selection and Discretization algorithms in MLlib. In this patch we've implemented Entropy Minimization Discretization, following the algorithm described in the paper Multi-interval discretization of continuous-valued attributes for classification learning by Fayyad and Irani (1993). This is one of the most used discretizers and is already included in most libraries, like Weka. This can be used as a base for FS algorithms and the NaiveBayes already included in MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6864) Spark's Multilabel Classifier runs out of memory on small datasets
[ https://issues.apache.org/jira/browse/SPARK-6864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491437#comment-14491437 ] Sean Owen commented on SPARK-6864: -- I believe this is the *driver* process running out of memory. You have massive executors but the driver is probably still on 512MB of RAM. Try increasing that. I think everything else like your executors and data size is irrelevant then and orders of magnitude larger than is needed for this data set. Spark's Multilabel Classifier runs out of memory on small datasets -- Key: SPARK-6864 URL: https://issues.apache.org/jira/browse/SPARK-6864 Project: Spark Issue Type: Test Components: MLlib Affects Versions: 1.2.1 Environment: EC2 with 8-96 instances up to r3.4xlarge The test fails on every configuration Reporter: John Canny Fix For: 1.2.1 When trying to run Spark's MultiLabel classifier (LogisticRegressionWithLBFGS) on the RCV1 V2 dataset (about 0.5GB, 100 labels), the classifier runs out of memory. The number of tasks per executor doesn't seem to matter. It happens even with a single task per 120 GB executor. The dataset is the concatenation of the test files from the rcv1v2 (topics; full sets) group here: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html Here's the code: import org.apache.spark.SparkContext import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics import org.apache.spark.mllib.optimization.L1Updater import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.util.MLUtils import scala.compat.Platform._ val nnodes = 8 val t0=currentTime // Load training data in LIBSVM format. val train = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1train.libsvm", true, 276544, nnodes) val test = MLUtils.loadLibSVMFile(sc, "s3n://bidmach/RCV1test.libsvm", true, 276544, nnodes) val t1=currentTime; val lrAlg = new LogisticRegressionWithLBFGS() lrAlg.setNumClasses(100).optimizer. setNumIterations(10). setRegParam(1e-10). setUpdater(new L1Updater) // Run training algorithm to build the model val model = lrAlg.run(train) val t2=currentTime -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
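If the driver-memory diagnosis holds, this is a one-setting change; 8g below is an arbitrary example size, not a recommendation from the thread:
{code}
import org.apache.spark.SparkConf

// Note: spark.driver.memory must be set before the driver JVM starts, so in
// client mode pass --driver-memory 8g to spark-submit instead of using SparkConf.
val conf = new SparkConf().set("spark.driver.memory", "8g")
{code}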
[jira] [Assigned] (SPARK-6871) WITH clause in CTE can not following another WITH clause
[ https://issues.apache.org/jira/browse/SPARK-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6871: --- Assignee: Apache Spark WITH clause in CTE can not following another WITH clause Key: SPARK-6871 URL: https://issues.apache.org/jira/browse/SPARK-6871 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Apache Spark For example, this sql query WITH q1 AS (SELECT * FROM testData) WITH q2 AS (SELECT * FROM q1) SELECT * FROM q2 should not be successfully parsed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6677) pyspark.sql nondeterministic issue with row fields
[ https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491459#comment-14491459 ] Stefano Parmesan commented on SPARK-6677: - Glad it helped! We're very eager to try it out. pyspark.sql nondeterministic issue with row fields -- Key: SPARK-6677 URL: https://issues.apache.org/jira/browse/SPARK-6677 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0 Environment: spark version: spark-1.3.0-bin-hadoop2.4 python version: Python 2.7.6 operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux Reporter: Stefano Parmesan Assignee: Davies Liu Labels: pyspark, row, sql Fix For: 1.3.1, 1.4.0 The following issue happens only when running pyspark in the python interpreter; it works correctly with spark-submit. Reading two json files containing objects with different structures sometimes leads to the definition of wrong Rows, where the fields of one file are used for the other one. I was able to write sample code that reproduces this issue one out of three times; the code snippet is available at the following link, together with some (very simple) data samples: https://gist.github.com/armisael/e08bb4567d0a11efe2db -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6867) Dropout regularization
[ https://issues.apache.org/jira/browse/SPARK-6867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6867: - Target Version/s: (was: 1.4.0) Dropout regularization -- Key: SPARK-6867 URL: https://issues.apache.org/jira/browse/SPARK-6867 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Rakesh Chalasani Priority: Minor Linear models in MLlib so far support no regularization, L1, and L2. Another, more recently popularized, method for regularization is dropout [http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf]. Dropout regularization basically randomly omits some of the input features at each iteration. Though this approach is particularly used in training deep networks, it could also be very useful for linear models, as it promotes adaptive regularization. This approach is particularly useful in NLP [http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf] and, because of its simplicity, can be easily adopted for streaming linear models as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
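A toy sketch of the core idea, not tied to any MLlib API: each feature is dropped with probability p at every iteration, and the inverted scaling by 1/(1-p) keeps the expected feature value unchanged:
{code}
import scala.util.Random

// Zero out each feature with probability p; scale survivors so that
// E[output] == input (the usual "inverted dropout" convention).
def dropout(features: Array[Double], p: Double, rng: Random): Array[Double] =
  features.map(x => if (rng.nextDouble() < p) 0.0 else x / (1.0 - p))
{code}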
[jira] [Resolved] (SPARK-6843) Potential visibility problem for the state of Executor
[ https://issues.apache.org/jira/browse/SPARK-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6843. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5448 [https://github.com/apache/spark/pull/5448] Potential visibility problem for the state of Executor Key: SPARK-6843 URL: https://issues.apache.org/jira/browse/SPARK-6843 Project: Spark Issue Type: Bug Components: Spark Core Reporter: zhichao-li Priority: Minor Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6843) Potential visibility problem for the state of Executor
[ https://issues.apache.org/jira/browse/SPARK-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6843: - Priority: Trivial (was: Minor) Assignee: zhichao-li Potential visibility problem for the state of Executor Key: SPARK-6843 URL: https://issues.apache.org/jira/browse/SPARK-6843 Project: Spark Issue Type: Bug Components: Spark Core Reporter: zhichao-li Assignee: zhichao-li Priority: Trivial Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6842) mvn -DskipTests clean package fails
[ https://issues.apache.org/jira/browse/SPARK-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sree Vaddi updated SPARK-6842: -- Attachment: mvn.clean.package.log mvn package is successful on my machine now. Previously, I was working in a VM with the code on a shared file system from the local host. Instead, I checked out the code to the VM's local file system; no other changes. Attached the build log. High-level steps (useful for newbies): install VirtualBox, create a new VM, use CentOS v7, git clone, mvn package. mvn -DskipTests clean package fails --- Key: SPARK-6842 URL: https://issues.apache.org/jira/browse/SPARK-6842 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: CentOS v7 Oracle JDK 8 w/ Unlimited Strength Crypto jars Reporter: Sree Vaddi Priority: Blocker Attachments: mvn.clean.package.log Fork on github $ git clone https://github.com/userid/spark.git $ cd spark $ mvn -DskipTests clean package ... ... wait 39 minutes === My diagnosis: By default, I am on the 'master' branch. Usually, 'master' branches are highly volatile. Maybe I should try 'branch-1.3'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6151) schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size
[ https://issues.apache.org/jira/browse/SPARK-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491502#comment-14491502 ] Sree Vaddi commented on SPARK-6151: --- [~cnstar9988] The HDFS block size is set once, when you first install Hadoop. It is possible to change the HDFS block size in your Hadoop configuration and restart Hadoop for the change to take effect (read the literature and make sure you are comfortable before you make this change). Then you can run saveAsParquetFile(), which will now use the new HDFS block size. schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size --- Key: SPARK-6151 URL: https://issues.apache.org/jira/browse/SPARK-6151 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Littlestar Priority: Trivial How can a schemaRDD written to a parquet file with saveAsParquetFile control the HDFS block size? Maybe a Configuration is needed. Related questions by others: http://apache-spark-user-list.1001560.n3.nabble.com/HDFS-block-size-for-parquet-output-tt21183.html http://qnalist.com/questions/5054892/spark-sql-parquet-and-impala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
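A possible per-job alternative, under the assumption that the Hadoop client honors dfs.blocksize on the job's configuration rather than only cluster-wide (sc and schemaRDD are assumed to exist, as in the question):
{code}
// Assumption: dfs.blocksize is read from the client-side Hadoop configuration
// at write time, so setting it on the job's configuration affects new files.
sc.hadoopConfiguration.set("dfs.blocksize", (256L << 20).toString) // 256 MB

schemaRDD.saveAsParquetFile("hdfs:///path/out.parquet")
{code}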
[jira] [Created] (SPARK-6872) external sort need to copy
Adrian Wang created SPARK-6872: -- Summary: external sort need to copy Key: SPARK-6872 URL: https://issues.apache.org/jira/browse/SPARK-6872 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements
Sean Owen created SPARK-6873: Summary: Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements Key: SPARK-6873 URL: https://issues.apache.org/jira/browse/SPARK-6873 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.1 Reporter: Sean Owen Priority: Minor As I mentioned, I've been seeing 4 test failures in Hive tests for a while, and actually it still affects master. I think it's a superficial problem that only turns up when running on Java 8, but still, would probably be an easy fix and good to fix. Specifically, here are four tests and the bit that fails the comparison, below. I tried to diagnose this but had trouble even finding where some of this occurs, like the list of synonyms? {code} - show_tblproperties *** FAILED *** Results do not match for show_tblproperties: ... !== HIVE - 2 row(s) == == CATALYST - 2 row(s) == !tmp truebar bar value !bar bar value tmp true (HiveComparisonTest.scala:391) {code} {code} - show_create_table_serde *** FAILED *** Results do not match for show_create_table_serde: ... WITH SERDEPROPERTIES ( WITH SERDEPROPERTIES ( ! 'serialization.format'='$', 'field.delim'=',', ! 'field.delim'=',') 'serialization.format'='$') {code} {code} - udf_std *** FAILED *** Results do not match for udf_std: ... !== HIVE - 2 row(s) == == CATALYST - 2 row(s) == std(x) - Returns the standard deviation of a set of numbers std(x) - Returns the standard deviation of a set of numbers !Synonyms: stddev_pop, stddev Synonyms: stddev, stddev_pop (HiveComparisonTest.scala:391) {code} {code} - udf_stddev *** FAILED *** Results do not match for udf_stddev: ... !== HIVE - 2 row(s) ==== CATALYST - 2 row(s) == stddev(x) - Returns the standard deviation of a set of numbers stddev(x) - Returns the standard deviation of a set of numbers !Synonyms: stddev_pop, stdSynonyms: std, stddev_pop (HiveComparisonTest.scala:391) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
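One low-risk direction, if the differences really are only ordering: canonicalize the printed output before comparing. A purely illustrative sketch, not necessarily where the real fix belongs; it sorts rows (covering show_tblproperties) and sorts synonym tokens within a line (covering udf_std/udf_stddev):
{code}
// Illustrative normalization for the failures above.
def canonicalize(lines: Seq[String]): Seq[String] =
  lines.map { line =>
    if (line.startsWith("Synonyms: "))
      "Synonyms: " + line.stripPrefix("Synonyms: ").split(",\\s*").sorted.mkString(", ")
    else line
  }.sorted
{code}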
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491575#comment-14491575 ] Cheng Lian commented on SPARK-6859: --- A better way could be to make a defensive copy while inserting byte arrays into Parquet, so that we don't suffer a read performance regression. Parquet File Binary column statistics error when reuse byte[] among rows Key: SPARK-6859 URL: https://issues.apache.org/jira/browse/SPARK-6859 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Yijie Shen Priority: Minor Suppose I create a dataRDD which extends RDD\[Row\], and each row is GenericMutableRow(Array(Int, Array\[Byte\])). The same Array\[Byte\] object is reused among rows but has different content each time. When I convert it to a dataFrame and save it as a Parquet file, the file's row group statistics (max/min) of the Binary column are wrong. \\ \\ Here is the reason: in Parquet, BinaryStatistic just keeps max/min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array\[Byte\] passed from the row: max: Binary --(reference)--> ByteArrayBackedBinary --(backed by)--> Array\[Byte\]. Therefore, each time Parquet updates the row group's statistics, max/min always refer to the same Array\[Byte\], which has new content each time. When Parquet decides to save them into the file, the last row's content is saved as both max and min. \\ \\ It seems to be a Parquet bug, because it's Parquet's responsibility to update statistics correctly. But I'm not quite sure. Should I report it as a bug in the Parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
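A sketch of the defensive-copy idea; the helper name is invented, and Binary.fromByteArray reflects the parquet-mr API of that era, so treat the exact call as an assumption:
{code}
// Illustrative: freeze the reused buffer before handing it to Parquet, so the
// Binary kept by the column statistics no longer aliases the mutable array.
import parquet.io.api.Binary

def toStableBinary(reused: Array[Byte]): Binary =
  Binary.fromByteArray(java.util.Arrays.copyOf(reused, reused.length))
{code}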
[jira] [Commented] (SPARK-6872) external sort need to copy
[ https://issues.apache.org/jira/browse/SPARK-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491509#comment-14491509 ] Apache Spark commented on SPARK-6872: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/5481 external sort need to copy -- Key: SPARK-6872 URL: https://issues.apache.org/jira/browse/SPARK-6872 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6872) external sort need to copy
[ https://issues.apache.org/jira/browse/SPARK-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6872: --- Assignee: (was: Apache Spark) external sort need to copy -- Key: SPARK-6872 URL: https://issues.apache.org/jira/browse/SPARK-6872 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6872) external sort need to copy
[ https://issues.apache.org/jira/browse/SPARK-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6872: --- Assignee: Apache Spark external sort need to copy -- Key: SPARK-6872 URL: https://issues.apache.org/jira/browse/SPARK-6872 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
[ https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6431. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5454 [https://github.com/apache/spark/pull/5454] Couldn't find leader offsets exception when creating KafkaDirectStream -- Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto Fix For: 1.4.0 When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the kafka topic does not have any messages yet, I am getting the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream, everything works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5107) A trick log info for the start of Receiver
[ https://issues.apache.org/jira/browse/SPARK-5107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491626#comment-14491626 ] Sree Vaddi commented on SPARK-5107: --- [~srowen] This may be closed. I would do it myself, but I do not have edit permissions. A trick log info for the start of Receiver -- Key: SPARK-5107 URL: https://issues.apache.org/jira/browse/SPARK-5107 Project: Spark Issue Type: Improvement Components: Streaming Reporter: uncleGen Priority: Trivial A Receiver registers itself whenever it begins to start, but it is tricky that the same information is logged twice. In particular, at preStart() it also registers itself, so it looks as if the receiver has started twice. Just like: !https://raw.githubusercontent.com/uncleGen/Tech-Notes/master/3.JPG! We could log the information more clearly, for example the number of attempts to start. Of course, this affects neither performance nor usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491625#comment-14491625 ] Sree Vaddi commented on SPARK-5364: --- [~srowen] This may be closed. I would do it myself, but I do not have edit permissions. HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Trivial This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5364. Resolution: Duplicate Fix Version/s: (was: 1.3.1) 1.3.0 HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Trivial Fix For: 1.3.0 This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491558#comment-14491558 ] Cheng Lian commented on SPARK-6859: --- For 1.3 and prior versions, this issue isn't that serious, since strings are immutable. But in 1.4 we are adding a mutable UTF8String ([PR #5350|https://github.com/apache/spark/pull/5350]). Parquet File Binary column statistics error when reuse byte[] among rows Key: SPARK-6859 URL: https://issues.apache.org/jira/browse/SPARK-6859 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Yijie Shen Priority: Minor Suppose I create a dataRDD which extends RDD\[Row\], and each row is GenericMutableRow(Array(Int, Array\[Byte\])). The same Array\[Byte\] object is reused among rows but has different content each time. When I convert it to a dataFrame and save it as a Parquet file, the file's row group statistics (max/min) of the Binary column are wrong. \\ \\ Here is the reason: in Parquet, BinaryStatistic just keeps max/min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array\[Byte\] passed from the row: max: Binary --(reference)--> ByteArrayBackedBinary --(backed by)--> Array\[Byte\]. Therefore, each time Parquet updates the row group's statistics, max/min always refer to the same Array\[Byte\], which has new content each time. When Parquet decides to save them into the file, the last row's content is saved as both max and min. \\ \\ It seems to be a Parquet bug, because it's Parquet's responsibility to update statistics correctly. But I'm not quite sure. Should I report it as a bug in the Parquet JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements
[ https://issues.apache.org/jira/browse/SPARK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491567#comment-14491567 ] Sean Owen commented on SPARK-6873: -- CC [~lian cheng] [~marmbrus] as I bet this would be fairly easy to diagnose for someone close to the query planner / catalyst bits. Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements -- Key: SPARK-6873 URL: https://issues.apache.org/jira/browse/SPARK-6873 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.1 Reporter: Sean Owen Priority: Minor As I mentioned, I've been seeing 4 test failures in Hive tests for a while, and actually it still affects master. I think it's a superficial problem that only turns up when running on Java 8, but still, would probably be an easy fix and good to fix. Specifically, here are four tests and the bit that fails the comparison, below. I tried to diagnose this but had trouble even finding where some of this occurs, like the list of synonyms? {code} - show_tblproperties *** FAILED *** Results do not match for show_tblproperties: ... !== HIVE - 2 row(s) == == CATALYST - 2 row(s) == !tmptruebar bar value !barbar value tmp true (HiveComparisonTest.scala:391) {code} {code} - show_create_table_serde *** FAILED *** Results do not match for show_create_table_serde: ... WITH SERDEPROPERTIES ( WITH SERDEPROPERTIES ( ! 'serialization.format'='$', 'field.delim'=',', ! 'field.delim'=',') 'serialization.format'='$') {code} {code} - udf_std *** FAILED *** Results do not match for udf_std: ... !== HIVE - 2 row(s) == == CATALYST - 2 row(s) == std(x) - Returns the standard deviation of a set of numbers std(x) - Returns the standard deviation of a set of numbers !Synonyms: stddev_pop, stddev Synonyms: stddev, stddev_pop (HiveComparisonTest.scala:391) {code} {code} - udf_stddev *** FAILED *** Results do not match for udf_stddev: ... !== HIVE - 2 row(s) ==== CATALYST - 2 row(s) == stddev(x) - Returns the standard deviation of a set of numbers stddev(x) - Returns the standard deviation of a set of numbers !Synonyms: stddev_pop, stdSynonyms: std, stddev_pop (HiveComparisonTest.scala:391) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
[ https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6431: - Assignee: Cody Koeninger Couldn't find leader offsets exception when creating KafkaDirectStream -- Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto Assignee: Cody Koeninger Fix For: 1.4.0 When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the kafka topic does not have any messages yet, I am getting the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream, everything works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6874) Add support for SQL:2003 array type declaration syntax
[ https://issues.apache.org/jira/browse/SPARK-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6874: --- Assignee: (was: Apache Spark) Add support for SQL:2003 array type declaration syntax -- Key: SPARK-6874 URL: https://issues.apache.org/jira/browse/SPARK-6874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor As of SQL:2003, arrays are standard SQL types. However, the declaration syntax differs from Spark's CQL-like syntax. Examples of standard syntax: BIGINT ARRAY BIGINT ARRAY[100] BIGINT ARRAY[100] ARRAY[200] It would be great to support the standard syntax here. Some additional details that this addition should have IMO: - Forbid mixed syntax such as ARRAY<INT> ARRAY[100] or ARRAY<BIGINT> ARRAY[100] - Ignore the maximum capacity (ARRAY[N]) but allow it to be specified. This seems to be what others (i.e. PostgreSQL) are doing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6874) Add support for SQL:2003 array type declaration syntax
[ https://issues.apache.org/jira/browse/SPARK-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491573#comment-14491573 ] Apache Spark commented on SPARK-6874: - User 'smola' has created a pull request for this issue: https://github.com/apache/spark/pull/5483 Add support for SQL:2003 array type declaration syntax -- Key: SPARK-6874 URL: https://issues.apache.org/jira/browse/SPARK-6874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor As of SQL:2003, arrays are standard SQL types. However, the declaration syntax differs from Spark's CQL-like syntax. Examples of standard syntax:
{code}
BIGINT ARRAY
BIGINT ARRAY[100]
BIGINT ARRAY[100] ARRAY[200]
{code}
It would be great to have support for the standard syntax here. Some additional details that this addition should have, IMO:
- Forbid mixed syntax such as ARRAY<INT> ARRAY[100] or ARRAY<BIGINT> ARRAY[100]
- Ignore the maximum capacity (ARRAY[N]) but allow it to be specified; this seems to be what others (e.g. PostgreSQL) are doing.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6874) Add support for SQL:2003 array type declaration syntax
[ https://issues.apache.org/jira/browse/SPARK-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6874: --- Assignee: Apache Spark Add support for SQL:2003 array type declaration syntax -- Key: SPARK-6874 URL: https://issues.apache.org/jira/browse/SPARK-6874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Assignee: Apache Spark Priority: Minor As of SQL:2003, arrays are standard SQL types. However, the declaration syntax differs from Spark's CQL-like syntax. Examples of standard syntax:
{code}
BIGINT ARRAY
BIGINT ARRAY[100]
BIGINT ARRAY[100] ARRAY[200]
{code}
It would be great to have support for the standard syntax here. Some additional details that this addition should have, IMO:
- Forbid mixed syntax such as ARRAY<INT> ARRAY[100] or ARRAY<BIGINT> ARRAY[100]
- Ignore the maximum capacity (ARRAY[N]) but allow it to be specified; this seems to be what others (e.g. PostgreSQL) are doing.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows
[ https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491548#comment-14491548 ] Cheng Lian commented on SPARK-6859: --- [~yijieshen] Thanks for reporting! And yes, please also open a JIRA ticket for Parquet and link it with this one so that it's easier to track. [~marmbrus] I guess we should disable pushing down filters involving binary type before this bug is fixed in Parquet. Parquet File Binary column statistics error when reuse byte[] among rows Key: SPARK-6859 URL: https://issues.apache.org/jira/browse/SPARK-6859 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Yijie Shen Priority: Minor Suppose I create a dataRDD which extends RDD[Row], and each row is GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused among rows but has different content each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row group statistics (max/min) for the Binary column are wrong. Here is the reason: in Parquet, BinaryStatistics just keeps max/min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed from the row:
{code}
max: Binary --(references)--> ByteArrayBackedBinary --(backed by)--> Array[Byte]
{code}
Therefore, each time Parquet updates the row group's statistics, max and min always refer to the same Array[Byte], which has new content each time. When Parquet saves the statistics into the file, the last row's content is saved as both max and min. It seems to be a Parquet bug, since it's Parquet's responsibility to update statistics correctly, but I'm not quite sure. Should I report it as a bug in the Parquet JIRA?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
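A minimal way to reproduce the reuse pattern described above (spark-shell style for Spark 1.3; the schema, values, and output path are made-up examples, and `sc`/`sqlContext` are assumed to exist):
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("payload", BinaryType)))

// One shared buffer per partition: every Row references the same Array[Byte],
// whose contents are overwritten in place for each row.
val rows = sc.parallelize(1 to 3, 1).mapPartitions { iter =>
  val buf = new Array[Byte](4)
  iter.map { i =>
    java.util.Arrays.fill(buf, i.toByte)
    Row(i, buf)
  }
}

sqlContext.createDataFrame(rows, schema).saveAsParquetFile("/tmp/reuse-demo")
// Because Parquet's Binary statistics hold references into buf, the row
// group's min and max for "payload" both end up equal to the last row's bytes.
{code}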
[jira] [Created] (SPARK-6874) Add support for SQL:2003 array type declaration syntax
Santiago M. Mola created SPARK-6874: --- Summary: Add support for SQL:2003 array type declaration syntax Key: SPARK-6874 URL: https://issues.apache.org/jira/browse/SPARK-6874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor As of SQL:2003, arrays are standard SQL types. However, the declaration syntax differs from Spark's CQL-like syntax. Examples of standard syntax:
{code}
BIGINT ARRAY
BIGINT ARRAY[100]
BIGINT ARRAY[100] ARRAY[200]
{code}
It would be great to have support for the standard syntax here. Some additional details that this addition should have, IMO:
- Forbid mixed syntax such as ARRAY<INT> ARRAY[100] or ARRAY<BIGINT> ARRAY[100]
- Ignore the maximum capacity (ARRAY[N]) but allow it to be specified; this seems to be what others (e.g. PostgreSQL) are doing.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
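To illustrate the request, here is a rough sketch of how the proposed declarations would sit next to Spark's existing style (spark-shell style with a HiveContext; `hiveContext` and the table/column names are assumptions, not part of the ticket):
{code}
// Existing Spark/Hive-style declaration:
hiveContext.sql("CREATE TABLE t1 (xs ARRAY<BIGINT>)")

// Proposed SQL:2003-style equivalents:
hiveContext.sql("CREATE TABLE t2 (xs BIGINT ARRAY)")       // unbounded array
hiveContext.sql("CREATE TABLE t3 (xs BIGINT ARRAY[100])")  // capacity parsed but ignored

// Mixed forms such as "ARRAY<INT> ARRAY[100]" would be rejected by the parser.
{code}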
[jira] [Resolved] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5364. Resolution: Fixed Fix Version/s: 1.3.1 HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Trivial Fix For: 1.3.1 This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
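The "non output clause" here means a TRANSFORM call with no trailing AS (...) column list. A sketch of both forms (spark-shell style; `hiveContext` and Hive's usual test table `src` are assumptions):
{code}
// The form this fix enables: TRANSFORM without an output clause, so the
// output columns take default names and types.
hiveContext.sql("SELECT TRANSFORM (key + 1, value) USING '/bin/cat' FROM src")

// The form with an explicit output clause, which already worked:
hiveContext.sql(
  "SELECT TRANSFORM (key, value) USING '/bin/cat' AS (k STRING, v STRING) FROM src")
{code}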
[jira] [Reopened] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-5364: HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Trivial Fix For: 1.3.0 This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4801) Add CTE capability to HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4801. - Resolution: Duplicate Fix Version/s: 1.4.0 Add CTE capability to HiveContext - Key: SPARK-4801 URL: https://issues.apache.org/jira/browse/SPARK-4801 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jacob Davis Fix For: 1.4.0 This is a request to add CTE functionality to HiveContext. Common Table Expressions were added in Hive 0.13.0 with HIVE-1180. Using CTE-style syntax within HiveContext currently results in the following "Caused by" message:
{code}
Caused by: scala.MatchError: TOK_CTE (of class org.apache.hadoop.hive.ql.parse.ASTNode)
  at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
  at org.apache.spark.sql.hive.HiveQl$$anonfun$13.apply(HiveQl.scala:500)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:318)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:500)
  at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:248)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
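An example of the CTE syntax that hits the MatchError above (spark-shell style; `hiveContext` and Hive's usual test table `src` are assumptions):
{code}
// WITH ... AS (...) defines a Common Table Expression that the following
// SELECT refers to by name.
hiveContext.sql(
  """WITH q1 AS (SELECT key FROM src WHERE key < 10)
    |SELECT key FROM q1""".stripMargin)
{code}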
[jira] [Closed] (SPARK-5364) HiveQL transform doesn't support the non output clause
[ https://issues.apache.org/jira/browse/SPARK-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-5364. - Assignee: Liang-Chi Hsieh HiveQL transform doesn't support the non output clause -- Key: SPARK-5364 URL: https://issues.apache.org/jira/browse/SPARK-5364 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Liang-Chi Hsieh Priority: Trivial Fix For: 1.3.0 This is a quick fix for queries (in HiveContext) like: {panel} SELECT transform(key + 1, value) USING '/bin/cat' FROM src {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4760) ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files
[ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4760. - Resolution: Fixed Fix Version/s: 1.3.0 The native Parquet support (which is used for both Spark SQL and Hive DDL by default) automatically computes sizes starting with Spark 1.3, so running ANALYZE is no longer needed for auto broadcast joins. Please reopen if you see any issues with this new feature. ANALYZE TABLE table COMPUTE STATISTICS noscan failed estimating table size for tables created from Parquet files -- Key: SPARK-4760 URL: https://issues.apache.org/jira/browse/SPARK-4760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Priority: Critical Fix For: 1.3.0 In an older Spark version built around Oct. 12, I was able to use ANALYZE TABLE table COMPUTE STATISTICS noscan to get an estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, and this is crucial to me.) In more recent Spark builds, it fails to estimate the table size unless I remove noscan. Here are the statistics I got using DESC EXTENDED:
{code}
old: parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}
new: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
{code}
I've also tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf, and the result is unaffected (in both versions). Looks like the Parquet support in the new Hive (0.13.1) is broken? Jianshi
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
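For reference, the statement in question (spark-shell style; `hiveContext` and the table name are assumptions). With noscan, Hive collects only file-level statistics such as numFiles and totalSize; without it, the data is scanned and numRows/rawDataSize are filled in as well:
{code}
// Fast, file-level statistics only (totalSize, numFiles):
hiveContext.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS noscan")

// Full scan: additionally computes numRows and rawDataSize.
hiveContext.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS")
{code}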
[jira] [Updated] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1412: Summary: Disable partial aggregation automatically when reduction factor is low (was: [SQL] Disable partial aggregation automatically when reduction factor is low) Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-1412 URL: https://issues.apache.org/jira/browse/SPARK-1412 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor Once we have seen a large enough number of rows during partial aggregation without observing any reduction, the aggregate operator should just turn off partial aggregation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
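A hypothetical sketch of that heuristic (not Spark's actual aggregate operator; all names and thresholds are illustrative): aggregate the first N input rows into the hash map, and if the number of distinct groups is close to the number of rows seen, stop aggregating and pass the remaining rows through to the shuffle unchanged.
{code}
import scala.collection.mutable

def partialAggregate[K, V](
    rows: Iterator[(K, V)],
    merge: (V, V) => V,
    sampleSize: Int = 1000,
    minReduction: Double = 0.5): Iterator[(K, V)] = {
  val groups = mutable.HashMap.empty[K, V]
  var seen = 0
  // Aggregate a sample of the input first.
  while (rows.hasNext && seen < sampleSize) {
    val (k, v) = rows.next(); seen += 1
    groups(k) = groups.get(k).map(merge(_, v)).getOrElse(v)
  }
  // Fraction of rows eliminated by pre-aggregation so far.
  val reduction = 1.0 - groups.size.toDouble / math.max(seen, 1)
  if (reduction >= minReduction) {
    // Pre-aggregation is paying off: keep merging the rest into the map.
    rows.foreach { case (k, v) =>
      groups(k) = groups.get(k).map(merge(_, v)).getOrElse(v)
    }
    groups.iterator
  } else {
    // Barely any reduction observed: emit what we have, then pass the
    // remaining rows through unaggregated.
    groups.iterator ++ rows
  }
}
{code}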
[jira] [Updated] (SPARK-1412) [SQL] Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1412: Assignee: (was: Michael Armbrust) [SQL] Disable partial aggregation automatically when reduction factor is low Key: SPARK-1412 URL: https://issues.apache.org/jira/browse/SPARK-1412 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Priority: Minor Once we have seen a large enough number of rows during partial aggregation without observing any reduction, the aggregate operator should just turn off partial aggregation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org