[jira] [Commented] (SPARK-10180) JDBCRDD does not process EqualNullSafe filter.

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743339#comment-14743339
 ] 

Apache Spark commented on SPARK-10180:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/8743

> JDBCRDD does not process EqualNullSafe filter.
> --
>
> Key: SPARK-10180
> URL: https://issues.apache.org/jira/browse/SPARK-10180
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> {{JDBCRelation}} simply passes the EqualNullSafe filter (source.filter) down, but 
> {{compileFilter()}} in {{JDBCRDD}} does not handle it.
> It would be a single-line update.
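For illustration, a hedged sketch of compiling a null-safe equality source filter into a SQL 
predicate; the {{quote}} helper and the exact predicate text are assumptions for this sketch, 
not necessarily what the pull request above does.

{code}
import org.apache.spark.sql.sources.{EqualNullSafe, Filter}

// Naive literal quoting, for the sketch only.
def quote(value: Any): String = value match {
  case s: String => "'" + s + "'"
  case other     => String.valueOf(other)
}

def compileEqualNullSafe(f: Filter): Option[String] = f match {
  case EqualNullSafe(attr, value) =>
    // Unlike '=', null-safe equality is true when both sides are NULL.
    Some(s"($attr = ${quote(value)} OR ($attr IS NULL AND ${quote(value)} IS NULL))")
  case _ => None
}
{code}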






[jira] [Created] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work

2015-09-14 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10588:
--

 Summary: Saving a DataFrame containing only nulls to JSON doesn't 
work
 Key: SPARK-10588
 URL: https://issues.apache.org/jira/browse/SPARK-10588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian


Snippets to reproduce this issue:
{noformat}
val path = "file:///tmp/spark/null"

// A single row containing a single null double, saving to JSON, wrong
sqlContext.
  range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
  write.mode("overwrite").json(path)

sqlContext.read.json(path).show()

++
||
++
||
++

// Two rows each containing a single null double, saving to JSON, wrong
sqlContext.
  range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
  write.mode("overwrite").json(path)

sqlContext.read.json(path).show()

++
||
++
||
||
++

// A single row containing two null doubles, saving to JSON, wrong
sqlContext.
  range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS 
c1").
  write.mode("overwrite").json(path)

sqlContext.read.json(path).show()

++
||
++
||
++

// A single row containing a single null double, saving to Parquet, OK
sqlContext.
  range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
  write.mode("overwrite").parquet(path)

sqlContext.read.parquet(path).show()

++
|   d|
++
|null|
++

// Two rows, one containing a single null double, one containing non-null 
double, saving to JSON, OK
sqlContext.
  range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
  write.mode("overwrite").json(path)

sqlContext.read.json(path).show()

++
|   d|
++
|null|
| 1.0|
++
{noformat}






[jira] [Commented] (SPARK-10587) In pyspark, toDF() doesn't exist in RDD object

2015-09-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743334#comment-14743334
 ] 

Sean Owen commented on SPARK-10587:
---

It's in {{python/pyspark/sql/context.py}}. Are you sure your imports are in 
order? This is probably a question for user@, not a JIRA at this point. 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> In pyspark, toDF() doesn't exist in RDD object
> ---
>
> Key: SPARK-10587
> URL: https://issues.apache.org/jira/browse/SPARK-10587
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: SemiCoder
>
> I can't find the toDF() function on RDD.
> In pyspark.mllib.linalg.distributed, IndexedRowMatrix.__init__() 
> requires that rows be an RDD and calls rows.toDF(), but the 
> RDD in pyspark doesn't have a toDF() function.






[jira] [Assigned] (SPARK-10589) Add defense against external site framing

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10589:


Assignee: Apache Spark  (was: Sean Owen)

> Add defense against external site framing
> -
>
> Key: SPARK-10589
> URL: https://issues.apache.org/jira/browse/SPARK-10589
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> This came up as a minor point during a security audit using a common scanning 
> tool: It's best if Spark UIs try to actively defend against certain types of 
> frame-related vulnerabilities by setting X-Frame-Options. See 
> https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet
> Easy PR coming ...






[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks

2015-09-14 Thread Danil Mironov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743415#comment-14743415
 ] 

Danil Mironov commented on SPARK-2960:
--

The title of the issue is not that misleading: when one doesn't get 
spark-something running after typing 'spark-something', that is commonly 
described as 'spark executables fail to start'. Following the symlinks does fix the 
issue at hand.

Having executables {quote}configured by {{SPARK_HOME}} and/or 
{{SPARK_CONF_DIR}}{quote} would be a nice solution; I'd vote for that.
This implies that the scripts treat those settings as read-only and quit 
early and loudly if the latter is missing or broken. 
That's some rework though, not a bug to fix.

> Spark executables fail to start via symlinks
> 
>
> Key: SPARK-2960
> URL: https://issues.apache.org/jira/browse/SPARK-2960
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Shay Rojansky
>Priority: Minor
>
> The current scripts (e.g. pyspark) fail to run when they are executed via 
> symlinks. A common Linux scenario would be to have Spark installed somewhere 
> (e.g. /opt) and have a symlink to it in /usr/bin.






[jira] [Created] (SPARK-10589) Add defense against external site framing

2015-09-14 Thread Sean Owen (JIRA)
Sean Owen created SPARK-10589:
-

 Summary: Add defense against external site framing
 Key: SPARK-10589
 URL: https://issues.apache.org/jira/browse/SPARK-10589
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor


This came up as a minor point during a security audit using a common scanning 
tool: It's best if Spark UIs try to actively defend against certain types of 
frame-related vulnerabilities by setting X-Frame-Options. See 
https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet

Easy PR coming ...
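For illustration only (not the actual patch): a minimal servlet filter that sets the header, 
assuming the UI's request handling can be wrapped with a standard javax.servlet filter.

{code}
import javax.servlet._
import javax.servlet.http.HttpServletResponse

class XFrameOptionsFilter extends Filter {
  override def init(config: FilterConfig): Unit = {}
  override def destroy(): Unit = {}
  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    // SAMEORIGIN lets the UI frame its own pages but blocks framing by external sites.
    res.asInstanceOf[HttpServletResponse].setHeader("X-Frame-Options", "SAMEORIGIN")
    chain.doFilter(req, res)
  }
}
{code}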






[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743421#comment-14743421
 ] 

Apache Spark commented on SPARK-1537:
-

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/8744

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.






[jira] [Commented] (SPARK-10577) [PySpark] DataFrame hint for broadcast join

2015-09-14 Thread Jian Feng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743426#comment-14743426
 ] 

Jian Feng Zhang commented on SPARK-10577:
-

I'd like to take this and create a pull request.

> [PySpark] DataFrame hint for broadcast join
> ---
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej BryƄski
>  Labels: starter
>
> As in https://issues.apache.org/jira/browse/SPARK-8300
> there should be a possibility to add a hint for a broadcast join in:
> - PySpark
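For reference, a small Scala sketch of the hint that SPARK-8300 already provides, which PySpark 
would mirror (the DataFrames here are made up for the example, and an existing sqlContext is 
assumed, as in the other snippets in this thread):

{code}
import org.apache.spark.sql.functions.broadcast

val largeDF = sqlContext.range(0, 1000000).selectExpr("id AS key", "id * 2 AS value")
val smallDF = sqlContext.range(0, 100).selectExpr("id AS key", "id AS label")

// Hint the planner to broadcast the small side instead of shuffling both sides.
val joined = largeDF.join(broadcast(smallDF), "key")
joined.explain()  // the physical plan should show a broadcast join
{code}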






[jira] [Assigned] (SPARK-10589) Add defense against external site framing

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10589:


Assignee: Sean Owen  (was: Apache Spark)

> Add defense against external site framing
> -
>
> Key: SPARK-10589
> URL: https://issues.apache.org/jira/browse/SPARK-10589
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> This came up as a minor point during a security audit using a common scanning 
> tool: It's best if Spark UIs try to actively defend against certain types of 
> frame-related vulnerabilities by setting X-Frame-Options. See 
> https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet
> Easy PR coming ...






[jira] [Commented] (SPARK-10589) Add defense against external site framing

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743470#comment-14743470
 ] 

Apache Spark commented on SPARK-10589:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8745

> Add defense against external site framing
> -
>
> Key: SPARK-10589
> URL: https://issues.apache.org/jira/browse/SPARK-10589
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> This came up as a minor point during a security audit using a common scanning 
> tool: It's best if Spark UIs try to actively defend against certain types of 
> frame-related vulnerabilities by setting X-Frame-Options. See 
> https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet
> Easy PR coming ...






[jira] [Issue Comment Deleted] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access

2015-09-14 Thread Rustam Aliyev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rustam Aliyev updated SPARK-7442:
-
Comment: was deleted

(was: Hit this bug today. It basically makes Spark on AWS useless for many 
scenarios. Please prioritise.)

> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads 
> page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 
> works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 
> that doesn't work.
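As an aside, a commonly suggested workaround sketch, assuming the hadoop-aws jar and its AWS SDK 
dependency have been added to the classpath (the configuration keys are the standard Hadoop s3n 
ones; {{sc}} is an existing SparkContext):

{code}
// Register the s3n filesystem implementation and credentials explicitly.
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

sc.textFile("s3n://bucket/file_*").count()
{code}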






[jira] [Commented] (SPARK-4815) ThriftServer use only one SessionState to run sql using hive

2015-09-14 Thread Joseph Fourny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743555#comment-14743555
 ] 

Joseph Fourny commented on SPARK-4815:
--

Is this really fixed? I am on Spark 1.5.0 (rc3) and I see very little isolation 
between JDBC connections to the ThriftServer. For example, "SET X=Y" or "USE 
DATABASE X" on one connection immediately affects all other connections. This 
is extremely undesirable behavior. Was there a regression at some point?

> ThriftServer use only one SessionState to run sql using hive 
> -
>
> Key: SPARK-4815
> URL: https://issues.apache.org/jira/browse/SPARK-4815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: guowei
>
> The ThriftServer uses only one SessionState to run SQL through Hive, even though the queries 
> come from different Hive sessions.
> This causes mistakes:
> For example, when one user runs "use database" in one beeline client, the database 
> in other beeline clients changes too.






[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2015-09-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743531#comment-14743531
 ] 

Steve Loughran commented on SPARK-2356:
---

The original JIRA here is just that there's an error being printed out; in that 
specific example it is just noise. You can configure log4j not to 
log anything from {{org.apache.hadoop.util.Shell}} and you won't see this 
text. The other issues people are finding are actual problems: Hadoop and the 
libraries underneath are trying to load WINUTILS.EXE for real work - and failing.
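A one-line sketch of that suppression done programmatically with the log4j 1.x API that Spark 
bundles (a log4j.properties entry for the same logger works equally well):

{code}
import org.apache.log4j.{Level, Logger}

// Silence all logging from the Shell class, including the winutils lookup error.
Logger.getLogger("org.apache.hadoop.util.Shell").setLevel(Level.OFF)
{code}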

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark; they work fine on a cluster 
> (YARN, Linux machines). However, when I try to run them on a local machine 
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I'm reading files 
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized each time a Spark 
> context is created, regardless of whether Hadoop is required.
> I propose adding a special flag to indicate whether the Hadoop config is required 
> (or allowing this configuration to be started manually).






[jira] [Commented] (SPARK-6961) Cannot save data to parquet files when executing from Windows from a Maven Project

2015-09-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743537#comment-14743537
 ] 

Steve Loughran commented on SPARK-6961:
---

Well, it's an installation-side issue in that "if it isn't there you can fix it 
with a re-installation". 

The fact that things fail with an utterly useless error message is 
very much a code-side issue. HADOOP-10775 is going to add extra checks and a 
link to a wiki entry (https://wiki.apache.org/hadoop/WindowsProblems) with some 
advice. One trouble spot there is that code often just references a field 
(which is set to null on a load failure); the patch will have to make sure we 
switch to exception-raising getters as needed, and that the callers handle the 
raised exceptions properly.
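A hedged sketch of the "exception-raising getter" pattern described above, with hypothetical 
names; the real change is whatever lands in HADOOP-10775.

{code}
object WinUtilsSketch {
  // Mirrors the problematic style: a field silently left null when the lookup fails.
  private val winUtilsPath: String =
    sys.env.get("HADOOP_HOME").map(home => home + "\\bin\\winutils.exe").orNull

  // Fail fast with a useful message instead of letting callers hit an NPE later.
  def getWinUtilsPath: String = {
    if (winUtilsPath == null) {
      throw new IllegalStateException(
        "winutils.exe not found; see https://wiki.apache.org/hadoop/WindowsProblems")
    }
    winUtilsPath
  }
}
{code}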

> Cannot save data to parquet files when executing from Windows from a Maven 
> Project
> --
>
> Key: SPARK-6961
> URL: https://issues.apache.org/jira/browse/SPARK-6961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Bogdan Niculescu
>Priority: Critical
>
> I have set up a project where I am trying to save a DataFrame into a parquet 
> file. My project is a Maven one with Spark 1.3.0 and Scala 2.11.5:
> {code:xml}
> <spark.version>1.3.0</spark.version>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.11</artifactId>
>     <version>${spark.version}</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.11</artifactId>
>     <version>${spark.version}</version>
> </dependency>
> {code}
> A simple version of my code that consistently reproduces the problem I 
> am seeing is:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> case class Person(name: String, age: Int)
> object DataFrameTest extends App {
>   val conf = new SparkConf().setMaster("local[4]").setAppName("DataFrameTest")
>   val sc = new SparkContext(conf)
>   val sqlContext = new SQLContext(sc)
>   val persons = List(Person("a", 1), Person("b", 2))
>   val rdd = sc.parallelize(persons)
>   val dataFrame = sqlContext.createDataFrame(rdd)
>   dataFrame.saveAsParquetFile("test.parquet")
> }
> {code}
> The exception that I get every time is:
> {code}
> Exception in thread "main" java.lang.NullPointerException
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
>   at org.apache.hadoop.util.Shell.run(Shell.java:379)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:772)
>   at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:409)
>   at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:401)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:443)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:240)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:256)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.(newParquet.scala:370)
>   at 
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)
>   at 
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)
>   at 

[jira] [Commented] (SPARK-10550) SQLListener error constructing extended SQLContext

2015-09-14 Thread shao lo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743572#comment-14743572
 ] 

shao lo commented on SPARK-10550:
-

There are parts that are marked as experimental.  This is not in that category. 
 The reason to make a class have protected access is exactly to promote 
extension.

> SQLListener error constructing extended SQLContext 
> ---
>
> Key: SPARK-10550
> URL: https://issues.apache.org/jira/browse/SPARK-10550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: shao lo
>Priority: Minor
>
> With Spark 1.4.1 I was able to create a custom SQLContext class.  With Spark 
> 1.5.0, I now get an error calling the super class constructor.  The problem 
> is related to this new code that was added between 1.4.1 and 1.5.0:
>   // `listener` should be only used in the driver
>   @transient private[sql] val listener = new SQLListener(this)
>   sparkContext.addSparkListener(listener)
>   sparkContext.ui.foreach(new SQLTab(this, _))
> ...which generates:
> Exception in thread "main" java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.(SQLListener.scala:34)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:77)






[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access

2015-09-14 Thread Rustam Aliyev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743539#comment-14743539
 ] 

Rustam Aliyev commented on SPARK-7442:
--

Hit this bug today. It basically makes Spark on AWS useless for many scenarios. 
Please prioritise.

> Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
> -
>
> Key: SPARK-7442
> URL: https://issues.apache.org/jira/browse/SPARK-7442
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.3.1
> Environment: OS X
>Reporter: Nicholas Chammas
>
> # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads 
> page|http://spark.apache.org/downloads.html].
> # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}}
> # Fire up PySpark and try reading from S3 with something like this:
> {code}sc.textFile('s3n://bucket/file_*').count(){code}
> # You will get an error like this:
> {code}py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : java.io.IOException: No FileSystem for scheme: s3n{code}
> {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 
> works.
> It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 
> that doesn't work.






[jira] [Created] (SPARK-10590) Spark with YARN build is broken

2015-09-14 Thread Kevin Tsai (JIRA)
Kevin Tsai created SPARK-10590:
--

 Summary: Spark with YARN build is broken
 Key: SPARK-10590
 URL: https://issues.apache.org/jira/browse/SPARK-10590
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
 Environment: CentOS 6.5
Maven 3.3.3
Hadoop 2.6.0
Spark 1.5.0

Reporter: Kevin Tsai


Hi, after upgrading to v1.5.0, I tried to build it.

It shows:
[ERROR] missing or invalid dependency detected while loading class file 
'WebUI.class'

It was working with Spark 1.4.1.
Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
-Phive-thriftserver -Dscala-2.11 -DskipTests clean package
Hope this helps.

Regards,
Kevin






[jira] [Commented] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743500#comment-14743500
 ] 

Apache Spark commented on SPARK-7012:
-

User 'sabhyankar' has created a pull request for this issue:
https://github.com/apache/spark/pull/8746

> Add support for NOT NULL modifier for column definitions on DDLParser
> -
>
> Key: SPARK-7012
> URL: https://issues.apache.org/jira/browse/SPARK-7012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: easyfix
>
> Add support for NOT NULL modifier for column definitions on DDLParser. This 
> would add support for the following syntax:
> CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ...






[jira] [Assigned] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7012:
---

Assignee: Apache Spark

> Add support for NOT NULL modifier for column definitions on DDLParser
> -
>
> Key: SPARK-7012
> URL: https://issues.apache.org/jira/browse/SPARK-7012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Santiago M. Mola
>Assignee: Apache Spark
>Priority: Minor
>  Labels: easyfix
>
> Add support for NOT NULL modifier for column definitions on DDLParser. This 
> would add support for the following syntax:
> CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ...






[jira] [Assigned] (SPARK-7012) Add support for NOT NULL modifier for column definitions on DDLParser

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7012:
---

Assignee: (was: Apache Spark)

> Add support for NOT NULL modifier for column definitions on DDLParser
> -
>
> Key: SPARK-7012
> URL: https://issues.apache.org/jira/browse/SPARK-7012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Santiago M. Mola
>Priority: Minor
>  Labels: easyfix
>
> Add support for NOT NULL modifier for column definitions on DDLParser. This 
> would add support for the following syntax:
> CREATE TEMPORARY TABLE (field INTEGER NOT NULL) ...






[jira] [Updated] (SPARK-10590) Spark with YARN build is broken

2015-09-14 Thread Kevin Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Tsai updated SPARK-10590:
---
Environment: 
CentOS 6.5
Oracle JDK 1.7.0_75
Maven 3.3.3
Hadoop 2.6.0
Spark 1.5.0


  was:
CentOS 6.5
Maven 3.3.3
Hadoop 2.6.0
Spark 1.5.0



> Spark with YARN build is broken
> ---
>
> Key: SPARK-10590
> URL: https://issues.apache.org/jira/browse/SPARK-10590
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: CentOS 6.5
> Oracle JDK 1.7.0_75
> Maven 3.3.3
> Hadoop 2.6.0
> Spark 1.5.0
>Reporter: Kevin Tsai
>
> Hi, After upgrade to v1.5.0 and trying to build it.
> It shows:
> [ERROR] missing or invalid dependency detected while loading class file 
> 'WebUI.class'
> It was working on Spark 1.4.1
> Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
> -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
> Hope it helps.
> Regards,
> Kevin






[jira] [Commented] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743777#comment-14743777
 ] 

Apache Spark commented on SPARK-10458:
--

User 'kmadhugit' has created a pull request for this issue:
https://github.com/apache/spark/pull/8749

> Would like to know if a given Spark Context is stopped or currently stopping
> 
>
> Key: SPARK-10458
> URL: https://issues.apache.org/jira/browse/SPARK-10458
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matt Cheah
>Priority: Minor
>
> I ran into a case where a thread stopped a Spark Context, specifically when I 
> hit the "kill" link from the Spark standalone UI. There was no real way for 
> another thread to know that the context had stopped and thus should have 
> handled that accordingly.
> Checking that the SparkEnv is null is one way, but that doesn't handle the 
> case where the context is in the midst of stopping, and stopping the context 
> may actually not be instantaneous - in my case for some reason the 
> DAGScheduler was taking a non-trivial amount of time to stop.
> Implementation-wise, I'm more or less requesting that the boolean value returned 
> from SparkContext.stopped.get() be visible in some way. As long as we 
> return the value and not the AtomicBoolean itself (we wouldn't want anyone 
> to be setting this, after all!) it would help client applications check the 
> context's liveness.
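A minimal sketch of the accessor being requested (the name isStopped is hypothetical; whatever 
method Spark eventually exposes may differ):

{code}
import java.util.concurrent.atomic.AtomicBoolean

class StoppableContext {
  private val stopped = new AtomicBoolean(false)
  def stop(): Unit = stopped.set(true)
  // Expose the value, not the AtomicBoolean: callers can poll liveness but cannot flip the flag.
  def isStopped: Boolean = stopped.get()
}
{code}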






[jira] [Assigned] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10458:


Assignee: Apache Spark

> Would like to know if a given Spark Context is stopped or currently stopping
> 
>
> Key: SPARK-10458
> URL: https://issues.apache.org/jira/browse/SPARK-10458
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matt Cheah
>Assignee: Apache Spark
>Priority: Minor
>
> I ran into a case where a thread stopped a Spark Context, specifically when I 
> hit the "kill" link from the Spark standalone UI. There was no real way for 
> another thread to know that the context had stopped and thus should have 
> handled that accordingly.
> Checking that the SparkEnv is null is one way, but that doesn't handle the 
> case where the context is in the midst of stopping, and stopping the context 
> may actually not be instantaneous - in my case for some reason the 
> DAGScheduler was taking a non-trivial amount of time to stop.
> Implementation wise I'm more or less requesting the boolean value returned 
> from SparkContext.stopped.get() to be visible in some way. As long as we 
> return the value and not the Atomic Boolean itself (we wouldn't want anyone 
> to be setting this, after all!) it would help client applications check the 
> context's liveliness.






[jira] [Assigned] (SPARK-10458) Would like to know if a given Spark Context is stopped or currently stopping

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10458:


Assignee: (was: Apache Spark)

> Would like to know if a given Spark Context is stopped or currently stopping
> 
>
> Key: SPARK-10458
> URL: https://issues.apache.org/jira/browse/SPARK-10458
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matt Cheah
>Priority: Minor
>
> I ran into a case where a thread stopped a Spark Context, specifically when I 
> hit the "kill" link from the Spark standalone UI. There was no real way for 
> another thread to know that the context had stopped and thus should have 
> handled that accordingly.
> Checking that the SparkEnv is null is one way, but that doesn't handle the 
> case where the context is in the midst of stopping, and stopping the context 
> may actually not be instantaneous - in my case for some reason the 
> DAGScheduler was taking a non-trivial amount of time to stop.
> Implementation wise I'm more or less requesting the boolean value returned 
> from SparkContext.stopped.get() to be visible in some way. As long as we 
> return the value and not the Atomic Boolean itself (we wouldn't want anyone 
> to be setting this, after all!) it would help client applications check the 
> context's liveliness.






[jira] [Commented] (SPARK-10550) SQLListener error constructing extended SQLContext

2015-09-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743642#comment-14743642
 ] 

Sean Owen commented on SPARK-10550:
---

It's marked {{protected[sql]}} which means it is not accessible outside 
{{org.apache.spark.sql}}. It can't be an API as such, not even 'experimental'. 
You're kind of at your own risk if you're trying to access things like this, as 
they may change from version to version. (It ends up being merely "protected" 
in the bytecode since the JVM has no similar notion of "protected with respect 
to a package" though.) This is why I'm not sure this can be considered a 'bug' 
as I understand what you're trying to do.
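A small, self-contained illustration of the access-modifier point (package and member names are 
made up):

{code}
package sqlpkg {
  class Ctx {
    // Visible only within package `sqlpkg`, analogous to Spark's protected[sql] members.
    protected[sqlpkg] val listener: String = "internal"
  }
  class SamePackage {
    def read(c: Ctx): String = c.listener // compiles: same package
  }
}

package userpkg {
  class Outside {
    // def read(c: sqlpkg.Ctx): String = c.listener // would not compile: outside `sqlpkg`
  }
}
{code}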

> SQLListener error constructing extended SQLContext 
> ---
>
> Key: SPARK-10550
> URL: https://issues.apache.org/jira/browse/SPARK-10550
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: shao lo
>Priority: Minor
>
> With spark 1.4.1 I was able to created a custom SQLContext class.  With spark 
> 1.5.0, I now get an error  calling the super class constructor.  The problem 
> is related to this new code that was added between 1.4.1 and 1.5.0
>   // `listener` should be only used in the driver
>   @transient private[sql] val listener = new SQLListener(this)
>   sparkContext.addSparkListener(listener)
>   sparkContext.ui.foreach(new SQLTab(this, _))
> ..which generates 
> Exception in thread "main" java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.(SQLListener.scala:34)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:77)






[jira] [Commented] (SPARK-10590) Spark with YARN build is broken

2015-09-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743666#comment-14743666
 ] 

Sean Owen commented on SPARK-10590:
---

Did you run the script to set up the build for Scala 2.11 first? Otherwise this 
probably won't work.

> Spark with YARN build is broken
> ---
>
> Key: SPARK-10590
> URL: https://issues.apache.org/jira/browse/SPARK-10590
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: CentOS 6.5
> Oracle JDK 1.7.0_75
> Maven 3.3.3
> Hadoop 2.6.0
> Spark 1.5.0
>Reporter: Kevin Tsai
>
> Hi, After upgrade to v1.5.0 and trying to build it.
> It shows:
> [ERROR] missing or invalid dependency detected while loading class file 
> 'WebUI.class'
> It was working on Spark 1.4.1
> Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
> -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
> Hope it helps.
> Regards,
> Kevin






[jira] [Commented] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work

2015-09-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743770#comment-14743770
 ] 

Yin Huai commented on SPARK-10588:
--

This is expected behavior. When we write a row out, we skip the null 
values, which is quite useful for saving space when writing sparse data to JSON.

One possible way to address this issue is to write null values only for the 
first row generated by a writer.
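A schematic sketch of that idea (toy code, not the actual JSON writer): keep nulls in the first 
record a writer emits so the field names survive a round trip, and drop them afterwards.

{code}
def toJson(fieldNames: Seq[String], values: Seq[Any], isFirstRecord: Boolean): String = {
  val fields = fieldNames.zip(values).collect {
    case (name, null) if isFirstRecord => "\"" + name + "\":null"         // keep the field visible once
    case (name, v: String)             => "\"" + name + "\":\"" + v + "\""
    case (name, v) if v != null        => "\"" + name + "\":" + v
  }
  fields.mkString("{", ",", "}")
}

// toJson(Seq("c0"), Seq(null), isFirstRecord = true)  == "{\"c0\":null}"
// toJson(Seq("c0"), Seq(null), isFirstRecord = false) == "{}"
{code}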

> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS 
> c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> ++
> |   d|
> ++
> |null|
> ++
> // Two rows, one containing a single null double, one containing non-null 
> double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> |   d|
> ++
> |null|
> | 1.0|
> ++
> {noformat}






[jira] [Updated] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work

2015-09-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10588:
-
Priority: Minor  (was: Major)

> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Minor
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS 
> c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> ++
> |   d|
> ++
> |null|
> ++
> // Two rows, one containing a single null double, one containing non-null 
> double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> |   d|
> ++
> |null|
> | 1.0|
> ++
> {noformat}






[jira] [Commented] (SPARK-10585) only copy data once when generate unsafe projection

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743677#comment-14743677
 ] 

Apache Spark commented on SPARK-10585:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8747

> only copy data once when generate unsafe projection
> ---
>
> Key: SPARK-10585
> URL: https://issues.apache.org/jira/browse/SPARK-10585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> When we have a nested struct, array or map, we create a byte buffer for 
> each of them, copy the data to that buffer first, and then copy it to the final 
> row buffer. We can save the first copy and write the data directly into the final 
> row buffer.
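A schematic sketch of the idea (plain java.nio buffers, not Spark's UnsafeRow writers); the point 
is simply to skip the per-field temporary buffer.

{code}
import java.nio.ByteBuffer

def writeNestedWithExtraCopy(row: ByteBuffer, nested: Array[Byte]): Unit = {
  val tmp = ByteBuffer.allocate(nested.length) // per-field scratch buffer
  tmp.put(nested)                              // first copy: into the scratch buffer
  tmp.flip()
  row.put(tmp)                                 // second copy: scratch buffer into the row
}

def writeNestedDirectly(row: ByteBuffer, nested: Array[Byte]): Unit = {
  row.put(nested)                              // single copy, straight into the row buffer
}
{code}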






[jira] [Assigned] (SPARK-10585) only copy data once when generate unsafe projection

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10585:


Assignee: Apache Spark

> only copy data once when generate unsafe projection
> ---
>
> Key: SPARK-10585
> URL: https://issues.apache.org/jira/browse/SPARK-10585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> When we have nested struct, array or map, we will create a byte buffer for 
> each of them, and copy data to the buffer first, then copy them to the final 
> row buffer. We can save the first copy and directly copy data to final row 
> buffer.






[jira] [Assigned] (SPARK-10585) only copy data once when generate unsafe projection

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10585:


Assignee: (was: Apache Spark)

> only copy data once when generate unsafe projection
> ---
>
> Key: SPARK-10585
> URL: https://issues.apache.org/jira/browse/SPARK-10585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> When we have nested struct, array or map, we will create a byte buffer for 
> each of them, and copy data to the buffer first, then copy them to the final 
> row buffer. We can save the first copy and directly copy data to final row 
> buffer.






[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-09-14 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743759#comment-14743759
 ] 

Shivaram Venkataraman commented on SPARK-9325:
--

Thanks [~felixcheung] for investigating this. I see the problem that we 
need a handle to the DataFrame in order to be able to collect a column. I can 
think of a couple of ways to solve this: 
One is to save an optional handle to the DataFrame on the R side and then, if 
the handle is available, support collect; i.e. if the column was created 
using some other method (say col("name")) then we won't support collect. 

The other is to add a method on the Scala side which can return the DataFrame 
handle or do the selection for us if the column is resolved -- [~davies] or 
[~rxin] might be able to comment more on this.

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.






[jira] [Commented] (SPARK-6417) Add Linear Programming algorithm

2015-09-14 Thread Ehsan Mohyedin Kermani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743896#comment-14743896
 ] 

Ehsan Mohyedin Kermani commented on SPARK-6417:
---

Thank you Joseph for the advice! I have started with the starter kit and fixed 
some annotations to get a sense of contributing to Spark. I am going to work on 
the LP implementations and perhaps submit it as a package.  

Regards

> Add Linear Programming algorithm 
> -
>
> Key: SPARK-6417
> URL: https://issues.apache.org/jira/browse/SPARK-6417
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Fan Jiang
>  Labels: features
>
> Linear programming is the problem of finding a vector x that minimizes a 
> linear function f^T x subject to linear constraints:
> min_x f^T x
> such that one or more of the following hold: A·x ≀ b, A_eq·x = b_eq, l ≀ x ≀ u.






[jira] [Commented] (SPARK-10590) Spark with YARN build is broken

2015-09-14 Thread Kevin Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743813#comment-14743813
 ] 

Kevin Tsai commented on SPARK-10590:


Hi Sean,
The result is the same as before when I build it after installing Scala 2.11.7.

Here is the result:
...
[ERROR] missing or invalid dependency detected while loading class file 
'WebUI.class'.
Could not access term jetty in value org.eclipse,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the 
problematic classpath.)
A full rebuild may help if 'WebUI.class' was compiled against an incompatible 
version of org.eclipse.
[WARNING] 22 warnings found
[ERROR] two errors found.
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  7.360 s]
[INFO] Spark Project Core . SUCCESS [05:41 min]
[INFO] Spark Project Bagel  SUCCESS [ 40.951 s]
[INFO] Spark Project GraphX ... SUCCESS [01:41 min]
[INFO] Spark Project ML Library ... SUCCESS [04:05 min]
[INFO] Spark Project Tools  SUCCESS [ 20.053 s]
[INFO] Spark Project Networking ... SUCCESS [ 10.914 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  6.852 s]
[INFO] Spark Project Streaming  SUCCESS [02:38 min]
[INFO] Spark Project Catalyst . SUCCESS [03:16 min]
[INFO] Spark Project SQL .. FAILURE [01:22 min]


> Spark with YARN build is broken
> ---
>
> Key: SPARK-10590
> URL: https://issues.apache.org/jira/browse/SPARK-10590
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: CentOS 6.5
> Oracle JDK 1.7.0_75
> Maven 3.3.3
> Hadoop 2.6.0
> Spark 1.5.0
>Reporter: Kevin Tsai
>
> Hi, After upgrade to v1.5.0 and trying to build it.
> It shows:
> [ERROR] missing or invalid dependency detected while loading class file 
> 'WebUI.class'
> It was working on Spark 1.4.1
> Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
> -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
> Hope it helps.
> Regards,
> Kevin






[jira] [Commented] (SPARK-10579) Extend statistical functions: Add Cardinality/Quantiles/Quartiles/Median in Statistics , e.g. for columns

2015-09-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743819#comment-14743819
 ] 

Joseph K. Bradley commented on SPARK-10579:
---

A lot of this functionality is being added to DataFrames instead.  I'd 
recommend examining what DataFrames provide (and what JIRAs already exist) and 
opening JIRAs as needed for each function you're interested in.  I'll close 
this for now but will keep watching.  Thanks!
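For context, a hedged sketch of what the DataFrame API already covers in this area (1.5-era 
calls; the data and column names are illustrative, and {{sqlContext}} is assumed to exist):

{code}
val df = sqlContext.range(0, 100).selectExpr("id", "id % 7 AS bucket")

df.describe("id").show()                                  // count, mean, stddev, min, max
val cardinality = df.select("bucket").distinct().count()  // distinct-value count for a column
{code}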

> Extend statistical functions: Add Cardinality/Quantiles/Quartiles/Median in 
> Statistics , e.g. for columns
> -
>
> Key: SPARK-10579
> URL: https://issues.apache.org/jira/browse/SPARK-10579
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Narine Kokhlikyan
>Priority: Minor
> Fix For: 1.6.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Hi everyone,
> I think it would be good to extend the statistical functions in the mllib package by 
> adding Cardinality/Quantiles/Quartiles/Median for columns, as many other 
> ML and statistical libraries already have them. I couldn't find them in the mllib 
> package, hence I would like to suggest this.
> Since this is my first time working with JIRA, I'd truly appreciate it if 
> someone could review this and let me know what you think. 
> Also, I'd really like to work on it, and I'm looking forward to hearing from you!
> Thanks,
> Narine






[jira] [Closed] (SPARK-10579) Extend statistical functions: Add Cardinality/Quantiles/Quartiles/Median in Statistics , e.g. for columns

2015-09-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-10579.
-
Resolution: Won't Fix

> Extend statistical functions: Add Cardinality/Quantiles/Quartiles/Median in 
> Statistics , e.g. for columns
> -
>
> Key: SPARK-10579
> URL: https://issues.apache.org/jira/browse/SPARK-10579
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Narine Kokhlikyan
>Priority: Minor
> Fix For: 1.6.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Hi everyone,
> I think it would be good to extend statistical functions in mllib package, by 
> adding  Cardinality/Quantiles/Quartiles/Median for the columns, as many other 
> ml and statistical libraries already have it. I couldn't find it in mllib 
> package, hence would like to suggest it.
> Since this is my first time working with jira, I'd truly appreciate if 
> someone could review this and let me know what do you think. 
> Also, I'd really like to work on it and looking forward to hearing from you!
> Thanks,
> Narine






[jira] [Commented] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743840#comment-14743840
 ] 

Apache Spark commented on SPARK-10588:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8750

> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Minor
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS 
> c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> ++
> |   d|
> ++
> |null|
> ++
> // Two rows, one containing a single null double, one containing non-null 
> double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> |   d|
> ++
> |null|
> | 1.0|
> ++
> {noformat}
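
A possible interim workaround (a sketch only, not from this ticket, reusing the 
{{path}} and {{sqlContext}} from the snippets above): since JSON schema inference 
cannot recover a column whose values are all null, reading the data back with an 
explicit schema keeps the column in the result.

{code}
// Hedged workaround sketch: supply the schema instead of inferring it.
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val schema = StructType(Seq(StructField("c0", DoubleType, nullable = true)))
sqlContext.read.schema(schema).json(path).show()
{code}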



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10588:


Assignee: (was: Apache Spark)

> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Minor
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS 
> c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> ++
> |   d|
> ++
> |null|
> ++
> // Two rows, one containing a single null double, one containing non-null 
> double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> |   d|
> ++
> |null|
> | 1.0|
> ++
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10588:


Assignee: Apache Spark

> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Minor
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS 
> c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> ++
> |   d|
> ++
> |null|
> ++
> // Two rows, one containing a single null double, one containing non-null 
> double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> |   d|
> ++
> |null|
> | 1.0|
> ++
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10591) False negative in QueryTest.checkAnswer

2015-09-14 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-10591:
--

 Summary: False negative in QueryTest.checkAnswer
 Key: SPARK-10591
 URL: https://issues.apache.org/jira/browse/SPARK-10591
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.5.0, 1.4.1, 1.3.1, 1.2.2, 1.1.1, 1.0.2
Reporter: Cheng Lian


# For double and float, `NaN == NaN` is always `false`
# `checkAnswer` doesn't handle `Map` properly. For example:
  {noformat}
  scala> Map(1 -> 2, 2 -> 1).toString
  res0: String = Map(1 -> 2, 2 -> 1)

  scala> Map(2 -> 1, 1 -> 2).toString
  res1: String = Map(2 -> 1, 1 -> 2)
  {noformat}
  We can't rely on `toString` to compare `Map` instances.

Need to update `checkAnswer` to special case `NaN` and `Map`.
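
As a starting point, a minimal sketch (names are illustrative only, not the 
eventual fix) of a comparison that treats NaN as equal to itself and compares 
Maps by content rather than by their toString rendering:

{code}
def sameValue(a: Any, b: Any): Boolean = (a, b) match {
  case (x: Double, y: Double) => (x.isNaN && y.isNaN) || x == y
  case (x: Float, y: Float)   => (x.isNaN && y.isNaN) || x == y
  case (x: Map[_, _], y: Map[_, _]) =>
    // order-insensitive: look up every key of x in y
    x.size == y.size && x.forall { case (k, v) =>
      y.asInstanceOf[Map[Any, Any]].get(k).exists(sameValue(v, _))
    }
  case _ => a == b
}

sameValue(Double.NaN, Double.NaN)                   // true
sameValue(Map(1 -> 2, 2 -> 1), Map(2 -> 1, 1 -> 2)) // true
{code}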




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10573) IndexToString transformSchema adds output field as DoubleType

2015-09-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743825#comment-14743825
 ] 

Joseph K. Bradley commented on SPARK-10573:
---

I think your assessment is correct.  Would you mind sending a PR?  Thanks!

> IndexToString transformSchema adds output field as DoubleType
> -
>
> Key: SPARK-10573
> URL: https://issues.apache.org/jira/browse/SPARK-10573
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>
> Reproducible example:
> {code}
> val stage = new IndexToString().setInputCol("input").setOutputCol("output")
> val inSchema = StructType(Seq(StructField("input", DoubleType)))
> val outSchema = stage.transformSchema(inSchema)
> assert(outSchema("output").dataType == StringType)
> {code}
> The root cause seems to be that it uses {{NominalAttribute.toStructField}} 
> which assumes {{DoubleType}}. It would probably be better to just use 
> {{SchemaUtils.appendColumn}} and explicitly set the data type.
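
For illustration, a hedged sketch of what the suggested change amounts to (the 
helper below is hypothetical, not Spark code): append the output column with an 
explicit {{StringType}} instead of deriving it from {{NominalAttribute}}.

{code}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper showing the intended transformSchema result:
// the output column is appended as StringType, not DoubleType.
def appendStringColumn(schema: StructType, outputCol: String): StructType =
  StructType(schema.fields :+ StructField(outputCol, StringType, nullable = true))
{code}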



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10578) pyspark.ml.classification.RandomForestClassifer does not return `rawPrediction` column

2015-09-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10578.
---
   Resolution: Fixed
 Assignee: Joseph K. Bradley
Fix Version/s: 1.5.0

[~viirya] Yep, thanks for pointing out the right link!

> pyspark.ml.classification.RandomForestClassifer does not return 
> `rawPrediction` column
> --
>
> Key: SPARK-10578
> URL: https://issues.apache.org/jira/browse/SPARK-10578
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.4.0, 1.4.1
> Environment: CentOS, PySpark 1.4.1, Scala 2.10 
>Reporter: Karen Yin-Yee Ng
>Assignee: Joseph K. Bradley
> Fix For: 1.5.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> To use `pyspark.ml.classification.RandomForestClassifer` with 
> `BinaryClassificationEvaluator`, a column called `rawPrediction` needs to be 
> returned by the `RandomForestClassifer`. 
> The PySpark documentation example for `LogisticRegression` outputs the 
> `rawPrediction` column, but `RandomForestClassifier` does not.
> Therefore, one is unable to use `RandomForestClassifier` with the evaluator 
> or put it in a pipeline with cross validation.
> A relevant piece of code showing how to reproduce the bug can be found at:
> https://gist.github.com/karenyyng/cf61ae655b032f754bfb
> A relevant post about this possible bug can also be found at:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-td23791.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10574) HashingTF should use MurmurHash3

2015-09-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743834#comment-14743834
 ] 

Joseph K. Bradley commented on SPARK-10574:
---

I agree that switching to MurmurHash3 is a good idea.  As far as backwards 
compatibility goes, I feel the best thing we can do is to provide a new 
parameter which lets the user choose the hashing method.  I would vote for 
having it default to MurmurHash3, with an option to switch to the old hashing 
method (but with proper warnings).

We have not really made promises about backwards compatibility for HashingTF, 
but we will need to start making such promises after adding save/load for 
Pipelines.  We can include a release note about this change.
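
To make the proposal concrete, a hedged sketch (the parameter name and default 
below are assumptions, not a final API) of what choosing between the two hash 
functions could look like:

{code}
import scala.util.hashing.MurmurHash3

// Hypothetical parameterized term hashing for HashingTF-like feature indexing.
def termIndex(term: Any, numFeatures: Int, hashAlgorithm: String = "murmur3"): Int = {
  val h = hashAlgorithm match {
    case "murmur3" => MurmurHash3.stringHash(term.toString)
    case "native"  => term.##   // old behavior; platform-dependent for some types
    case other     => throw new IllegalArgumentException(s"Unknown algorithm: $other")
  }
  ((h % numFeatures) + numFeatures) % numFeatures   // non-negative index in [0, numFeatures)
}
{code}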

> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10578) pyspark.ml.classification.RandomForestClassifer does not return `rawPrediction` column

2015-09-14 Thread Karen Yin-Yee Ng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743831#comment-14743831
 ] 

Karen Yin-Yee Ng commented on SPARK-10578:
--

Thanks [~josephkb] and [~viirya] for the quick response.

> pyspark.ml.classification.RandomForestClassifer does not return 
> `rawPrediction` column
> --
>
> Key: SPARK-10578
> URL: https://issues.apache.org/jira/browse/SPARK-10578
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.4.0, 1.4.1
> Environment: CentOS, PySpark 1.4.1, Scala 2.10 
>Reporter: Karen Yin-Yee Ng
>Assignee: Joseph K. Bradley
> Fix For: 1.5.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> To use `pyspark.ml.classification.RandomForestClassifer` with 
> `BinaryClassificationEvaluator`, a column called `rawPrediction` needs to be 
> returned by the `RandomForestClassifer`. 
> The PySpark documentation example for `LogisticRegression` outputs the 
> `rawPrediction` column, but `RandomForestClassifier` does not.
> Therefore, one is unable to use `RandomForestClassifier` with the evaluator 
> or put it in a pipeline with cross validation.
> A relevant piece of code showing how to reproduce the bug can be found at:
> https://gist.github.com/karenyyng/cf61ae655b032f754bfb
> A relevant post about this possible bug can also be found at:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-td23791.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance

2015-09-14 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-10599:

Description: 
The BlockMatrix multiply sends each block to all the corresponding columns of 
the right BlockMatrix, even though there might not be any corresponding block 
to multiply with.

Some optimizations we can perform are:
 - Simulate the multiplication on the driver, and figure out which blocks 
actually need to be shuffled (see the sketch below)
 - Send the block once to a partition, and join inside the partition rather 
than sending multiple copies to the same partition
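
A rough sketch of the first optimization (names and shapes are illustrative 
only, not the eventual implementation): simulate the multiplication over block 
coordinates on the driver to find which blocks of the left matrix ever meet a 
block of the right matrix.

{code}
// Hypothetical driver-side simulation of C = A * B over block coordinates.
// aBlocks and bBlocks hold the (rowIndex, colIndex) of each non-empty block.
def blocksToShuffle(aBlocks: Seq[(Int, Int)], bBlocks: Seq[(Int, Int)]): Seq[(Int, Int)] = {
  val bRowIndices = bBlocks.map(_._1).toSet
  // A(i, k) is only needed if some B(k, j) exists to multiply it with.
  aBlocks.filter { case (_, k) => bRowIndices.contains(k) }
}
{code}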

> Decrease communication in BlockMatrix multiply and increase performance
> ---
>
> Key: SPARK-10599
> URL: https://issues.apache.org/jira/browse/SPARK-10599
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Burak Yavuz
>
> The BlockMatrix multiply sends each block to all the corresponding columns of 
> the right BlockMatrix, even though there might not be any corresponding block 
> to multiply with.
> Some optimizations we can perform are:
>  - Simulate the multiplication on the driver, and figure out which blocks 
> actually need to be shuffled
>  - Send the block once to a partition, and join inside the partition rather 
> than sending multiple copies to the same partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10600) SparkSQL - Support for Not Exists in a Correlated Subquery

2015-09-14 Thread Richard Garris (JIRA)
Richard Garris created SPARK-10600:
--

 Summary: SparkSQL - Support for Not Exists in a Correlated Subquery
 Key: SPARK-10600
 URL: https://issues.apache.org/jira/browse/SPARK-10600
 Project: Spark
  Issue Type: Improvement
Reporter: Richard Garris


Spark SQL currently does not support NOT EXISTS clauses in correlated 
subqueries, e.g.:

SELECT * FROM TABLE_A WHERE NOT EXISTS (SELECT 1 FROM TABLE_B WHERE TABLE_B.id 
= TABLE_A.id)
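
Until NOT EXISTS is supported, such a query can usually be rewritten as a left 
outer join that keeps only the unmatched rows. A hedged DataFrame sketch 
(assuming {{tableA}} and {{tableB}} are DataFrames for the two tables above):

{code}
// Hypothetical anti-join rewrite of the NOT EXISTS query above.
val bIds = tableB.select(tableB("id").as("b_id")).distinct
val result = tableA
  .join(bIds, tableA("id") === bIds("b_id"), "left_outer")
  .where(bIds("b_id").isNull)   // keep rows of TABLE_A with no match in TABLE_B
  .drop("b_id")
{code}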





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10597) MultivariateOnlineSummarizer for weighted instances

2015-09-14 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-10597:

Description: 
MultivariateOnlineSummarizer for weighted instances is implemented as a private 
API for SPARK-7685.

In SPARK-7685, the online, numerically stable version of the unbiased estimate 
of variance defined by the reliability weights 
([[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]]) 
is implemented, but we would like to make it a public API since there are 
different use cases.

Currently, `count` returns the actual number of instances and ignores 
instance weights, but `numNonzeros` returns the weighted # of nonzeros. 

We need to decide on their behavior before making the API public.

  was:
MultivariateOnlineSummarizer for weighted instances is implemented as private 
API for #SPARK-7685.

In #SPARK-7685, the online numerical stable version of unbiased estimation of 
variance defined by the reliability weights: 
[[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]] 
is implemented, but we would like to make it as public api since there are 
different use-cases.

Currently, `count` will return the actual number of instances, and ignores 
instance weights, but `numNonzeros` will return the weighted # of nonzeros. 

We need to decide the behavior of them before making it public.


> MultivariateOnlineSummarizer for weighted instances
> ---
>
> Key: SPARK-10597
> URL: https://issues.apache.org/jira/browse/SPARK-10597
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: DB Tsai
>
> MultivariateOnlineSummarizer for weighted instances is implemented as a private 
> API for SPARK-7685.
> In SPARK-7685, the online, numerically stable version of the unbiased estimate 
> of variance defined by the reliability weights 
> ([[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]]) 
> is implemented, but we would like to make it a public API since there are 
> different use cases.
> Currently, `count` returns the actual number of instances and ignores 
> instance weights, but `numNonzeros` returns the weighted # of nonzeros. 
> We need to decide on their behavior before making the API public.
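
For reference, a small batch-form sketch (illustrative only; the summarizer 
computes this incrementally) of the weighted mean and the unbiased variance 
under reliability weights from the link above:

{code}
// Weighted mean and reliability-weighted unbiased variance, batch form.
def weightedMeanAndVariance(values: Seq[Double], weights: Seq[Double]): (Double, Double) = {
  val wSum  = weights.sum                                    // V1
  val w2Sum = weights.map(w => w * w).sum                    // V2
  val mean  = values.zip(weights).map { case (x, w) => w * x }.sum / wSum
  val sqDev = values.zip(weights).map { case (x, w) => w * (x - mean) * (x - mean) }.sum
  (mean, sqDev / (wSum - w2Sum / wSum))                      // unbiased under reliability weights
}
{code}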



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10597) MultivariateOnlineSummarizer for weighted instances

2015-09-14 Thread DB Tsai (JIRA)
DB Tsai created SPARK-10597:
---

 Summary: MultivariateOnlineSummarizer for weighted instances
 Key: SPARK-10597
 URL: https://issues.apache.org/jira/browse/SPARK-10597
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.5.0
Reporter: DB Tsai


MultivariateOnlineSummarizer for weighted instances is implemented as a private 
API for #SPARK-7685.

In #SPARK-7685, the online, numerically stable version of the unbiased estimate 
of variance defined by the reliability weights 
([[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]]) 
is implemented, but we would like to make it a public API since there are 
different use cases.

Currently, `count` returns the actual number of instances and ignores 
instance weights, but `numNonzeros` returns the weighted # of nonzeros. 

We need to decide on their behavior before making the API public.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10594:


Assignee: (was: Apache Spark)

> ApplicationMaster "--help" references the removed "--num-executors" option
> --
>
> Key: SPARK-10594
> URL: https://issues.apache.org/jira/browse/SPARK-10594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Erick Tryzelaar
>Priority: Trivial
> Attachments: 
> 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, 
> 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch
>
>
> The issue SPARK-9092 and commit 
> [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79]
>  removed the {{ApplicationMaster}} commandline argument {{--num-executors}}, 
> but its help message still references the 
> [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10593) sql lateral view same name gives wrong value

2015-09-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-10593:
--

Assignee: Davies Liu

> sql lateral view same name gives wrong value
> 
>
> Key: SPARK-10593
> URL: https://issues.apache.org/jira/browse/SPARK-10593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> This query will return wrong result:
> {code}
> select 
> insideLayer1.json as json_insideLayer1, 
> insideLayer2.json as json_insideLayer2 
> from (select '1' id) creatives 
> lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', 
> 'layer1') insideLayer1 as json 
> lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json 
> {code}
> It got 
> {code}
> ( {"layer2": "text inside layer 2"},  {"layer2": "text inside layer 2"})
> {code}
> instead of
> {code}
> ( {"layer2": "text inside layer 2"},  text inside layer 2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10598:
--
Description: (was: (Have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark please 
-- a number of these fields weren't quite right))

(Have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark please 
-- a number of these fields weren't quite right)

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Robin East
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10598:
--
Affects Version/s: (was: 1.4.0)
 Target Version/s:   (was: 1.5.0)
 Priority: Trivial  (was: Minor)
Fix Version/s: (was: 1.5.1)
  Description: (Have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark please 
-- a number of these fields weren't quite right)

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Robin East
>Priority: Trivial
>
> (Have a look at 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> please -- a number of these fields weren't quite right)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-09-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744170#comment-14744170
 ] 

Reynold Xin commented on SPARK-9325:


Do you want to support

collect(df$Age + 1) ?


> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10563) SparkContext's local properties should be cloned when inherited

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10563:
--
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> SparkContext's local properties should be cloned when inherited
> ---
>
> Key: SPARK-10563
> URL: https://issues.apache.org/jira/browse/SPARK-10563
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Currently, when a child thread inherits local properties from the parent 
> thread, it gets a reference to the parent's local properties and uses them as 
> default values.
> The problem, however, is that when the parent changes the value of an 
> existing property, the changes are reflected in the child thread! This has 
> very confusing semantics, especially in streaming.
> {code}
> private val localProperties = new InheritableThreadLocal[Properties] {
>   override protected def childValue(parent: Properties): Properties = new 
> Properties(parent)
>   override protected def initialValue(): Properties = new Properties()
> }
> {code}
> Instead, we should make a clone of the parent properties rather than passing 
> in a mutable reference.
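
One possible shape of the fix (a sketch only; the actual change may clone 
differently): copy the parent's entries into a fresh Properties object so later 
mutations in the parent are not visible to the child thread.

{code}
import java.util.Properties

val localProperties = new InheritableThreadLocal[Properties] {
  override protected def childValue(parent: Properties): Properties = {
    // snapshot the parent instead of chaining to it as live defaults
    val cloned = new Properties()
    cloned.putAll(parent)
    cloned
  }
  override protected def initialValue(): Properties = new Properties()
}
{code}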



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance

2015-09-14 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-10599:
---

 Summary: Decrease communication in BlockMatrix multiply and 
increase performance
 Key: SPARK-10599
 URL: https://issues.apache.org/jira/browse/SPARK-10599
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Burak Yavuz






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10599:


Assignee: Apache Spark

> Decrease communication in BlockMatrix multiply and increase performance
> ---
>
> Key: SPARK-10599
> URL: https://issues.apache.org/jira/browse/SPARK-10599
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> The BlockMatrix multiply sends each block to all the corresponding columns of 
> the right BlockMatrix, even though there might not be any corresponding block 
> to multiply with.
> Some optimizations we can perform are:
>  - Simulate the multiplication on the driver, and figure out which blocks 
> actually need to be shuffled
>  - Send the block once to a partition, and join inside the partition rather 
> than sending multiple copies to the same partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10598:
--
Assignee: Robin East

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Robin East
>Assignee: Robin East
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6981) [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext

2015-09-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6981.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 6356
[https://github.com/apache/spark/pull/6356]

> [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext
> 
>
> Key: SPARK-6981
> URL: https://issues.apache.org/jira/browse/SPARK-6981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Edoardo Vacchi
>Priority: Minor
> Fix For: 1.6.0
>
>
> In order to simplify extensibility with new strategies from third parties, it 
> would be better to factor SparkPlanner and QueryExecution into their own 
> classes. Dependent types add additional, unnecessary complexity; besides, 
> HiveContext would benefit from this change as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7040) Explore receiver-less DStream for Flume

2015-09-14 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744162#comment-14744162
 ] 

Tathagata Das commented on SPARK-7040:
--

I am not sure how a Direct API can be built for Flume, as Flume does not have any 
offsets or sequence numbers (correct me if I am wrong about this) to refer to 
the exact ranges of records / events. I am closing this JIRA for now, please 
reopen it if you think this is still relevant. 

> Explore receiver-less DStream for Flume
> ---
>
> Key: SPARK-7040
> URL: https://issues.apache.org/jira/browse/SPARK-7040
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> I am thinking about repurposing the FlumePollingInputDStream to make it more 
> parallel and pull in data like the DirectKafkaDStream. Since Flume does not 
> have a unique way of identifying specific event offsets, this will not be 
> once-only. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10593) sql lateral view same name gives wrong value

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10593:


Assignee: Apache Spark

> sql lateral view same name gives wrong value
> 
>
> Key: SPARK-10593
> URL: https://issues.apache.org/jira/browse/SPARK-10593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> This query will return wrong result:
> {code}
> select 
> insideLayer1.json as json_insideLayer1, 
> insideLayer2.json as json_insideLayer2 
> from (select '1' id) creatives 
> lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', 
> 'layer1') insideLayer1 as json 
> lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json 
> {code}
> It got 
> {code}
> ( {"layer2": "text inside layer 2"},  {"layer2": "text inside layer 2"})
> {code}
> instead of
> {code}
> ( {"layer2": "text inside layer 2"},  text inside layer 2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10593) sql lateral view same name gives wrong value

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10593:


Assignee: (was: Apache Spark)

> sql lateral view same name gives wrong value
> 
>
> Key: SPARK-10593
> URL: https://issues.apache.org/jira/browse/SPARK-10593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> This query will return wrong result:
> {code}
> select 
> insideLayer1.json as json_insideLayer1, 
> insideLayer2.json as json_insideLayer2 
> from (select '1' id) creatives 
> lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', 
> 'layer1') insideLayer1 as json 
> lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json 
> {code}
> It got 
> {code}
> ( {"layer2": "text inside layer 2"},  {"layer2": "text inside layer 2"})
> {code}
> instead of
> {code}
> ( {"layer2": "text inside layer 2"},  text inside layer 2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10593) sql lateral view same name gives wrong value

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744242#comment-14744242
 ] 

Apache Spark commented on SPARK-10593:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8755

> sql lateral view same name gives wrong value
> 
>
> Key: SPARK-10593
> URL: https://issues.apache.org/jira/browse/SPARK-10593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> This query will return wrong result:
> {code}
> select 
> insideLayer1.json as json_insideLayer1, 
> insideLayer2.json as json_insideLayer2 
> from (select '1' id) creatives 
> lateral view json_tuple('{"layer1": {"layer2": "text inside layer 2"}}', 
> 'layer1') insideLayer1 as json 
> lateral view json_tuple(insideLayer1.json, 'layer2') insideLayer2 as json 
> {code}
> It got 
> {code}
> ( {"layer2": "text inside layer 2"},  {"layer2": "text inside layer 2"})
> {code}
> instead of
> {code}
> ( {"layer2": "text inside layer 2"},  text inside layer 2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Robin East (JIRA)
Robin East created SPARK-10598:
--

 Summary: RoutingTablePartition toMessage method refers to bytes 
instead of bits
 Key: SPARK-10598
 URL: https://issues.apache.org/jira/browse/SPARK-10598
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.5.0, 1.4.1, 1.4.0
Reporter: Robin East
Priority: Minor
 Fix For: 1.5.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10522) Nanoseconds part of Timestamp should be positive in parquet

2015-09-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10522.

   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8674
[https://github.com/apache/spark/pull/8674]

> Nanoseconds part of Timestamp should be positive in parquet
> ---
>
> Key: SPARK-10522
> URL: https://issues.apache.org/jira/browse/SPARK-10522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
> Fix For: 1.6.0, 1.5.1
>
>
> If Timestamp is before unix epoch, the nanosecond part will be negative, Hive 
> can't read that back correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6981) [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext

2015-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6981:
-
Assignee: Edoardo Vacchi

> [SQL] SparkPlanner and QueryExecution should be factored out from SQLContext
> 
>
> Key: SPARK-6981
> URL: https://issues.apache.org/jira/browse/SPARK-6981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Edoardo Vacchi
>Assignee: Edoardo Vacchi
>Priority: Minor
> Fix For: 1.6.0
>
>
> In order to simplify extensibility with new strategies from third parties, it 
> would be better to factor SparkPlanner and QueryExecution into their own 
> classes. Dependent types add additional, unnecessary complexity; besides, 
> HiveContext would benefit from this change as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10543) Peak Execution Memory Quantile should be Per-task Basis

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10543.
---
  Resolution: Fixed
Assignee: Sen Fang
   Fix Version/s: 1.5.1
  1.6.0
Target Version/s: 1.6.0, 1.5.1

> Peak Execution Memory Quantile should be Per-task Basis
> ---
>
> Key: SPARK-10543
> URL: https://issues.apache.org/jira/browse/SPARK-10543
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Sen Fang
>Assignee: Sen Fang
>Priority: Minor
> Fix For: 1.6.0, 1.5.1
>
>
> Currently the Peak Execution Memory quantiles seem to be cumulative rather 
> than per-task. For example, I have seen a value of 2TB in one of my 
> jobs on the quantile metric but each individual task shows less than 1GB on 
> the bottom table.
> [~andrewor14] In your PR https://github.com/apache/spark/pull/7770, the 
> screenshot shows the Max Peak Execution Memory of 792.5KB while in the bottom 
> it's about 50KB per task (unless your workload is skewed)
> The fix seems straightforward: use the `update` rather than the `value` 
> from the accumulable. I'm happy to provide a PR if people agree this is the 
> right behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744451#comment-14744451
 ] 

Apache Spark commented on SPARK-10317:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/8758

> start-history-server.sh CLI parsing incompatible with HistoryServer's arg 
> parsing
> -
>
> Key: SPARK-10317
> URL: https://issues.apache.org/jira/browse/SPARK-10317
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Steve Loughran
>Priority: Trivial
>
> The history server has its argument parsing class in 
> {{HistoryServerArguments}}. However, this doesn't get involved in the 
> {{start-history-server.sh}} codepath where the $0 arg is assigned to  
> {{spark.history.fs.logDirectory}} and all other arguments are discarded (e.g. 
> {{--property-file}}).
> This prevents the other options from being used via this script.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10317:


Assignee: (was: Apache Spark)

> start-history-server.sh CLI parsing incompatible with HistoryServer's arg 
> parsing
> -
>
> Key: SPARK-10317
> URL: https://issues.apache.org/jira/browse/SPARK-10317
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Steve Loughran
>Priority: Trivial
>
> The history server has its argument parsing class in 
> {{HistoryServerArguments}}. However, this doesn't get involved in the 
> {{start-history-server.sh}} codepath where the $0 arg is assigned to  
> {{spark.history.fs.logDirectory}} and all other arguments are discarded (e.g. 
> {{--property-file}}).
> This prevents the other options from being used via this script.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10317) start-history-server.sh CLI parsing incompatible with HistoryServer's arg parsing

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10317:


Assignee: Apache Spark

> start-history-server.sh CLI parsing incompatible with HistoryServer's arg 
> parsing
> -
>
> Key: SPARK-10317
> URL: https://issues.apache.org/jira/browse/SPARK-10317
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Trivial
>
> The history server has its argument parsing class in 
> {{HistoryServerArguments}}. However, this doesn't get involved in the 
> {{start-history-server.sh}} codepath where the $0 arg is assigned to  
> {{spark.history.fs.logDirectory}} and all other arguments are discarded (e.g. 
> {{--property-file}}).
> This prevents the other options from being used via this script.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10603) Univariate statistics as UDAFs: multi-pass continuous stats

2015-09-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10603:
-

 Summary: Univariate statistics as UDAFs: multi-pass continuous 
stats
 Key: SPARK-10603
 URL: https://issues.apache.org/jira/browse/SPARK-10603
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SQL
Reporter: Joseph K. Bradley


See parent JIRA for more details. This subtask covers statistics for continuous 
values requiring multiple passes over the data, such as median and quantiles.

This JIRA is an umbrella. For individual stats, please create and link a new 
JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10573) IndexToString transformSchema adds output field as DoubleType

2015-09-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10573.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> IndexToString transformSchema adds output field as DoubleType
> -
>
> Key: SPARK-10573
> URL: https://issues.apache.org/jira/browse/SPARK-10573
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>Assignee: Nick Pritchard
> Fix For: 1.6.0
>
>
> Reproducible example:
> {code}
> val stage = new IndexToString().setInputCol("input").setOutputCol("output")
> val inSchema = StructType(Seq(StructField("input", DoubleType)))
> val outSchema = stage.transformSchema(inSchema)
> assert(outSchema("output").dataType == StringType)
> {code}
> The root cause seems to be that it uses {{NominalAttribute.toStructField}} 
> which assumes {{DoubleType}}. It would probably be better to just use 
> {{SchemaUtils.appendColumn}} and explicitly set the data type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-09-14 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744254#comment-14744254
 ] 

Davies Liu commented on SPARK-9325:
---

I would -1 on this.

I'm worried that once we have collect(Column)/head(Column), users will ask for 
count(Column)/first(Column)/Sum(Column)/Avg(Column), and then it's hard to tell 
which ones should be in and which should not. Adding APIs in R is harder than in 
Scala/Java/Python (because of namespaces), so we should be more careful about it.  

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10587) In pyspark, toDF() dosen't exsist in RDD object

2015-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10587.
---
Resolution: Not A Problem

> In pyspark, toDF() dosen't exsist in RDD object
> ---
>
> Key: SPARK-10587
> URL: https://issues.apache.org/jira/browse/SPARK-10587
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: SemiCoder
>
> I can't find the toDF() function on RDD.
> In pyspark.mllib.linalg.distributed, IndexedRowMatrix.__init__() requires 
> that rows be an RDD and executes rows.toDF(), but the RDD class in pyspark 
> doesn't have a toDF() function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats

2015-09-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10602:
-

 Summary: Univariate statistics as UDAFs: single-pass continuous 
stats
 Key: SPARK-10602
 URL: https://issues.apache.org/jira/browse/SPARK-10602
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SQL
Reporter: Joseph K. Bradley


See parent JIRA for more details.  This subtask covers statistics for 
continuous values requiring a single pass over the data, such as min and max.

This JIRA is an umbrella.  For individual stats, please create and link a new 
JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10591) False negative in QueryTest.checkAnswer

2015-09-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-10591:
---
Description: 
# For double and float, {{NaN == NaN}} is always {{false}}
# {{checkAnswer}} doesn't handle {{Map\[K, V\]}} properly. For example:
  {noformat}
  scala> Map(1 -> 2, 2 -> 1).toString
  res0: String = Map(1 -> 2, 2 -> 1)

  scala> Map(2 -> 1, 1 -> 2).toString
  res1: String = Map(2 -> 1, 1 -> 2)
  {noformat}
  We can't rely on {{toString}} to compare {{Map\[K, V\]}} instances.

Need to update {{checkAnswer}} to special case {{NaN}} and {{Map\[K, V\]}}.


  was:
# For double and float, `NaN == NaN` is always `false`
# `checkAnswer` doesn't handle `Map` properly. For example:
  {noformat}
  scala> Map(1 -> 2, 2 -> 1).toString
  res0: String = Map(1 -> 2, 2 -> 1)

  scala> Map(2 -> 1, 1 -> 2).toString
  res1: String = Map(2 -> 1, 1 -> 2)
  {noformat}
  We can't rely on `toString` to compare `Map` instances.

Need to update `checkAnswer` to special case `NaN` and `Map`.



> False negative in QueryTest.checkAnswer
> ---
>
> Key: SPARK-10591
> URL: https://issues.apache.org/jira/browse/SPARK-10591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.1, 1.5.0
>Reporter: Cheng Lian
>
> # For double and float, {{NaN == NaN}} is always {{false}}
> # {{checkAnswer}} doesn't handle {{Map\[K, V\]}} properly. For example:
>   {noformat}
>   scala> Map(1 -> 2, 2 -> 1).toString
>   res0: String = Map(1 -> 2, 2 -> 1)
>   scala> Map(2 -> 1, 1 -> 2).toString
>   res1: String = Map(2 -> 1, 1 -> 2)
>   {noformat}
>   We can't rely on {{toString}} to compare {{Map\[K, V\]}} instances.
> Need to update {{checkAnswer}} to special case {{NaN}} and {{Map\[K, V\]}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-09-14 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744167#comment-14744167
 ] 

Shivaram Venkataraman commented on SPARK-9325:
--

Just `collect` and maybe `head`. This is just to show / preview what is in a 
column and also to convert columns to local vectors / lists that can be used as 
a vector. I don't think we want to support other functions on this.

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10594:


Assignee: Apache Spark

> ApplicationMaster "--help" references the removed "--num-executors" option
> --
>
> Key: SPARK-10594
> URL: https://issues.apache.org/jira/browse/SPARK-10594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Erick Tryzelaar
>Assignee: Apache Spark
>Priority: Trivial
> Attachments: 
> 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, 
> 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch
>
>
> The issue SPARK-9092 and commit 
> [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79]
>  removed the {{ApplicationMaster}} commandline argument {{--num-executors}}, 
> but its help message still references the 
> [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744209#comment-14744209
 ] 

Apache Spark commented on SPARK-10594:
--

User 'erickt' has created a pull request for this issue:
https://github.com/apache/spark/pull/8754

> ApplicationMaster "--help" references the removed "--num-executors" option
> --
>
> Key: SPARK-10594
> URL: https://issues.apache.org/jira/browse/SPARK-10594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Erick Tryzelaar
>Priority: Trivial
> Attachments: 
> 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, 
> 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch
>
>
> The issue SPARK-9092 and commit 
> [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79]
> removed the {{ApplicationMaster}} command-line argument {{--num-executors}},
> but its help message still references the
> [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744275#comment-14744275
 ] 

Apache Spark commented on SPARK-10598:
--

User 'insidedctm' has created a pull request for this issue:
https://github.com/apache/spark/pull/8756

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Robin East
>Priority: Minor
> Fix For: 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10598:


Assignee: (was: Apache Spark)

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Robin East
>Priority: Minor
> Fix For: 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10598:


Assignee: Apache Spark

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Robin East
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10575) Wrap RDD.takeSample with scope

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10575:
--
Affects Version/s: 1.4.0
 Target Version/s: 1.6.0

> Wrap RDD.takeSample with scope
> --
>
> Key: SPARK-10575
> URL: https://issues.apache.org/jira/browse/SPARK-10575
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> Remove return statements in RDD.takeSample and wrap it withScope
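For context, a minimal self-contained sketch of the refactoring pattern (the names below 
are illustrative stand-ins, not the actual RDD code): {{withScope}} takes its body by 
name, so an early {{return}} inside the block would throw a non-local return out of the 
closure and has to be rewritten as a plain expression.

{code}
// Hypothetical stand-in for RDDOperationScope.withScope: it just evaluates the
// by-name body, which is why `return` inside the block is problematic.
object TakeSampleScopeSketch {
  def withScope[T](body: => T): T = body

  // Illustrative analogue of takeSample: the early `return new Array[T](0)` is
  // replaced by an if/else expression so everything stays inside the closure.
  def takeSampleLike(nums: Seq[Int], num: Int): Seq[Int] = withScope {
    if (num <= 0) Seq.empty[Int]
    else scala.util.Random.shuffle(nums).take(num)
  }

  def main(args: Array[String]): Unit = {
    println(takeSampleLike(Seq(1, 2, 3, 4), 2))  // e.g. List(3, 1)
  }
}
{code}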



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744359#comment-14744359
 ] 

Robin East commented on SPARK-10598:


Apologies - I have checked it out. You're referring to the Fix and Target Version 
fields, right?

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Robin East
>Assignee: Robin East
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10522) Nanoseconds part of Timestamp should be positive in parquet

2015-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10522:
--
Assignee: Davies Liu

> Nanoseconds part of Timestamp should be positive in parquet
> ---
>
> Key: SPARK-10522
> URL: https://issues.apache.org/jira/browse/SPARK-10522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.0, 1.5.1
>
>
> If a Timestamp is before the Unix epoch, the nanosecond part will be negative, and
> Hive can't read it back correctly.
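A small, hedged sketch of the normalization idea (my own illustration, not the actual 
patch): when splitting a pre-epoch timestamp into a day component and a nanos-of-day 
component, borrow one day so the nanos part stays in [0, 86400e9). For simplicity the 
day here is days since the Unix epoch; the real Parquet writer also converts it to a 
Julian day.

{code}
object NanosNormalizeSketch {
  val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000
  val NanosPerDay: Long = MicrosPerDay * 1000

  // `us` is microseconds since the Unix epoch (Spark SQL's internal timestamp unit).
  def toDayAndNanos(us: Long): (Long, Long) = {
    var day = us / MicrosPerDay
    var nanos = (us % MicrosPerDay) * 1000
    if (nanos < 0) {        // happens for timestamps before 1970-01-01
      nanos += NanosPerDay  // borrow a full day so the nanos part is non-negative
      day -= 1
    }
    (day, nanos)
  }

  def main(args: Array[String]): Unit = {
    println(toDayAndNanos(-1L))  // (-1, 86399999999000)
  }
}
{code}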



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10549) scala 2.11 spark on yarn with security - Repl doesn't work

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10549.
---
  Resolution: Fixed
   Fix Version/s: 1.5.1
  1.6.0
Target Version/s: 1.6.0, 1.5.1

> scala 2.11 spark on yarn with security - Repl doesn't work
> --
>
> Key: SPARK-10549
> URL: https://issues.apache.org/jira/browse/SPARK-10549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 1.6.0, 1.5.1
>
>
> Trying to run spark on secure yarn built with scala 2.11 results in failure 
> when trying to launch the spark shell.
>  ./bin/spark-shell --master yarn-client 
> Exception in thread "main" java.lang.ExceptionInInitializerError
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.Exception: Error: a secret key must be specified via the 
> spark.authenticate.secret config
> at 
> org.apache.spark.SecurityManager.generateSecretKey(SecurityManager.scala:395)
> at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:218)
> at org.apache.spark.repl.Main$.<init>(Main.scala:38)
> at org.apache.spark.repl.Main$.<init>(Main.scala)
> The reason is that it creates the SecurityManager before it sets
> SPARK_YARN_MODE to true.
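A simplified, hypothetical illustration of that ordering problem (class and flag names 
below are only stand-ins, not the actual repl code): a flag consulted in a constructor 
has to be set before the object is constructed.

{code}
object InitOrderSketch {
  // Stand-in for SecurityManager: reads the yarn-mode flag in its constructor.
  class Security {
    val yarnMode: Boolean = sys.props.get("SPARK_YARN_MODE").contains("true")
    require(yarnMode,
      "Error: a secret key must be specified via the spark.authenticate.secret config")
  }

  def broken(): Unit = {
    val sec = new Security()              // constructed first -> require fails
    sys.props("SPARK_YARN_MODE") = "true" // too late
  }

  def fixed(): Unit = {
    sys.props("SPARK_YARN_MODE") = "true" // set the flag before construction
    val sec = new Security()              // now takes the yarn code path
  }
}
{code}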



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7040) Explore receiver-less DStream for Flume

2015-09-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-7040.
--
Resolution: Invalid

> Explore receiver-less DStream for Flume
> ---
>
> Key: SPARK-7040
> URL: https://issues.apache.org/jira/browse/SPARK-7040
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Hari Shreedharan
>
> I am thinking about repurposing the FlumePollingInputDStream to make it more 
> parallel and pull in data like the DirectKafkaDStream. Since Flume does not 
> have a unique way of identifying specific event offsets, this will not be 
> once-only. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10573) IndexToString transformSchema adds output field as DoubleType

2015-09-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10573:
--
Fix Version/s: 1.5.1

> IndexToString transformSchema adds output field as DoubleType
> -
>
> Key: SPARK-10573
> URL: https://issues.apache.org/jira/browse/SPARK-10573
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>Assignee: Nick Pritchard
> Fix For: 1.6.0, 1.5.1
>
>
> Reproducible example:
> {code}
> val stage = new IndexToString().setInputCol("input").setOutputCol("output")
> val inSchema = StructType(Seq(StructField("input", DoubleType)))
> val outSchema = stage.transformSchema(inSchema)
> assert(outSchema("output").dataType == StringType)
> {code}
> The root cause seems to be that it uses {{NominalAttribute.toStructField}} 
> which assumes {{DoubleType}}. It would probably be better to just use 
> {{SchemaUtils.appendColumn}} and explicitly set the data type.
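A hedged sketch of what that suggestion could look like inside {{IndexToString}} 
(assuming a {{SchemaUtils.appendColumn(schema, name, dataType)}} helper as used by other 
ml.feature transformers; this is not the merged patch):

{code}
override def transformSchema(schema: StructType): StructType = {
  val inputColName = $(inputCol)
  val outputColName = $(outputCol)
  require(schema.fieldNames.contains(inputColName),
    s"Input column $inputColName does not exist.")
  require(!schema.fieldNames.contains(outputColName),
    s"Output column $outputColName already exists.")
  // Append the output column with an explicit StringType instead of relying on
  // NominalAttribute.toStructField's DoubleType default.
  SchemaUtils.appendColumn(schema, outputColName, StringType)
}
{code}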



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744322#comment-14744322
 ] 

Apache Spark commented on SPARK-10599:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/8757

> Decrease communication in BlockMatrix multiply and increase performance
> ---
>
> Key: SPARK-10599
> URL: https://issues.apache.org/jira/browse/SPARK-10599
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Burak Yavuz
>
> The BlockMatrix multiply sends each block to all the corresponding columns of 
> the right BlockMatrix, even though there might not be any corresponding block 
> to multiply with.
> Some optimizations we can perform are:
>  - Simulate the multiplication on the driver, and figure out which blocks 
> actually need to be shuffled
>  - Send the block once to a partition, and join inside the partition rather 
> than sending multiple copies to the same partition
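To make the first bullet concrete, here is a small self-contained sketch (illustrative 
names, not BlockMatrix's internal API) of simulating the multiplication on the driver 
using only block coordinates, so blocks of A that never meet a block of B are not 
shuffled at all:

{code}
object BlockMultiplySimulation {
  // aBlocks / bBlocks: coordinates (blockRow, blockCol) of the non-empty blocks.
  // For each block A(i, k), return the block-column indices j of B(k, j) it will
  // actually meet; an empty set means that block never needs to be shuffled.
  def destinationsForA(
      aBlocks: Set[(Int, Int)],
      bBlocks: Set[(Int, Int)]): Map[(Int, Int), Set[Int]] = {
    val bColsByRow: Map[Int, Set[Int]] =
      bBlocks.groupBy(_._1).mapValues(_.map(_._2)).toMap
    aBlocks.map { case (i, k) =>
      (i, k) -> bColsByRow.getOrElse(k, Set.empty[Int])
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val a = Set((0, 0), (1, 1))  // sparse left matrix: two blocks
    val b = Set((0, 0), (0, 2))  // right matrix only has blocks in block-row 0
    println(destinationsForA(a, b))
    // Map((0,0) -> Set(0, 2), (1,1) -> Set()) -- block (1,1) stays put
  }
}
{code}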



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10599) Decrease communication in BlockMatrix multiply and increase performance

2015-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10599:


Assignee: (was: Apache Spark)

> Decrease communication in BlockMatrix multiply and increase performance
> ---
>
> Key: SPARK-10599
> URL: https://issues.apache.org/jira/browse/SPARK-10599
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Burak Yavuz
>
> The BlockMatrix multiply sends each block to all the corresponding columns of 
> the right BlockMatrix, even though there might not be any corresponding block 
> to multiply with.
> Some optimizations we can perform are:
>  - Simulate the multiplication on the driver, and figure out which blocks 
> actually need to be shuffled
>  - Send the block once to a partition, and join inside the partition rather 
> than sending multiple copies to the same partition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-10594.
---
  Resolution: Fixed
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0

> ApplicationMaster "--help" references the removed "--num-executors" option
> --
>
> Key: SPARK-10594
> URL: https://issues.apache.org/jira/browse/SPARK-10594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Erick Tryzelaar
>Priority: Trivial
> Fix For: 1.6.0
>
> Attachments: 
> 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, 
> 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch
>
>
> The issue SPARK-9092 and commit 
> [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79]
>  removed the {{ApplicationMaster}} commandline argument {{--num-executors}}, 
> but it's help message still references the 
> [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10594) ApplicationMaster "--help" references the removed "--num-executors" option

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10594:
--
Assignee: Erick Tryzelaar

> ApplicationMaster "--help" references the removed "--num-executors" option
> --
>
> Key: SPARK-10594
> URL: https://issues.apache.org/jira/browse/SPARK-10594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0
>Reporter: Erick Tryzelaar
>Assignee: Erick Tryzelaar
>Priority: Trivial
> Fix For: 1.6.0
>
> Attachments: 
> 0001-SPARK-10594-YARN-Remove-reference-to-num-executors.patch, 
> 0002-SPARK-10594-YARN-Document-ApplicationMaster-properti.patch
>
>
> The issue SPARK-9092 and commit 
> [738f35|https://github.com/apache/spark/commit/738f353988dbf02704bd63f5e35d94402c59ed79]
>  removed the {{ApplicationMaster}} commandline argument {{--num-executors}}, 
> but it's help message still references the 
> [argument|https://github.com/apache/spark/blob/738f353988dbf02704bd63f5e35d94402c59ed79/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala#L108].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9996) Create local nested loop join operator

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-9996.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Create local nested loop join operator
> --
>
> Key: SPARK-9996
> URL: https://issues.apache.org/jira/browse/SPARK-9996
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9997) Create local Expand operator

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-9997.
--
  Resolution: Fixed
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0

> Create local Expand operator
> 
>
> Key: SPARK-9997
> URL: https://issues.apache.org/jira/browse/SPARK-9997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2015-09-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744508#comment-14744508
 ] 

Joseph K. Bradley commented on SPARK-8418:
--

Apologies for being AWOL!  I'd definitely appreciate help with designing this 
improvement.

For API (Vector vs. Map): I prefer sticking with a Vector API.  I see the 
appeal of keeping columns separate, but DataFrames are not yet meant to handle 
too many columns (hundreds at most, I'd say).  We can still keep feature names 
and metadata using ML attributes (which describe each feature in Vector columns 
in DataFrames).

For sharing code, we should definitely do option 2.  For backwards 
compatibility, we should not modify current Params, but we could add a new one 
for multiple inputs (and check for conflicting settings when running).  I would 
hope we could share code in this multi-value transformation so that each 
transformer only needs to specify how to transform a single value.  I hope we 
can do this, rather than implementing option 1 as the default.

Would you mind sketching up a quick design doc?  That should help clarify the 
different options and help us choose a simple but flexible API.  If you'd like 
to follow existing examples, here are some you could look at:
* Classification threshold (shorter doc): 
[https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing]
* R-like stats for model (long doc): 
[https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]

These items we've discussed can be sketched out in the doc.

After you link it from this JIRA, others can give you feedback here (better than on 
the doc, since some people have trouble viewing Google Docs).

Thanks very much!

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10604) Univariate statistics as UDAFs: categorical stats

2015-09-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-10604:
-

 Summary: Univariate statistics as UDAFs: categorical stats
 Key: SPARK-10604
 URL: https://issues.apache.org/jira/browse/SPARK-10604
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SQL
Reporter: Joseph K. Bradley


See parent JIRA for more details. This subtask covers statistics for 
categorical values, such as number of categories or mode.

This JIRA is an umbrella. For individual stats, please create and link a new 
JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-09-14 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744175#comment-14744175
 ] 

Shivaram Venkataraman commented on SPARK-9325:
--

Hmm not necessarily. If `df$newAge <- df$Age + 1; collect(df$newAge)` works 
that is fine. (the first line already works btw !) 

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10563) SparkContext's local properties should be cloned when inherited

2015-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744291#comment-14744291
 ] 

Apache Spark commented on SPARK-10563:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/8721

> SparkContext's local properties should be cloned when inherited
> ---
>
> Key: SPARK-10563
> URL: https://issues.apache.org/jira/browse/SPARK-10563
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Currently, when a child thread inherits local properties from the parent 
> thread, it gets a reference to the parent's local properties and uses them as 
> default values.
> The problem, however, is that when the parent changes the value of an 
> existing property, the changes are reflected in the child thread! This has 
> very confusing semantics, especially in streaming.
> {code}
> private val localProperties = new InheritableThreadLocal[Properties] {
>   override protected def childValue(parent: Properties): Properties = new 
> Properties(parent)
>   override protected def initialValue(): Properties = new Properties()
> }
> {code}
> Instead, we should make a clone of the parent properties rather than passing 
> in a mutable reference.
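A hedged sketch of that change (one possible way to clone; the actual fix may differ 
in detail):

{code}
import java.util.Properties
import scala.collection.JavaConverters._

private val localProperties = new InheritableThreadLocal[Properties] {
  override protected def childValue(parent: Properties): Properties = {
    // Copy every visible key into a fresh Properties object so later mutations in
    // the parent thread are no longer seen by the child thread.
    val child = new Properties()
    // stringPropertyNames() also walks the defaults chain, so keys that were only
    // reachable through `new Properties(parent)` defaults are copied as well.
    parent.stringPropertyNames().asScala.foreach { key =>
      child.setProperty(key, parent.getProperty(key))
    }
    child
  }
  override protected def initialValue(): Properties = new Properties()
}
{code}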



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10575) Wrap RDD.takeSample with scope

2015-09-14 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10575:
--
Assignee: Vinod KC

> Wrap RDD.takeSample with scope
> --
>
> Key: SPARK-10575
> URL: https://issues.apache.org/jira/browse/SPARK-10575
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Minor
>
> Remove return statements in RDD.takeSample and wrap it withScope



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10601) Spark SQL - Support for MINUS

2015-09-14 Thread Richard Garris (JIRA)
Richard Garris created SPARK-10601:
--

 Summary: Spark SQL - Support for MINUS
 Key: SPARK-10601
 URL: https://issues.apache.org/jira/browse/SPARK-10601
 Project: Spark
  Issue Type: Improvement
Reporter: Richard Garris


Spark SQL does not currently support SQL MINUS.
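As a workaround until MINUS is recognized, the same result can usually be obtained with 
{{EXCEPT}} (which the SQL parser accepts, if memory serves) or with {{DataFrame.except}}. 
Table and column names below are made up:

{code}
// rows present in orders_2014 but not in orders_2015
val onlyIn2014 = sqlContext.sql(
  "SELECT order_id FROM orders_2014 EXCEPT SELECT order_id FROM orders_2015")

// equivalent with the DataFrame API
val left  = sqlContext.table("orders_2014").select("order_id")
val right = sqlContext.table("orders_2015").select("order_id")
left.except(right).show()
{code}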






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10587) In pyspark, toDF() dosen't exsist in RDD object

2015-09-14 Thread SemiCoder (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744643#comment-14744643
 ] 

SemiCoder commented on SPARK-10587:
---

It's not my code; it's code in the latest released version. 
In fact, when I create an IndexedRowMatrix I pass a parameter "rows", and the 
__init__ method checks whether it is an RDD; if it is, it calls a Java function 
whose argument is "rows.toDF()". However, toDF() doesn't exist on RDD. I know it 
exists in SQLContext. I think this is an error in 
python/pyspark/mllib/linalg/distributed.py. Otherwise, could you tell me how to 
create an RDD that has a toDF() function, to avoid this situation? 

> In pyspark, toDF() dosen't exsist in RDD object
> ---
>
> Key: SPARK-10587
> URL: https://issues.apache.org/jira/browse/SPARK-10587
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: SemiCoder
>
> I can't find a toDF() function on RDD.
> In pyspark.mllib.linalg.distributed, IndexedRowMatrix.__init__() requires that 
> rows be an RDD and executes rows.toDF(), but the RDD in pyspark doesn't have a 
> toDF() function.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


