[jira] [Created] (SPARK-18241) If Spark Launcher fails to startApplication then handle's state does not change

2016-11-02 Thread Aseem Bansal (JIRA)
Aseem Bansal created SPARK-18241:


 Summary: If Spark Launcher fails to startApplication then handle's 
state does not change
 Key: SPARK-18241
 URL: https://issues.apache.org/jira/browse/SPARK-18241
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.0.0
Reporter: Aseem Bansal


I am using Spark 2.0.0 and submitting my job through the SparkLauncher API 
(https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/launcher/SparkLauncher.html).

If there is a failure after the launcher's startApplication has been called but 
before the Spark job has actually started (i.e. while starting the Spark process 
that submits the job itself), there is
* no exception in the main thread that is submitting the job
* no exception in the job, as it has not started
* no state change on the launcher's handle
* only an exception logged to the error stream under the default logger name 
that Spark derives from the job's main class.

Basically, it is not possible to catch an exception if it happens during that 
window. The easiest way to reproduce it is to delete the application JAR or use 
an invalid Spark home while launching the job with SparkLauncher.
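
A minimal sketch of the launch pattern in question (paths and class names below 
are placeholders); on an affected version, if the spark-submit child process 
dies before connecting back, the listener may never see the handle leave its 
initial state and no exception reaches this thread:

{code}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LaunchExample {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setSparkHome("/opt/spark")          // an invalid value here reproduces the report
      .setAppResource("/jobs/my-job.jar")  // deleting this JAR also reproduces it
      .setMainClass("com.example.MyJob")
      .setMaster("local[*]")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state changed: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })

    // On the affected versions the state may simply stay at its initial value
    // (UNKNOWN) and no exception surfaces here.
    println(s"state after launch: ${handle.getState}")
  }
}
{code}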



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18200.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.1.0
   2.0.3

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>Assignee: Dongjoon Hyun
>  Labels: graph, graphx
> Fix For: 2.0.3, 2.1.0
>
>
> Running GraphX triangle count on a large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6, the query completes without any 
> problems: http://bit.ly/2fATO1M
> The GraphFrames version of this code also runs fine (Spark 2.0, 
> GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stackoverflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)
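
A minimal repro sketch on an affected version, assuming an existing 
SparkContext `sc` and a placeholder edge-list path; the failure reportedly 
surfaces inside the triangle-count stage rather than in user code:

{code}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Load an edge list and partition it the way triangle counting expects.
val graph = GraphLoader
  .edgeListFile(sc, "/data/edges.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

// On Spark 2.0.x this is where "requirement failed: Invalid initial capacity"
// was reported; on 1.6 the same code completes.
val triangleCounts = graph.triangleCount().vertices
triangleCounts.take(5).foreach(println)
{code}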



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18240) Add Summary of BiKMeans and GMM in pyspark

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18240:


Assignee: Apache Spark

> Add Summary of BiKMeans and GMM in pyspark
> --
>
> Key: SPARK-18240
> URL: https://issues.apache.org/jira/browse/SPARK-18240
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> Add Summary of BiKMeans and GMM in pyspark.
> Since KMeansSummary in pyspark will be added in SPARK-15819, this JIRA will 
> not deal with KMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18240) Add Summary of BiKMeans and GMM in pyspark

2016-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631823#comment-15631823
 ] 

Apache Spark commented on SPARK-18240:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15748

> Add Summary of BiKMeans and GMM in pyspark
> --
>
> Key: SPARK-18240
> URL: https://issues.apache.org/jira/browse/SPARK-18240
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Add Summary of BiKMeans and GMM in pyspark.
> Since KMeansSummary in pyspark will be added in SPARK-15819, this JIRA will 
> not deal with KMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18240) Add Summary of BiKMeans and GMM in pyspark

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18240:


Assignee: (was: Apache Spark)

> Add Summary of BiKMeans and GMM in pyspark
> --
>
> Key: SPARK-18240
> URL: https://issues.apache.org/jira/browse/SPARK-18240
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>
> Add Summary of BiKMeans and GMM in pyspark.
> Since KMeansSummary in pyspark will be added in SPARK-15819, this JIRA will 
> not deal with KMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18240) Add Summary of BiKMeans and GMM in pyspark

2016-11-02 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18240:


 Summary: Add Summary of BiKMeans and GMM in pyspark
 Key: SPARK-18240
 URL: https://issues.apache.org/jira/browse/SPARK-18240
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: zhengruifeng


Add Summary of BiKMeans and GMM in pyspark.
Since KMeansSummary in pyspark will be added in SPARK-15819, this JIRA will not 
deal with KMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15507) ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-11-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631810#comment-15631810
 ] 

Reynold Xin commented on SPARK-15507:
-

That's expected, isn't it?


> ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row
> 
>
> Key: SPARK-15507
> URL: https://issues.apache.org/jira/browse/SPARK-15507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT may 24
>Reporter: koert kuipers
>
> given this code:
> {noformat}
> case class Test(a: Int, b: String)
> val rdd = sc.parallelize(List(Row(List(Test(5, "ha"), Test(6, "ba")))))
> val schema = StructType(Seq(
>   StructField("x", ArrayType(
> StructType(Seq(
>   StructField("a", IntegerType, false),
>   StructField("b", StringType, true)
> )),
> true)
>   , true)
>   ))
> val df = sqlc.createDataFrame(rdd, schema)
> df.show
> {noformat}
> this works fine in spark 1.6.1 and gives:
> {noformat}
> ++
> |   x|
> ++
> |[[5,ha], [6,ba]]|
> ++
> {noformat}
> but in spark 2.0.0-SNAPSHOT i get:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.RuntimeException: Error while encoding: 
> java.lang.ClassCastException: Test cannot be cast to org.apache.spark.sql.Row
> [info] getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, x, 
> IntegerType) AS x#0
> [info] +- getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, 
> x, IntegerType)
> [info]+- input[0, org.apache.spark.sql.Row, false]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18227) Parquet file stream sink create a hidden directory "_spark_metadata" cause the DataFrame read from directory failed

2016-11-02 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631760#comment-15631760
 ] 

Liwei Lin commented on SPARK-18227:
---

Thanks for reporting this. It used to be a problem but has been fixed, at least 
in master (please see 
https://github.com/apache/spark/blob/4ef39c2f4436fa22d0b957fe7ad477e4c4a16452/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L406-L413).
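
For the 2.0.x releases, a possible workaround (a sketch only, not a confirmed 
fix) is to point the batch reader at the sink directory itself rather than at 
a glob, so the sink's metadata log is recognized instead of being picked up as 
a data file; the path below is a placeholder:

{code}
// Reading through a glob such as "/path/out/*" makes the planner treat
// _spark_metadata/0 as an ordinary Parquet file. Reading the directory
// directly may avoid that (verify on your version).
val df = spark.read.format("parquet").load("/path/out/")
{code}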

> Parquet file stream sink create a hidden directory "_spark_metadata" cause 
> the DataFrame read from directory failed
> ---
>
> Key: SPARK-18227
> URL: https://issues.apache.org/jira/browse/SPARK-18227
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
>Reporter: Lantao Jin
>
> When we set an output directory as a streaming sink with Parquet format in 
> Structured Streaming, all output Parquet files are written to this output 
> directory as the streaming job runs. However, it also creates a hidden 
> directory called "_spark_metadata" in the output directory. If we then load 
> the Parquet files from the output directory with "load", it throws a 
> RuntimeException and the task fails.
> {code:java}
> val stream = modifiedData.writeStream.format("parquet")
> .option("checkpointLocation", "/path/ck/")
> .start("/path/out/")
> val df1 = spark.read.format("parquet").load("/path/out/*")
> {code}
> {panel}
> 16/11/02 03:49:40 WARN TaskSetManager: Lost task 1.0 in stage 110.0 (TID 3131, cupid044.stratus.phx.ebay.com): java.lang.RuntimeException: hdfs:///path/out/_spark_metadata/0 is not a Parquet file (too small)
> at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:412)
> at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
> at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> {panel}
> That's because ParquetFileReader tries to read the metadata file as Parquet.
> I thought the cleanest way to fix it would be to move the metadata directory 
> to another path, but judging from DataSource.scala, it has no path 
> information other than the output directory to store it in. So skipping 
> hidden files and paths might be a better way. However, the stack trace above 
> shows the failure in initialize() in SpecificParquetRecordReaderBase, which 
> means the metadata files in the hidden directory have already been traversed 
> by the caller (FileScanRDD), and at that point no format information is 
> available to decide to skip a hidden directory (or it is beyond its 
> authority).
> So, what is the best way to fix it?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14393) values generated by non-deterministic functions shouldn't change after coalesce or union

2016-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631653#comment-15631653
 ] 

Apache Spark commented on SPARK-14393:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/15747

> values generated by non-deterministic functions shouldn't change after 
> coalesce or union
> 
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0, 2.0.1
>Reporter: Jason Piper
>Assignee: Xiangrui Meng
>Priority: Blocker
>  Labels: correctness, releasenotes
> Fix For: 2.1.0
>
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> +---+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  1|
> |  0|
> |  0|
> |  1|
> |  2|
> |  3|
> |  0|
> |  1|
> |  2|
> +---+
> {code}
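
The same repro in Scala (a sketch assuming a SparkSession named `spark`); on 
affected versions the IDs produced after coalesce restart from the same offset 
instead of remaining unique:

{code}
import org.apache.spark.sql.functions.monotonically_increasing_id

// Without coalesce: IDs are unique across partitions.
spark.range(10).select(monotonically_increasing_id()).show()

// With coalesce on an affected version: each original partition reportedly
// restarts from offset 0, so duplicate IDs appear.
spark.range(10).select(monotonically_increasing_id()).coalesce(1).show()
{code}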



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17822) JVMObjectTracker.objMap may leak JVM objects

2016-11-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631612#comment-15631612
 ] 

Felix Cheung commented on SPARK-17822:
--

I see. Is it possible that the R object is alive? Does running gc in R help?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/gc.html

It would be great if there is a way you could share what the R code looks like.



> JVMObjectTracker.objMap may leak JVM objects
> 
>
> Key: SPARK-17822
> URL: https://issues.apache.org/jira/browse/SPARK-17822
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> JVMObjectTracker.objMap is used to track JVM objects for SparkR. However, we 
> observed that JVM objects that are no longer used are still trapped in this 
> map, which prevents those objects from being GCed. 
> It seems to make sense to use weak references (like persistentRdds in 
> SparkContext). 
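
A minimal sketch of the weak-reference idea (not Spark's actual code; the class 
and method names below are illustrative): values are held through WeakReference 
so an entry stops pinning the JVM object once nothing else references it.

{code}
import java.lang.ref.WeakReference
import scala.collection.concurrent.TrieMap

// Illustrative tracker, not the real JVMObjectTracker: values are wrapped in
// WeakReference so unused objects can be garbage-collected; callers must
// handle the case where the referent has already been collected.
class WeakObjectTracker {
  private val objMap = TrieMap.empty[String, WeakReference[AnyRef]]

  def put(id: String, obj: AnyRef): Unit =
    objMap.put(id, new WeakReference(obj))

  // Returns None once the tracked object has been garbage-collected.
  def get(id: String): Option[AnyRef] =
    objMap.get(id).flatMap(ref => Option(ref.get()))

  def remove(id: String): Unit =
    objMap.remove(id)
}
{code}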



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18239) Gradient Boosted Tree wrapper in SparkR

2016-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631585#comment-15631585
 ] 

Apache Spark commented on SPARK-18239:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/15746

> Gradient Boosted Tree wrapper in SparkR
> ---
>
> Key: SPARK-18239
> URL: https://issues.apache.org/jira/browse/SPARK-18239
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.0.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18239) Gradient Boosted Tree wrapper in SparkR

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18239:


Assignee: Felix Cheung  (was: Apache Spark)

> Gradient Boosted Tree wrapper in SparkR
> ---
>
> Key: SPARK-18239
> URL: https://issues.apache.org/jira/browse/SPARK-18239
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.0.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18239) Gradient Boosted Tree wrapper in SparkR

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18239:


Assignee: Apache Spark  (was: Felix Cheung)

> Gradient Boosted Tree wrapper in SparkR
> ---
>
> Key: SPARK-18239
> URL: https://issues.apache.org/jira/browse/SPARK-18239
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.0.1
>Reporter: Felix Cheung
>Assignee: Apache Spark
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18239) Gradient Boosted Tree wrapper in SparkR

2016-11-02 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18239:
-
Component/s: ML

> Gradient Boosted Tree wrapper in SparkR
> ---
>
> Key: SPARK-18239
> URL: https://issues.apache.org/jira/browse/SPARK-18239
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.0.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18239) Gradient Boosted Tree wrapper in SparkR

2016-11-02 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18239:


 Summary: Gradient Boosted Tree wrapper in SparkR
 Key: SPARK-18239
 URL: https://issues.apache.org/jira/browse/SPARK-18239
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Felix Cheung
Assignee: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-11-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631551#comment-15631551
 ] 

Felix Cheung commented on SPARK-15799:
--

Hi - how are we doing on this? With Spark 2.0.0 and 2.0.1 addressing most of 
the issues, what is the next step toward publishing to CRAN?
Would the upcoming 2.0.2 or 2.1.0 be a good target for that?


> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15507) ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-11-02 Thread Anbu Cheeralan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631543#comment-15631543
 ] 

Anbu Cheeralan edited comment on SPARK-15507 at 11/3/16 4:38 AM:
-

This also creates issues in float to double conversion.
Error:
Caused by: java.lang.RuntimeException: java.lang.Float is not a valid external 
type for schema of double

{code}
val schema = StructType(Seq(
  StructField("a", DoubleType, true)
))
val rdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))

val df = spark.createDataFrame(rdd, schema)
df.show
{code}
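
For reference, a small sketch (assuming an existing SparkSession named `spark`) 
of two ways the error above can be avoided by making the external value and the 
declared type agree; Spark does not widen an external Float to a declared 
DoubleType here:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, FloatType, StructField, StructType}

// Option 1: declare the column as FloatType to match the Float values.
val floatSchema = StructType(Seq(StructField("a", FloatType, true)))
val floatRdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))
spark.createDataFrame(floatRdd, floatSchema).show()

// Option 2: keep DoubleType in the schema and supply Double values.
val doubleSchema = StructType(Seq(StructField("a", DoubleType, true)))
val doubleRdd = spark.sparkContext.parallelize(Seq(Row(1.1d)))
spark.createDataFrame(doubleRdd, doubleSchema).show()
{code}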


was (Author: alunarbeach):
This also creates issues in float to Double conversion as well.
Error:
Caused by: java.lang.RuntimeException: java.lang.Float is not a valid external 
type for schema of double

{code}
val schema = StructType(Seq(
  StructField("a", DoubleType, true)
))
val rdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))

val df = spark.createDataFrame(rdd, schema)
df.show
{code}

> ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row
> 
>
> Key: SPARK-15507
> URL: https://issues.apache.org/jira/browse/SPARK-15507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT may 24
>Reporter: koert kuipers
>
> given this code:
> {noformat}
> case class Test(a: Int, b: String)
> val rdd = sc.parallelize(List(Row(List(Test(5, "ha"), Test(6, "ba")))))
> val schema = StructType(Seq(
>   StructField("x", ArrayType(
> StructType(Seq(
>   StructField("a", IntegerType, false),
>   StructField("b", StringType, true)
> )),
> true)
>   , true)
>   ))
> val df = sqlc.createDataFrame(rdd, schema)
> df.show
> {noformat}
> this works fine in spark 1.6.1 and gives:
> {noformat}
> ++
> |   x|
> ++
> |[[5,ha], [6,ba]]|
> ++
> {noformat}
> but in spark 2.0.0-SNAPSHOT i get:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.RuntimeException: Error while encoding: 
> java.lang.ClassCastException: Test cannot be cast to org.apache.spark.sql.Row
> [info] getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, x, 
> IntegerType) AS x#0
> [info] +- getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, 
> x, IntegerType)
> [info]+- input[0, org.apache.spark.sql.Row, false]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15507) ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-11-02 Thread Anbu Cheeralan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631543#comment-15631543
 ] 

Anbu Cheeralan edited comment on SPARK-15507 at 11/3/16 4:38 AM:
-

This also creates issues in float to Double conversion as well.
Error:
Caused by: java.lang.RuntimeException: java.lang.Float is not a valid external 
type for schema of double

{code}
val schema = StructType(Seq(
  StructField("a", DoubleType, true)
))
val rdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))

val df = spark.createDataFrame(rdd, schema)
df.show
{code}


was (Author: alunarbeach):
This also creates issues in float to Double conversion as well.
{code}
val schema = StructType(Seq(
  StructField("a", DoubleType, true)
))
val rdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))

val df = spark.createDataFrame(rdd, schema)
df.show
{code}

> ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row
> 
>
> Key: SPARK-15507
> URL: https://issues.apache.org/jira/browse/SPARK-15507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT may 24
>Reporter: koert kuipers
>
> given this code:
> {noformat}
> case class Test(a: Int, b: String)
> val rdd = sc.parallelize(List(Row(List(Test(5, "ha"), Test(6, "ba")))))
> val schema = StructType(Seq(
>   StructField("x", ArrayType(
> StructType(Seq(
>   StructField("a", IntegerType, false),
>   StructField("b", StringType, true)
> )),
> true)
>   , true)
>   ))
> val df = sqlc.createDataFrame(rdd, schema)
> df.show
> {noformat}
> this works fine in spark 1.6.1 and gives:
> {noformat}
> ++
> |   x|
> ++
> |[[5,ha], [6,ba]]|
> ++
> {noformat}
> but in spark 2.0.0-SNAPSHOT i get:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.RuntimeException: Error while encoding: 
> java.lang.ClassCastException: Test cannot be cast to org.apache.spark.sql.Row
> [info] getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, x, 
> IntegerType) AS x#0
> [info] +- getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, 
> x, IntegerType)
> [info]+- input[0, org.apache.spark.sql.Row, false]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15507) ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-11-02 Thread Anbu Cheeralan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631543#comment-15631543
 ] 

Anbu Cheeralan edited comment on SPARK-15507 at 11/3/16 4:37 AM:
-

This also creates issues in float to Double conversion as well.
{code:scala}
val schema = StructType(Seq(
  StructField("a", DoubleType, true)
))
val rdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))

val df = spark.createDataFrame(rdd, schema)
df.show
{code}


was (Author: alunarbeach):
This also creates issues in float to Double conversion as well.
``` scala
val schema = StructType(Seq(
  StructField("a", DoubleType, true)
))
val rdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))

val df = spark.createDataFrame(rdd, schema)
df.show
```

> ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row
> 
>
> Key: SPARK-15507
> URL: https://issues.apache.org/jira/browse/SPARK-15507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT may 24
>Reporter: koert kuipers
>
> given this code:
> {noformat}
> case class Test(a: Int, b: String)
> val rdd = sc.parallelize(List(Row(List(Test(5, "ha"), Test(6, "ba")))))
> val schema = StructType(Seq(
>   StructField("x", ArrayType(
> StructType(Seq(
>   StructField("a", IntegerType, false),
>   StructField("b", StringType, true)
> )),
> true)
>   , true)
>   ))
> val df = sqlc.createDataFrame(rdd, schema)
> df.show
> {noformat}
> this works fine in spark 1.6.1 and gives:
> {noformat}
> ++
> |   x|
> ++
> |[[5,ha], [6,ba]]|
> ++
> {noformat}
> but in spark 2.0.0-SNAPSHOT i get:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.RuntimeException: Error while encoding: 
> java.lang.ClassCastException: Test cannot be cast to org.apache.spark.sql.Row
> [info] getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, x, 
> IntegerType) AS x#0
> [info] +- getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, 
> x, IntegerType)
> [info]+- input[0, org.apache.spark.sql.Row, false]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15507) ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-11-02 Thread Anbu Cheeralan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631543#comment-15631543
 ] 

Anbu Cheeralan edited comment on SPARK-15507 at 11/3/16 4:37 AM:
-

This also creates issues in float to Double conversion as well.
{code}
val schema = StructType(Seq(
  StructField("a", DoubleType, true)
))
val rdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))

val df = spark.createDataFrame(rdd, schema)
df.show
{code}


was (Author: alunarbeach):
This also creates issues in float to Double conversion as well.
{code:scala}
val schema = StructType(Seq(
  StructField("a", DoubleType, true)
))
val rdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))

val df = spark.createDataFrame(rdd, schema)
df.show
{code}

> ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row
> 
>
> Key: SPARK-15507
> URL: https://issues.apache.org/jira/browse/SPARK-15507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT may 24
>Reporter: koert kuipers
>
> given this code:
> {noformat}
> case class Test(a: Int, b: String)
> val rdd = sc.parallelize(List(Row(List(Test(5, "ha"), Test(6, "ba")))))
> val schema = StructType(Seq(
>   StructField("x", ArrayType(
> StructType(Seq(
>   StructField("a", IntegerType, false),
>   StructField("b", StringType, true)
> )),
> true)
>   , true)
>   ))
> val df = sqlc.createDataFrame(rdd, schema)
> df.show
> {noformat}
> this works fine in spark 1.6.1 and gives:
> {noformat}
> ++
> |   x|
> ++
> |[[5,ha], [6,ba]]|
> ++
> {noformat}
> but in spark 2.0.0-SNAPSHOT i get:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.RuntimeException: Error while encoding: 
> java.lang.ClassCastException: Test cannot be cast to org.apache.spark.sql.Row
> [info] getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, x, 
> IntegerType) AS x#0
> [info] +- getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, 
> x, IntegerType)
> [info]+- input[0, org.apache.spark.sql.Row, false]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15507) ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row

2016-11-02 Thread Anbu Cheeralan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631543#comment-15631543
 ] 

Anbu Cheeralan commented on SPARK-15507:


This also creates issues in float to Double conversion as well.
``` scala
val schema = StructType(Seq(
  StructField("a", DoubleType, true)
))
val rdd = spark.sparkContext.parallelize(Seq(Row(1.1f)))

val df = spark.createDataFrame(rdd, schema)
df.show
```

> ClassCastException: SomeCaseClass cannot be cast to org.apache.spark.sql.Row
> 
>
> Key: SPARK-15507
> URL: https://issues.apache.org/jira/browse/SPARK-15507
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT may 24
>Reporter: koert kuipers
>
> given this code:
> {noformat}
> case class Test(a: Int, b: String)
> val rdd = sc.parallelize(List(Row(List(Test(5, "ha"), Test(6, "ba")))))
> val schema = StructType(Seq(
>   StructField("x", ArrayType(
> StructType(Seq(
>   StructField("a", IntegerType, false),
>   StructField("b", StringType, true)
> )),
> true)
>   , true)
>   ))
> val df = sqlc.createDataFrame(rdd, schema)
> df.show
> {noformat}
> this works fine in spark 1.6.1 and gives:
> {noformat}
> ++
> |   x|
> ++
> |[[5,ha], [6,ba]]|
> ++
> {noformat}
> but in spark 2.0.0-SNAPSHOT i get:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.lang.RuntimeException: Error while encoding: 
> java.lang.ClassCastException: Test cannot be cast to org.apache.spark.sql.Row
> [info] getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, x, 
> IntegerType) AS x#0
> [info] +- getexternalrowfield(input[0, org.apache.spark.sql.Row, false], 0, 
> x, IntegerType)
> [info]+- input[0, org.apache.spark.sql.Row, false]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18175) Improve the test case coverage of implicit type casting

2016-11-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-18175:
---

Assignee: Xiao Li

> Improve the test case coverage of implicit type casting
> ---
>
> Key: SPARK-18175
> URL: https://issues.apache.org/jira/browse/SPARK-18175
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> So far, we have limited test case coverage of implicit type casting. We need 
> to draw a matrix to find all the possible casting pairs. 
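
As a hedged illustration of one such pair (string and integer operands in 
arithmetic; assumes a SparkSession named `spark`), the kind of case such a 
matrix would enumerate:

{code}
// One implicit-cast pair: a string operand in arithmetic is promoted
// (to double on Spark 2.x), so this is expected to return 3.0 rather than
// fail. A casting matrix would enumerate all such type pairs.
spark.sql("SELECT '1' + 2 AS result").show()
{code}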



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18175) Improve the test case coverage of implicit type casting

2016-11-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18175.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15691
[https://github.com/apache/spark/pull/15691]

> Improve the test case coverage of implicit type casting
> ---
>
> Key: SPARK-18175
> URL: https://issues.apache.org/jira/browse/SPARK-18175
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiao Li
> Fix For: 2.1.0
>
>
> So far, we have limited test case coverage of implicit type casting. We need 
> to draw a matrix to find all the possible casting pairs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-02 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631463#comment-15631463
 ] 

holdenk commented on SPARK-18128:
-

Extracted from the discussion around SPARK-1267:

People who are officially allowed to make releases will need to register on 
PyPI and PyPI test, create .pypirc files with their credentials, and be added 
to the "pyspark" or "apache-pyspark" project (depending on the name that is 
chosen), and the release script will need to be updated slightly. Code-wise, 
the changes required for SPARK-18128 are relatively minor: whatever package 
name change may be required, adding a shell variable to control which PyPI 
server is being published to, and switching sdist to sdist upload during 
publishing.
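
A sketch of what such a .pypirc might look like; the server list, repository 
URLs, and usernames below are placeholders rather than an agreed-upon setup:

{code}
; Hedged sketch only - index servers, repositories, and usernames are placeholders.
[distutils]
index-servers =
    pypi
    pypitest

[pypi]
repository = https://pypi.python.org/pypi
username = <apache-release-manager>

[pypitest]
repository = https://testpypi.python.org/pypi
username = <apache-release-manager>
{code}

Publishing to the test server rather than the default would then be a matter 
of, for example, `python setup.py sdist upload -r pypitest`.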

> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about package name since 
> someone has registered the package name PySpark on PyPI - we could use 
> ApachePySpark or we could try and get find who registered PySpark and get 
> them to transfer it to us (since they haven't published anything so maybe 
> fine?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17963) Add examples (extend) in each function and improve documentation

2016-11-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17963:

Summary: Add examples (extend) in each function and improve documentation  
(was: Add examples (extend) in each function and improve documentation with 
arguments)

> Add examples (extend) in each function and improve documentation
> 
>
> Key: SPARK-17963
> URL: https://issues.apache.org/jira/browse/SPARK-17963
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, it seems function documentation is inconsistent and does not have 
> many examples ({{extended}}).
> For example, some functions have a bad indentation as below:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct;
> Function: approx_count_distinct
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
> Usage: approx_count_distinct(expr) - Returns the estimated cardinality by 
> HyperLogLog++.
> approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated 
> cardinality by HyperLogLog++
>   with relativeSD, the maximum estimation error allowed.
> Extended Usage:
> No example for approx_count_distinct.
> {code}
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED count;
> Function: count
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
> Usage: count(*) - Returns the total number of retrieved rows, including rows 
> containing NULL values.
> count(expr) - Returns the number of rows for which the supplied 
> expression is non-NULL.
> count(DISTINCT expr[, expr...]) - Returns the number of rows for which 
> the supplied expression(s) are unique and non-NULL.
> Extended Usage:
> No example for count.
> {code}
> whereas some do have a pretty one
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
> Function: percentile_approx
> Class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> Usage:
>   percentile_approx(col, percentage [, accuracy]) - Returns the 
> approximate percentile value of numeric
>   column `col` at the given percentage. The value of percentage must be 
> between 0.0
>   and 1.0. The `accuracy` parameter (default: 1) is a positive 
> integer literal which
>   controls approximation accuracy at the cost of memory. Higher value of 
> `accuracy` yields
>   better accuracy, `1.0/accuracy` is the relative error of the 
> approximation.
>   percentile_approx(col, array(percentage1 [, percentage2]...) [, 
> accuracy]) - Returns the approximate
>   percentile array of column `col` at the given percentage array. Each 
> value of the
>   percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
> (default: 1) is
>a positive integer literal which controls approximation accuracy at 
> the cost of memory.
>Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is 
> the relative error of
>the approximation.
> Extended Usage:
> No example for percentile_approx.
> {code}
> Also, there is some inconsistent spacing, for example, 
> {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the space between the arguments).
> It'd be nicer if most of them have a good example with possible argument 
> types.
> Suggested format is as below for multiple line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED rand;
> Function: rand
> Class: org.apache.spark.sql.catalyst.expressions.Rand
> Usage:
>   rand() - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed is given randomly.
>   rand(seed) - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed should be an integer/long/NULL literal.
> Extended Usage:
> > SELECT rand();
>  0.9629742951434543
> > SELECT rand(0);
>  0.8446490682263027
> > SELECT rand(NULL);
>  0.8446490682263027
> {code}
> For single line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
> Function: date_add
> Class: org.apache.spark.sql.catalyst.expressions.DateAdd
> Usage: date_add(start_date, num_days) - Returns the date that is num_days 
> after start_date.
> Extended Usage:
> > SELECT date_add('2016-07-30', 1);
>  '2016-07-31'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-11-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17963:

Assignee: Hyukjin Kwon

> Add examples (extend) in each function and improve documentation with 
> arguments
> ---
>
> Key: SPARK-17963
> URL: https://issues.apache.org/jira/browse/SPARK-17963
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, it seems function documentation is inconsistent and does not have 
> many examples ({{extended}}).
> For example, some functions have a bad indentation as below:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct;
> Function: approx_count_distinct
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
> Usage: approx_count_distinct(expr) - Returns the estimated cardinality by 
> HyperLogLog++.
> approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated 
> cardinality by HyperLogLog++
>   with relativeSD, the maximum estimation error allowed.
> Extended Usage:
> No example for approx_count_distinct.
> {code}
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED count;
> Function: count
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
> Usage: count(*) - Returns the total number of retrieved rows, including rows 
> containing NULL values.
> count(expr) - Returns the number of rows for which the supplied 
> expression is non-NULL.
> count(DISTINCT expr[, expr...]) - Returns the number of rows for which 
> the supplied expression(s) are unique and non-NULL.
> Extended Usage:
> No example for count.
> {code}
> whereas some do have a pretty one
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
> Function: percentile_approx
> Class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> Usage:
>   percentile_approx(col, percentage [, accuracy]) - Returns the 
> approximate percentile value of numeric
>   column `col` at the given percentage. The value of percentage must be 
> between 0.0
>   and 1.0. The `accuracy` parameter (default: 1) is a positive 
> integer literal which
>   controls approximation accuracy at the cost of memory. Higher value of 
> `accuracy` yields
>   better accuracy, `1.0/accuracy` is the relative error of the 
> approximation.
>   percentile_approx(col, array(percentage1 [, percentage2]...) [, 
> accuracy]) - Returns the approximate
>   percentile array of column `col` at the given percentage array. Each 
> value of the
>   percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
> (default: 1) is
>a positive integer literal which controls approximation accuracy at 
> the cost of memory.
>Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is 
> the relative error of
>the approximation.
> Extended Usage:
> No example for percentile_approx.
> {code}
> Also, there is some inconsistent spacing, for example, 
> {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the space between the arguments).
> It'd be nicer if most of them have a good example with possible argument 
> types.
> Suggested format is as below for multiple line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED rand;
> Function: rand
> Class: org.apache.spark.sql.catalyst.expressions.Rand
> Usage:
>   rand() - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed is given randomly.
>   rand(seed) - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed should be an integer/long/NULL literal.
> Extended Usage:
> > SELECT rand();
>  0.9629742951434543
> > SELECT rand(0);
>  0.8446490682263027
> > SELECT rand(NULL);
>  0.8446490682263027
> {code}
> For single line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
> Function: date_add
> Class: org.apache.spark.sql.catalyst.expressions.DateAdd
> Usage: date_add(start_date, num_days) - Returns the date that is num_days 
> after start_date.
> Extended Usage:
> > SELECT date_add('2016-07-30', 1);
>  '2016-07-31'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-11-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17963.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15677
[https://github.com/apache/spark/pull/15677]

> Add examples (extend) in each function and improve documentation with 
> arguments
> ---
>
> Key: SPARK-17963
> URL: https://issues.apache.org/jira/browse/SPARK-17963
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, it seems function documentation is inconsistent and does not have 
> many examples ({{extended}}).
> For example, some functions have a bad indentation as below:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct;
> Function: approx_count_distinct
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
> Usage: approx_count_distinct(expr) - Returns the estimated cardinality by 
> HyperLogLog++.
> approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated 
> cardinality by HyperLogLog++
>   with relativeSD, the maximum estimation error allowed.
> Extended Usage:
> No example for approx_count_distinct.
> {code}
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED count;
> Function: count
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
> Usage: count(*) - Returns the total number of retrieved rows, including rows 
> containing NULL values.
> count(expr) - Returns the number of rows for which the supplied 
> expression is non-NULL.
> count(DISTINCT expr[, expr...]) - Returns the number of rows for which 
> the supplied expression(s) are unique and non-NULL.
> Extended Usage:
> No example for count.
> {code}
> whereas some do have a pretty one
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
> Function: percentile_approx
> Class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> Usage:
>   percentile_approx(col, percentage [, accuracy]) - Returns the 
> approximate percentile value of numeric
>   column `col` at the given percentage. The value of percentage must be 
> between 0.0
>   and 1.0. The `accuracy` parameter (default: 1) is a positive 
> integer literal which
>   controls approximation accuracy at the cost of memory. Higher value of 
> `accuracy` yields
>   better accuracy, `1.0/accuracy` is the relative error of the 
> approximation.
>   percentile_approx(col, array(percentage1 [, percentage2]...) [, 
> accuracy]) - Returns the approximate
>   percentile array of column `col` at the given percentage array. Each 
> value of the
>   percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
> (default: 1) is
>a positive integer literal which controls approximation accuracy at 
> the cost of memory.
>Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is 
> the relative error of
>the approximation.
> Extended Usage:
> No example for percentile_approx.
> {code}
> Also, there is some inconsistent spacing, for example, 
> {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the space between the arguments).
> It'd be nicer if most of them have a good example with possible argument 
> types.
> Suggested format is as below for multiple line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED rand;
> Function: rand
> Class: org.apache.spark.sql.catalyst.expressions.Rand
> Usage:
>   rand() - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed is given randomly.
>   rand(seed) - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed should be an integer/long/NULL literal.
> Extended Usage:
> > SELECT rand();
>  0.9629742951434543
> > SELECT rand(0);
>  0.8446490682263027
> > SELECT rand(NULL);
>  0.8446490682263027
> {code}
> For single line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
> Function: date_add
> Class: org.apache.spark.sql.catalyst.expressions.DateAdd
> Usage: date_add(start_date, num_days) - Returns the date that is num_days 
> after start_date.
> Extended Usage:
> > SELECT date_add('2016-07-30', 1);
>  '2016-07-31'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17459) Add Linear Discriminant to dimensionality reduction algorithms

2016-11-02 Thread Joshua Howard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631369#comment-15631369
 ] 

Joshua Howard commented on SPARK-17459:
---

Got it. Won't touch the fields. 

Whenever a committer can get around to it I'd like to participate in bringing 
it in. 

> Add Linear Discriminant to dimensionality reduction algorithms
> --
>
> Key: SPARK-17459
> URL: https://issues.apache.org/jira/browse/SPARK-17459
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joshua Howard
>Priority: Minor
>
> The goal is to add linear discriminant analysis as a method of dimensionality 
> reduction. The algorithm and code are very similar to PCA, but instead 
> project the data set onto vectors that provide class separation. LDA is a 
> more effective alternative to PCA in terms of preprocessing for 
> classification algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14319) Speed up group-by aggregates

2016-11-02 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-14319:
---
Fix Version/s: 2.0.0

> Speed up group-by aggregates
> 
>
> Key: SPARK-14319
> URL: https://issues.apache.org/jira/browse/SPARK-14319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
> Fix For: 2.0.0
>
>
> Aggregates with key in SparkSQL are almost 30x slower than aggregates 
> without key. This master JIRA tracks our attempts to optimize them.
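
For clarity, the two cases being compared (a sketch; `df` stands for any 
DataFrame with a grouping column "k" and a numeric column "v"):

{code}
import org.apache.spark.sql.functions.sum

// Aggregate with key (group-by): the slow case this JIRA targets.
df.groupBy("k").agg(sum("v")).collect()

// Aggregate without key (global aggregate): the fast baseline.
df.agg(sum("v")).collect()
{code}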



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14319) Speed up group-by aggregates

2016-11-02 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal resolved SPARK-14319.

Resolution: Done

> Speed up group-by aggregates
> 
>
> Key: SPARK-14319
> URL: https://issues.apache.org/jira/browse/SPARK-14319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>
> Aggregates with key in SparkSQL are almost 30x slower than aggregates 
> without key. This master JIRA tracks our attempts to optimize them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18207:


Assignee: (was: Apache Spark)

> class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-18207
> URL: https://issues.apache.org/jira/browse/SPARK-18207
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Don Drake
> Attachments: spark-18207.txt
>
>
> I have 2 wide dataframes that contain nested data structures. When I explode 
> one of the dataframes, it doesn't include records with an empty nested 
> structure (outer explode is not supported).  So, I create a similar dataframe 
> with null values and union them together.  See SPARK-13721 for more details 
> as to why I have to do this.
> I was hoping that SPARK-16845 was going to address my issue, but it does not. 
>  I was asked by [~lwlin] to open this JIRA.  
> I will attach a code snippet that can be pasted into spark-shell that 
> duplicates my code and the exception.  This worked just fine in Spark 1.6.x.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 
> (TID 812, somehost.mydomain.com, executor 8): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> {code}
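
A sketch of the pattern being described (not the attached reproduction; `df` 
stands for a DataFrame with an array-of-struct column named "items"): explode 
drops the rows whose array is empty, so those rows are added back with a typed 
null and unioned in.

{code}
import org.apache.spark.sql.functions.{col, explode, lit, size}

// Explode the array column; rows with an empty array disappear here.
val exploded = df.withColumn("item", explode(col("items"))).drop("items")

// Re-create the dropped rows with a null "item" of the matching type and
// union them back in, which is the workaround the report describes.
val itemType = exploded.schema("item").dataType
val withEmpties = df
  .filter(size(col("items")) === 0)
  .withColumn("item", lit(null).cast(itemType))
  .drop("items")

val combined = exploded.union(withEmpties)
{code}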



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18207:


Assignee: Apache Spark

> class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-18207
> URL: https://issues.apache.org/jira/browse/SPARK-18207
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Don Drake
>Assignee: Apache Spark
> Attachments: spark-18207.txt
>
>
> I have 2 wide dataframes that contain nested data structures. When I explode 
> one of the dataframes, it doesn't include records with an empty nested 
> structure (outer explode is not supported), so I create a similar dataframe 
> with null values and union them together. See SPARK-13721 for more details 
> as to why I have to do this.
> I was hoping that SPARK-16845 would address my issue, but it does not. 
> I was asked by [~lwlin] to open this JIRA.
> I will attach a code snippet that can be pasted into spark-shell and that 
> reproduces my code and the exception. This worked just fine in Spark 1.6.x.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 
> (TID 812, somehost.mydomain.com, executor 8): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18207) class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2016-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631321#comment-15631321
 ] 

Apache Spark commented on SPARK-18207:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/15745

> class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> 
>
> Key: SPARK-18207
> URL: https://issues.apache.org/jira/browse/SPARK-18207
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Don Drake
> Attachments: spark-18207.txt
>
>
> I have 2 wide dataframes that contain nested data structures. When I explode 
> one of the dataframes, it doesn't include records with an empty nested 
> structure (outer explode is not supported), so I create a similar dataframe 
> with null values and union them together. See SPARK-13721 for more details 
> as to why I have to do this.
> I was hoping that SPARK-16845 would address my issue, but it does not. 
> I was asked by [~lwlin] to open this JIRA.
> I will attach a code snippet that can be pasted into spark-shell and that 
> reproduces my code and the exception. This worked just fine in Spark 1.6.x.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in 
> stage 5.0 failed 4 times, most recent failure: Lost task 35.3 in stage 5.0 
> (TID 812, somehost.mydomain.com, executor 8): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18237) hive.exec.stagingdir have no effect in spark2.0.1

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18237:


Assignee: (was: Apache Spark)

> hive.exec.stagingdir have no effect in spark2.0.1
> -
>
> Key: SPARK-18237
> URL: https://issues.apache.org/jira/browse/SPARK-18237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: ClassNotFoundExp
>
> hive.exec.stagingdir has no effect in Spark 2.0.1; this is relevant to 
> https://issues.apache.org/jira/browse/SPARK-11021
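A hypothetical sketch of how one would expect to override the staging directory (the path and table names below are placeholders; whether Spark 2.0.1 actually honors the setting is exactly what this issue reports):

{code}
import org.apache.spark.sql.SparkSession

// Illustrative configuration only; the path and table names are placeholders.
val spark = SparkSession.builder()
  .appName("staging-dir-check")
  .config("hive.exec.stagingdir", "/tmp/custom-hive-staging")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("INSERT OVERWRITE TABLE target_tbl SELECT * FROM source_tbl")
{code}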



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18237) hive.exec.stagingdir have no effect in spark2.0.1

2016-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631313#comment-15631313
 ] 

Apache Spark commented on SPARK-18237:
--

User 'ClassNotFoundExp' has created a pull request for this issue:
https://github.com/apache/spark/pull/15744

> hive.exec.stagingdir have no effect in spark2.0.1
> -
>
> Key: SPARK-18237
> URL: https://issues.apache.org/jira/browse/SPARK-18237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: ClassNotFoundExp
>
> hive.exec.stagingdir has no effect in Spark 2.0.1; this is relevant to 
> https://issues.apache.org/jira/browse/SPARK-11021



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18237) hive.exec.stagingdir have no effect in spark2.0.1

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18237:


Assignee: Apache Spark

> hive.exec.stagingdir have no effect in spark2.0.1
> -
>
> Key: SPARK-18237
> URL: https://issues.apache.org/jira/browse/SPARK-18237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: ClassNotFoundExp
>Assignee: Apache Spark
>
> hive.exec.stagingdir has no effect in Spark 2.0.1; this is relevant to 
> https://issues.apache.org/jira/browse/SPARK-11021



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread SathyaNarayanan Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631204#comment-15631204
 ] 

SathyaNarayanan Srinivasan edited comment on SPARK-18200 at 11/3/16 2:31 AM:
-

Thank you, Dongjoon Hyun and Denny Lee, for giving serious consideration 
to my question on Stack Overflow 
(http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity).
I am in the process of implementing and testing the proposed solution, and the 
reported answers work fine. Thanks.


was (Author: sathyasrini):
Thank you  Dongjoon Hyun and  Denny Lee for taking a very serious consideration 
to my question in Stack Overflow. I am in the process of implementing and 
testing the proposed solution and the reported answers work fine. Thanks

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on a large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0.0, 
> 2.0.1, and 2.0.2). You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6, the query completes without any 
> problems: http://bit.ly/2fATO1M
> The GraphFrames version of this code also runs fine (Spark 2.0, GraphFrames 
> 0.2): http://bit.ly/2fAS8W8
> Reference Stack Overflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)
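A minimal sketch (hypothetical edge-list path, assuming a spark-shell `sc`) of the call pattern that hits the requirement failure on Spark 2.0.x:

{code}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

// Illustrative only: the edge-list path is a placeholder.
val graph = GraphLoader
  .edgeListFile(sc, "/path/to/edges.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val triangleCounts = graph.triangleCount().vertices
triangleCounts.take(5).foreach(println)
{code}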



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631292#comment-15631292
 ] 

Liang-Chi Hsieh edited comment on SPARK-18209 at 11/3/16 2:30 AM:
--

I think the disallowed one is to use CTEs in a subquery. The parser rule shows 
that:

{code}
queryPrimary
: querySpecification  #queryPrimaryDefault
| TABLE tableIdentifier  #table
| inlineTable   #inlineTableDefault1
| '(' queryNoWith  ')'   #subquery
;
{code}

You can see a subquery is a queryNoWith.



was (Author: viirya):
I think the disallowed one is to use CTEs in a subquery. The parser rule shows 
that:

{code}
queryPrimary
: querySpecification  #queryPrimaryDefault
| TABLE tableIdentifier  #table
| inlineTable   #inlineTableDefault1
| '(' queryNoWith  ')'   #subquery
;
{code}


> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combination of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason 
> broadcast join hint has taken forever to be merged because it is very 
> difficult to guarantee correctness.
> Given that the two primary reasons for view canonicalization are to provide the 
> database context and to handle star expansion, I think we can do this 
> through a simpler approach: take the user-given SQL, analyze it, 
> wrap the original SQL in an outer SELECT clause, and store the 
> database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parse time, we expand the view using the provided database 
> context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating 
> the high level approach here.)
> Note that there is a chance that the underlying base table(s)' schema change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631292#comment-15631292
 ] 

Liang-Chi Hsieh edited comment on SPARK-18209 at 11/3/16 2:29 AM:
--

I think the disallowed one is to use CTEs in a subquery. The parser rule shows 
that:

{code}
queryPrimary
: querySpecification  #queryPrimaryDefault
| TABLE tableIdentifier  #table
| inlineTable   #inlineTableDefault1
| '(' queryNoWith  ')'   #subquery
;
{code}



was (Author: viirya):
I think the disallowed one is to use CTEs in a subquery. The parser rule shows 
that:

{code}
queryPrimary
: querySpecification
#queryPrimaryDefault
| TABLE tableIdentifier 
#table
| inlineTable   
#inlineTableDefault1
| '(' queryNoWith  ')'  
#subquery
;
{code}


> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combination of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason 
> broadcast join hint has taken forever to be merged because it is very 
> difficult to guarantee correctness.
> Given that the two primary reasons for view canonicalization are to provide the 
> database context and to handle star expansion, I think we can do this 
> through a simpler approach: take the user-given SQL, analyze it, 
> wrap the original SQL in an outer SELECT clause, and store the 
> database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parse time, we expand the view using the provided database 
> context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating 
> the high level approach here.)
> Note that there is a chance that the underlying base table(s)' schema change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631292#comment-15631292
 ] 

Liang-Chi Hsieh commented on SPARK-18209:
-

I think the disallowed one is to use CTEs in a subquery. The parser rule shows 
that:

{code}
queryPrimary
: querySpecification
#queryPrimaryDefault
| TABLE tableIdentifier 
#table
| inlineTable   
#inlineTableDefault1
| '(' queryNoWith  ')'  
#subquery
;
{code}
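To make the point concrete, a hypothetical illustration (table name made up) of the query shape this grammar rejects: a WITH clause nested inside a parenthesized subquery, which the #subquery alternative (a queryNoWith) does not admit, while a top-level WITH parses fine.

{code}
// Expected to fail at parse time under the rule quoted above; illustrative only.
spark.sql("""
  SELECT * FROM (
    WITH t AS (SELECT 1 AS id)
    SELECT * FROM t
  ) sub
""")
{code}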


> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combination of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason 
> broadcast join hint has taken forever to be merged because it is very 
> difficult to guarantee correctness.
> Given that the two primary reasons for view canonicalization are to provide the 
> database context and to handle star expansion, I think we can do this 
> through a simpler approach: take the user-given SQL, analyze it, 
> wrap the original SQL in an outer SELECT clause, and store the 
> database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parse time, we expand the view using the provided database 
> context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating 
> the high level approach here.)
> Note that there is a chance that the underlying base table(s)' schema change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17055) add groupKFold to CrossValidator

2016-11-02 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631286#comment-15631286
 ] 

Vincent commented on SPARK-17055:
-

[~srowen] No offense. Maybe we can invite more people to have a look at this 
issue? I saw that [~mengxr], [~josephkb] and [~sethah] did work similar to this 
one. What do you think of this one? Do you all agree that we should drop it?

P.S. Leaving the coding aside for now, it probably needs more work to make a 
complete PR. :)

Thanks.

> add groupKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Vincent
>Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the 
> samples into k groups. But in cases where data is gathered from 
> different subjects and we want to avoid over-fitting, we want to hold out 
> samples with certain labels from the training data and put them into the validation 
> fold, i.e. we want to ensure that the same label is not in both the testing and 
> training sets.
> Mainstream packages like scikit-learn already support such a cross-validation 
> method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)
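A minimal sketch of the requested behaviour (not the proposed CrossValidator API; `df` and its `subject` column are hypothetical): assign each group to one of k folds so that no group appears in both the training and the validation split.

{code}
import org.apache.spark.sql.functions._

// Illustrative group-aware fold assignment; df and "subject" are placeholders.
val k = 3
val withFold   = df.withColumn("fold", pmod(hash(col("subject")), lit(k)))
val validation = withFold.filter(col("fold") === 0)
val training   = withFold.filter(col("fold") =!= 0)
{code}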



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18238) WARN Executor: 1 block locks were not released by TID

2016-11-02 Thread Harish (JIRA)
Harish created SPARK-18238:
--

 Summary: WARN Executor: 1 block locks were not released by TID
 Key: SPARK-18238
 URL: https://issues.apache.org/jira/browse/SPARK-18238
 Project: Spark
  Issue Type: Bug
 Environment: 2.0.2 snapshot
Reporter: Harish
Priority: Minor


In Spark 2.0.2 / Hadoop 2.7, I am getting the messages below. I am not sure whether 
this is impacting my execution.

16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30541:
[rdd_511_104]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30542:
[rdd_511_105]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30562:
[rdd_511_127]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30571:
[rdd_511_137]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30572:
[rdd_511_138]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30588:
[rdd_511_156]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30603:
[rdd_511_171]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30600:
[rdd_511_168]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30612:
[rdd_511_180]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30622:
[rdd_511_190]
16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 30629:
[rdd_511_197]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631251#comment-15631251
 ] 

Dongjoon Hyun commented on SPARK-18200:
---

Good for you. :)

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on a large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0.0, 
> 2.0.1, and 2.0.2). You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6, the query completes without any 
> problems: http://bit.ly/2fATO1M
> The GraphFrames version of this code also runs fine (Spark 2.0, GraphFrames 
> 0.2): http://bit.ly/2fAS8W8
> Reference Stack Overflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec

2016-11-02 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631238#comment-15631238
 ] 

Xiao Li commented on SPARK-14922:
-

Sure, thanks!

> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q
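Until predicate-based specs are supported, a hedged workaround sketch (using the table and columns from the example above): enumerate the partitions and drop the matching ones one at a time with equality specs.

{code}
// Workaround sketch only; assumes SHOW PARTITIONS output of the form "c=US/d=1".
val matching = spark.sql("SHOW PARTITIONS ptestfilter")
  .collect()
  .map(_.getString(0))
  .filter { p =>
    val kvs = p.split("/").map(_.split("=", 2)).map(a => a(0) -> a(1)).toMap
    kvs.get("c").contains("US") && kvs.get("d").exists(_ < "2")
  }

matching.foreach { p =>
  val spec = p.split("/").map { kv =>
    val Array(k, v) = kv.split("=", 2); s"$k='$v'"
  }.mkString(", ")
  spark.sql(s"ALTER TABLE ptestfilter DROP PARTITION ($spec)")
}
{code}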



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec

2016-11-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-14922.
---
Resolution: Duplicate

> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18237) hive.exec.stagingdir have no effect in spark2.0.1

2016-11-02 Thread ClassNotFoundExp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ClassNotFoundExp updated SPARK-18237:
-
Target Version/s:   (was: 2.0.1)
   Fix Version/s: (was: 2.0.1)

> hive.exec.stagingdir have no effect in spark2.0.1
> -
>
> Key: SPARK-18237
> URL: https://issues.apache.org/jira/browse/SPARK-18237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: ClassNotFoundExp
>
> hive.exec.stagingdir has no effect in Spark 2.0.1; this is relevant to 
> https://issues.apache.org/jira/browse/SPARK-11021



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18237) hive.exec.stagingdir have no effect in spark2.0.1

2016-11-02 Thread ClassNotFoundExp (JIRA)
ClassNotFoundExp created SPARK-18237:


 Summary: hive.exec.stagingdir have no effect in spark2.0.1
 Key: SPARK-18237
 URL: https://issues.apache.org/jira/browse/SPARK-18237
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: ClassNotFoundExp
 Fix For: 2.0.1


hive.exec.stagingdir has no effect in Spark 2.0.1; this is relevant to 
https://issues.apache.org/jira/browse/SPARK-11021



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17727) PySpark SQL arrays are not immutable, .remove and .pop cause issues

2016-11-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631228#comment-15631228
 ] 

Hyukjin Kwon commented on SPARK-17727:
--

Hi [~srowen], should this JIRA be resolved with an action?

> PySpark SQL arrays are not immutable, .remove and .pop cause issues
> ---
>
> Key: SPARK-17727
> URL: https://issues.apache.org/jira/browse/SPARK-17727
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
> Environment: OS X and Linux (Amazon Linux AMI release 2016.03), 
> Python 2.x
>Reporter: Ganesh Sivalingam
>
> When one column of a DataFrame is an array, for example:
> {code}
> +---+---+-+
> |join_on|  a|b|
> +---+---+-+
> |  1|  1|[1, 2, 3]|
> |  1|  2|[1, 2, 3]|
> |  1|  3|[1, 2, 3]|
> +---+---+-+
> {code}
> If I try to remove the value in column a from the array in column b using 
> Python's `list.remove(val)` function, it works; however, after running a 
> second manipulation of the dataframe it fails with an error saying that the 
> item (the value in column a) is not present.
> So PySpark is re-running `list.remove()`, but on the already altered 
> list/array.
> Below is a minimal example, which I think should work, but which exhibits this 
> issue:
> {code:python}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> import numpy as np
> cols = ['join_on', 'a']
> vals = [
>     (1, 1),
>     (1, 2),
>     (1, 3)
> ]
> df = sqlContext.createDataFrame(vals, cols)
> df_of_arrays = df\
>     .groupBy('join_on')\
>     .agg(F.collect_list('a').alias('b'))
> df = df\
>     .join(df_of_arrays, on='join_on')
> df.show()
> def rm_element(a, list_a):
>     list_a.remove(a)
>     return list_a
> rm_element_udf = F.udf(rm_element, T.ArrayType(T.LongType()))
> df = df.withColumn('one_removed', rm_element_udf("a", "b"))
> df.show()
> answer = df.withColumn('av', F.udf(lambda a: float(np.mean(a)))('one_removed'))
> answer.show()
> {code}
> This can then be fixed by changing the rm_element function to:
> {code:python}
> import copy  # needed for deepcopy below
> def rm_element(a, list_a):
>     tmp_list_a = copy.deepcopy(list_a)
>     tmp_list_a.remove(a)
>     return tmp_list_a
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5151) Parquet Predicate Pushdown Does Not Work with Nested Structures.

2016-11-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-5151.
-
Resolution: Duplicate

I am going to resolve this as a duplicate. The reason is simply that 
SPARK-17636 is more active and not stale.

Please revoke my action if anyone feels this is inappropriate.

> Parquet Predicate Pushdown Does Not Work with Nested Structures.
> 
>
> Key: SPARK-5151
> URL: https://issues.apache.org/jira/browse/SPARK-5151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: pyspark, spark-ec2 created cluster
>Reporter: Brad Willard
>  Labels: parquet, pyspark, sql
>
> I have json files of objects created with a nested structure roughly of the 
> form:
> { id: 123, event: "login", meta_data: {"user": "user1"}}
> 
> { id: 125, event: "login", meta_data: {"user": "user2"}}
> I load the data via spark with
> rdd = sql_context.jsonFile()
> # save it as a parquet file
> rdd.saveAsParquetFile()
> rdd = sql_context.parquetFile()
> rdd.registerTempTable('events')
> So if I run this query, it works without issue when predicate pushdown is 
> disabled:
> select count(1) from events where meta_data.user = "user1"
> If I enable predicate pushdown, I get an error saying meta_data.user is not in 
> the schema:
> Py4JJavaError: An error occurred while calling o218.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 125 
> in stage 12.0 failed 4 times, most recent failure: Lost task 125.3 in stage 
> 12.0 (TID 6164, ): java.lang.IllegalArgumentException: Column [user] was not 
> found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> .
> I expect this is actually related to another bug I filed where nested 
> structure is not preserved with spark sql.
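The report already notes that the query works with pushdown disabled; a hedged workaround sketch (Spark 2.x SparkSession API, table name from the report above):

{code}
// Turn off Parquet filter pushdown so the nested-column predicate is
// evaluated by Spark rather than pushed into the Parquet reader.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.sql("SELECT count(1) FROM events WHERE meta_data.user = 'user1'").show()
{code}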



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread SathyaNarayanan Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631204#comment-15631204
 ] 

SathyaNarayanan Srinivasan commented on SPARK-18200:


Thank you, Dongjoon Hyun and Denny Lee, for giving serious consideration 
to my question on Stack Overflow. I am in the process of implementing and 
testing the proposed solution, and the reported answers work fine. Thanks.

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on a large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0.0, 
> 2.0.1, and 2.0.2). You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6, the query completes without any 
> problems: http://bit.ly/2fATO1M
> The GraphFrames version of this code also runs fine (Spark 2.0, GraphFrames 
> 0.2): http://bit.ly/2fAS8W8
> Reference Stack Overflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17049) LAG function fails when selecting all columns

2016-11-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17049.
--
Resolution: Cannot Reproduce

I am resolving this JIRA as Cannot Reproduce, as described in the wiki: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

It seems I can't reproduce this either.

{code}
spark-sql> create table a as select 1 as col;
Time taken: 3.24 seconds
spark-sql> select *, lag(col) over (order by col) as prev from a;
1   NULL
{code}


> LAG function fails when selecting all columns
> -
>
> Key: SPARK-17049
> URL: https://issues.apache.org/jira/browse/SPARK-17049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Gokhan Civan
>
> In version 1.6.1, the queries
> create table a as select 1 as col;
> select *, lag(col) over (order by col) as prev from a;
> successfully produce the table
> col  prev
> 1    null
> However, in version 2.0.0, this fails with the error
> org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED 
> PRECEDING AND CURRENT ROW must match the required frame ROWS BETWEEN 1 
> PRECEDING AND 1 PRECEDING;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1785)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$29$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:1781)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:170)
> ...
> On the other hand, the query works if * is replaced with col as in
> select col, lag(col) over (order by col) as prev from a;
> It also works as follows:
> select col, lag(col) over (order by col ROWS BETWEEN 1 PRECEDING AND 1 
> PRECEDING) as prev from a;
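A hedged sketch of that same workaround expressed with the DataFrame API (table a and column col as in the report; assuming a Spark 2.x `spark` session): give lag an explicit ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING frame.

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Explicit frame matching lag's required frame, mirroring the SQL workaround.
val w = Window.orderBy("col").rowsBetween(-1, -1)
spark.table("a").select(col("*"), lag("col", 1).over(w).as("prev")).show()
{code}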



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16210) DataFrame.drop(colName) fails if another column has a period in its name

2016-11-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16210.
--
Resolution: Cannot Reproduce

I can't reproduce this either

{code}
scala> val rdd = sc.makeRDD("""{"x.y": 5, "abc": 10}""" :: Nil)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at makeRDD at 
:24

scala> spark.read.json(rdd).drop("abc")
res6: org.apache.spark.sql.DataFrame = [x.y: bigint]
{code}

and I am going to resolve this JIRA as it seems to fall under this guidance:

{quote}
For issues that can't be reproduced against master as reported, resolve as 
Cannot Reproduce
{quote}

from the wiki: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> DataFrame.drop(colName) fails if another column has a period in its name
> 
>
> Key: SPARK-16210
> URL: https://issues.apache.org/jira/browse/SPARK-16210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Spark 1.6.1 on Databricks
>Reporter: Simeon Simeonov
>  Labels: dataframe, sql
>
> The following code fails with {{org.apache.spark.sql.AnalysisException: 
> cannot resolve 'x.y' given input columns: [abc, x.y]}} because of the way 
> {{drop()}} uses {{select()}} under the covers.
> {code}
> val rdd = sc.makeRDD("""{"x.y": 5, "abc": 10}""" :: Nil)
> sqlContext.read.json(rdd).drop("abc")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15174) DataFrame does not have correct number of rows after dropDuplicates

2016-11-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon closed SPARK-15174.

Resolution: Cannot Reproduce

I can't reproduce this in the current master. So, I am going to mark this as 
Cannot Reproduce. Please revoke my action if this is inappropriate.

{code}
scala> val df1 = spark.read.json(input)
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON at 
empty. It must be specified manually;
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:438)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:438)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:437)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:297)
  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:250)
  ... 48 elided

scala> val df2 = spark.read.json(input).dropDuplicates
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON at 
empty. It must be specified manually;
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:438)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$17.apply(DataSource.scala:438)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:437)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:297)
  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:250)
  ... 48 elided
{code}

> DataFrame does not have correct number of rows after dropDuplicates
> ---
>
> Key: SPARK-15174
> URL: https://issues.apache.org/jira/browse/SPARK-15174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Ian Hellstrom
>
> If you read an empty file/folder with the {{SQLContext.read()}} function and 
> call {{DataFrame.dropDuplicates()}}, the number of rows is incorrect.
> {code}
> val input = "hdfs:///some/empty/directory"
> val df1 = sqlContext.read.json(input)
> val df2 = sqlContext.read.json(input).dropDuplicates
> df1.count == 0 // true
> df1.rdd.isEmpty // true
> df2.count == 0 // false: it's actually reported as 1
> df2.rdd.isEmpty // false
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18236) Reduce memory usage of Spark UI and HistoryServer by reducing duplicate objects

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18236:


Assignee: Josh Rosen  (was: Apache Spark)

> Reduce memory usage of Spark UI and HistoryServer by reducing duplicate 
> objects
> ---
>
> Key: SPARK-18236
> URL: https://issues.apache.org/jira/browse/SPARK-18236
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> When profiling heap dumps from the Spark History Server and live Spark web 
> UIs, I found a tremendous amount of memory being wasted on duplicate objects 
> and strings. A few small changes can cut per-task UI memory by half or more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18222) Use math instead of Math

2016-11-02 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng closed SPARK-18222.

Resolution: Not A Problem

> Use math instead of Math
> 
>
> Key: SPARK-18222
> URL: https://issues.apache.org/jira/browse/SPARK-18222
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL
>Reporter: zhengruifeng
>Priority: Minor
>
> Unify math/Math usage: use {{scala.math}} instead of {{java.lang.Math}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18236) Reduce memory usage of Spark UI and HistoryServer by reducing duplicate objects

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18236:


Assignee: Apache Spark  (was: Josh Rosen)

> Reduce memory usage of Spark UI and HistoryServer by reducing duplicate 
> objects
> ---
>
> Key: SPARK-18236
> URL: https://issues.apache.org/jira/browse/SPARK-18236
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> When profiling heap dumps from the Spark History Server and live Spark web 
> UIs, I found a tremendous amount of memory being wasted on duplicate objects 
> and strings. A few small changes can cut per-task UI memory by half or more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18236) Reduce memory usage of Spark UI and HistoryServer by reducing duplicate objects

2016-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631147#comment-15631147
 ] 

Apache Spark commented on SPARK-18236:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15743

> Reduce memory usage of Spark UI and HistoryServer by reducing duplicate 
> objects
> ---
>
> Key: SPARK-18236
> URL: https://issues.apache.org/jira/browse/SPARK-18236
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> When profiling heap dumps from the Spark History Server and live Spark web 
> UIs, I found a tremendous amount of memory being wasted on duplicate objects 
> and strings. A few small changes can cut per-task UI memory by half or more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec

2016-11-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631137#comment-15631137
 ] 

Hyukjin Kwon commented on SPARK-14922:
--

Hi [~smilegator], is this supposed to be closed as a duplicate?

> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14840) Cannot drop a table which has the name starting with 'or'

2016-11-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631134#comment-15631134
 ] 

Hyukjin Kwon commented on SPARK-14840:
--

Please revoke my action if this is inappropriate.

> Cannot drop a table which has the name starting with 'or'
> -
>
> Key: SPARK-14840
> URL: https://issues.apache.org/jira/browse/SPARK-14840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Kwangwoo Kim
>
> sqlContext("drop table tmp.order")  
> The above code produces the following error: 
> 6/04/22 14:27:17 INFO ParseDriver: Parsing command: drop table tmp.order
> 16/04/22 14:27:19 INFO ParseDriver: Parse Completed
> 16/04/22 14:27:19 WARN DropTable: [1.5] failure: identifier expected
> tmp.order
> ^
> java.lang.RuntimeException: [1.5] failure: identifier expected
> tmp.order
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:62)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>   at 
> $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC.(:37)
>   at $line15.$read$$iwC$$iwC$$iwC.(:39)
>   at $line15.$read$$iwC$$iwC.(:41)
>   at $line15.$read$$iwC.(:43)
>   at $line15.$read.(:45)
>   at $line15.$read$.(:49)
>   at $line15.$read$.()
>   at $line15.$eval$.(:7)
>   at $line15.$eval$.()
>   at $line15.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>   at org.apa

[jira] [Resolved] (SPARK-14840) Cannot drop a table which has the name starting with 'or'

2016-11-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-14840.
--
Resolution: Duplicate

I can't reproduce this against the current master.

{code}
scala> sql("create table order(a int )")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("drop table order")
res1: org.apache.spark.sql.DataFrame = []
{code}

and it seems to be a subset of SPARK-14762.

> Cannot drop a table which has the name starting with 'or'
> -
>
> Key: SPARK-14840
> URL: https://issues.apache.org/jira/browse/SPARK-14840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Kwangwoo Kim
>
> sqlContext("drop table tmp.order")  
> The above code produces the following error: 
> 6/04/22 14:27:17 INFO ParseDriver: Parsing command: drop table tmp.order
> 16/04/22 14:27:19 INFO ParseDriver: Parse Completed
> 16/04/22 14:27:19 WARN DropTable: [1.5] failure: identifier expected
> tmp.order
> ^
> java.lang.RuntimeException: [1.5] failure: identifier expected
> tmp.order
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:62)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>   at 
> $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
>   at $line15.$read$$iwC$$iwC$$iwC$$iwC.(:37)
>   at $line15.$read$$iwC$$iwC$$iwC.(:39)
>   at $line15.$read$$iwC$$iwC.(:41)
>   at $line15.$read$$iwC.(:43)
>   at $line15.$read.(:45)
>   at $line15.$read$.(:49)
>   at $line15.$read$.()
>   at $line15.$eval$.(:7)
>   at $line15.$eval$.()
>   at $line15.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> scala.tools.nsc.util.Scala

[jira] [Created] (SPARK-18236) Reduce memory usage of Spark UI and HistoryServer by reducing duplicate objects

2016-11-02 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-18236:
--

 Summary: Reduce memory usage of Spark UI and HistoryServer by 
reducing duplicate objects
 Key: SPARK-18236
 URL: https://issues.apache.org/jira/browse/SPARK-18236
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Josh Rosen
Assignee: Josh Rosen


When profiling heap dumps from the Spark History Server and live Spark web UIs, 
I found a tremendous amount of memory being wasted on duplicate objects and 
strings. A few small changes can cut per-task UI memory by half or more.
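A minimal sketch of the kind of change involved, assuming Guava's Interners is available on the classpath (the helper and field names here are hypothetical, not the actual patch): intern strings that repeat across tasks, such as host names and executor IDs, so equal values share one object.

{code}
// Hypothetical sketch only: intern repeated strings (e.g. host names, executor IDs)
// so that equal values share a single object instead of one copy per task row.
import com.google.common.collect.Interners

object UIStringDedup {
  private val interner = Interners.newWeakInterner[String]()

  def weakIntern(s: String): String =
    if (s == null) null else interner.intern(s)
}

// Usage sketch when building per-task UI data:
// taskUIData.host = UIStringDedup.weakIntern(taskInfo.host)
{code}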



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17470) unify path for data source table and locationUri for hive serde table

2016-11-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17470.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15024
[https://github.com/apache/spark/pull/15024]

> unify path for data source table and locationUri for hive serde table
> -
>
> Key: SPARK-17470
> URL: https://issues.apache.org/jira/browse/SPARK-17470
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14443) parse_url() does not escape query parameters

2016-11-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631103#comment-15631103
 ] 

Hyukjin Kwon commented on SPARK-14443:
--

I just tried this too. It seems we can do this by manually escaping them, as 
shown below:
{code}
spark-sql> select 
parse_url('http://1168.xg4ken.com/media/redir.php?prof=457&camp=67116&affcode=kw54&k_inner_url_encoded=1&cid=adwords&kdv=Desktop&url[]=http%3A%2F%2Fwww.landroverusa.com%2Fvehicles%2Frange-rover-sport-off-road-suv%2Findex.html%3Futm_content%3Dcontent%26utm_source%fb%26utm_medium%3Dcpc%26utm_term%3DAdwords_Brand_Range_Rover_Sport%26utm_campaign%3DFB_Land_Rover_Brand',
 'QUERY', 'url\\[\\]');
http%3A%2F%2Fwww.landroverusa.com%2Fvehicles%2Frange-rover-sport-off-road-suv%2Findex.html%3Futm_content%3Dcontent%26utm_source%fb%26utm_medium%3Dcpc%26utm_term%3DAdwords_Brand_Range_Rover_Sport%26utm_campaign%3DFB_Land_Rover_Brand
{code}

> parse_url() does not escape query parameters
> 
>
> Key: SPARK-14443
> URL: https://issues.apache.org/jira/browse/SPARK-14443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks
>Reporter: Simeon Simeonov
>  Labels: functions, sql
>
> To reproduce, run the following SparkSQL statement:
> {code}
> select 
> parse_url('http://1168.xg4ken.com/media/redir.php?prof=457&camp=67116&affcode=kw54&k_inner_url_encoded=1&cid=adwords&kdv=Desktop&url[]=http%3A%2F%2Fwww.landroverusa.com%2Fvehicles%2Frange-rover-sport-off-road-suv%2Findex.html%3Futm_content%3Dcontent%26utm_source%fb%26utm_medium%3Dcpc%26utm_term%3DAdwords_Brand_Range_Rover_Sport%26utm_campaign%3DFB_Land_Rover_Brand',
>  'QUERY', 'url[]')
> {code}
> The exception is ultimately caused by
> {code}
> java.util.regex.PatternSyntaxException: Unclosed character class near index 17
> (&|^)url[]=([^&]*)
>  ^
> {code}
> Looks like the code is building a regex internally without escaping the 
> passed in query parameter name.
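If a fix goes in along those lines, a minimal sketch (the helper below is hypothetical, not the actual Spark internals) is to quote the user-supplied key before the extraction regex is built:

{code}
import java.util.regex.Pattern

// Quote the key so characters like '[' and ']' are treated literally
// when the "(&|^)key=([^&]*)" extraction regex is assembled.
def extractQueryParam(query: String, key: String): Option[String] = {
  val regex = Pattern.compile("(&|^)" + Pattern.quote(key) + "=([^&]*)")
  val matcher = regex.matcher(query)
  if (matcher.find()) Option(matcher.group(2)) else None
}

// extractQueryParam("a=1&url[]=http%3A%2F%2Fexample.com", "url[]")
// => Some("http%3A%2F%2Fexample.com")
{code}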



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18193) queueStream not updated if rddQueue.add after create queueStream in Java

2016-11-02 Thread Hubert Kang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hubert Kang reopened SPARK-18193:
-

This is inconsistent with the behavior in QueueStream.scala, where data pushed to 
the queue after ssc.start() is still recognized successfully (see the sketch 
below).

That behavior is expected, so that live data streams can be handled.
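A minimal Scala sketch of the QueueStream.scala pattern being referred to (app name, batch interval and data are illustrative): RDDs added to the queue after ssc.start() are still picked up by the stream.

{code}
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("QueueStreamSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// The queue that feeds the stream.
val rddQueue = new mutable.Queue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
inputStream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

ssc.start()

// Data is pushed *after* start() and is still recognized, as in QueueStream.scala.
for (_ <- 1 to 30) {
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  }
  Thread.sleep(1000)
}
ssc.stop()
{code}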

> queueStream not updated if rddQueue.add after create queueStream in Java
> 
>
> Key: SPARK-18193
> URL: https://issues.apache.org/jira/browse/SPARK-18193
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Hubert Kang
>
> Within 
> examples\src\main\java\org\apache\spark\examples\streaming\JavaQueueStream.java,
>  no data is detected if the code below, which puts something into rddQueue, is 
> executed after queueStream is created (line 65).
> for (int i = 0; i < 30; i++) {
>   rddQueue.add(ssc.sparkContext().parallelize(list));
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14319) Speed up group-by aggregates

2016-11-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631091#comment-15631091
 ] 

Hyukjin Kwon commented on SPARK-14319:
--

Hey [~sameerag], I guess we might be able to take an action to this JIRA too.

> Speed up group-by aggregates
> 
>
> Key: SPARK-14319
> URL: https://issues.apache.org/jira/browse/SPARK-14319
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>
> Aggregates with key in SparkSQL are almost 30x slower than aggregates without 
> key. This master JIRA tracks our attempts to optimize them.
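A rough, illustrative sketch of the two query shapes being compared, assuming a spark-shell session where `spark` is in scope (data size and key cardinality are arbitrary, not from this JIRA):

{code}
// Not a rigorous benchmark; it only shows the shape of the comparison.
def timeIt[T](label: String)(f: => T): T = {
  val start = System.nanoTime()
  val result = f
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

val df = spark.range(100L * 1000 * 1000)

// Aggregate without a grouping key.
timeIt("no key") { df.selectExpr("sum(id)").collect() }

// Aggregate with a grouping key.
timeIt("group-by key") { df.selectExpr("id % 1024 as k", "id").groupBy("k").sum("id").collect() }
{code}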



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14197) Error: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to name on unresolved object, tree: unresolvedalias(if ((imei#33365 = 1AA10007)) imei#3336

2016-11-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon closed SPARK-14197.

Resolution: Cannot Reproduce

I am closing this as Cannot Reproduce as I can't reproduce this against the 
current master.

{code}
scala> val df = spark.range(1).selectExpr("'12AA34' as imei", "'123' as MAC")
df: org.apache.spark.sql.DataFrame = [imei: string, MAC: string]

scala> df.createOrReplaceTempView("babu1")

scala> spark.sql("select count(*) from babu1 where imei in ( select  if (imei = 
'1AA10007',imei,NULL)  from babu1)").show()
++
|count(1)|
++
|   0|
++
{code}

> Error: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid 
> call to name on unresolved object, tree: unresolvedalias(if ((imei#33365 = 
> 1AA10007)) imei#33365 else cast(null as string)) (state=,code=0)
> --
>
> Key: SPARK-14197
> URL: https://issues.apache.org/jira/browse/SPARK-14197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Chetan Bhat
>Priority: Minor
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> In Beeline when the SQL is executed using Spark the following error occurs.
> select count(*) from babu1 where imei in ( select  if (imei = 
> '1AA10007',imei,NULL)  from babu1);
> Error: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid 
> call to name on unresolved object, tree: unresolvedalias(if ((imei#33365 = 
> 1AA10007)) imei#33365 else cast(null as string)) (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14195) Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 'select' 'MAC' 'from' in expression specification; line 1 pos 16 (state=,code=0)

2016-11-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631071#comment-15631071
 ] 

Hyukjin Kwon edited comment on SPARK-14195 at 11/3/16 12:52 AM:


I am closing this as Cannot Reproduce as I can't reproduce it against the current master.

{code}
scala> val df = spark.range(1).selectExpr("'12AA34' as imei", "'123' as MAC")
df: org.apache.spark.sql.DataFrame = [imei: string, MAC: string]

scala> df.createOrReplaceTempView("testolap")

scala> spark.sql("select a.imei, (select MAC from testolap where imei like 
'%AA%'), MAC from testolap a limit 10").show()
+--++---+
|  imei|scalarsubquery()|MAC|
+--++---+
|12AA34| 123|123|
+--++---+
{code}


was (Author: hyukjin.kwon):
I am closing this as Cannot Reproduce as I can't against the current master.

{code}
scala> spark.sql("select a.imei, (select MAC from testolap where imei like 
'%AA%'), MAC from testolap a limit 10").show()
+--++---+
|  imei|scalarsubquery()|MAC|
+--++---+
|12AA34| 123|123|
+--++---+
{code}

> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> 'select' 'MAC' 'from' in expression specification; line 1 pos 16 
> (state=,code=0)
> ---
>
> Key: SPARK-14195
> URL: https://issues.apache.org/jira/browse/SPARK-14195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: SUSE 11
>Reporter: Chetan Bhat
>Priority: Minor
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> In Beeline when the SQL is executed using Spark the following error is 
> displayed.
> select a.imei, (select MAC from testolap where imei like '%AA%' 1) MAC from 
> testolap a limit 10;
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> 'select' 'MAC' 'from' in expression specification; line 1 pos 16 
> (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14196) Error: org.apache.spark.sql.AnalysisException: cannot resolve 'query_alias_fix_conflicts_0._c0'; (state=,code=0)

2016-11-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon closed SPARK-14196.

Resolution: Cannot Reproduce

I am closing this as Cannot Reproduce as I can't reproduce it against the current master.

{code}
scala> val df = spark.range(1).selectExpr("'12AA34' as imei", "'123' as MAC")
df: org.apache.spark.sql.DataFrame = [imei: string, MAC: string]

scala> df.createOrReplaceTempView("testolap")

scala> spark.sql("select count(*) from testolap where imei in ( select case 
when imei like '%007%' then imei end from testolap)").show()
++
|count(1)|
++
|   0|
++
{code}

> Error: org.apache.spark.sql.AnalysisException: cannot resolve 
> 'query_alias_fix_conflicts_0._c0'; (state=,code=0)
> 
>
> Key: SPARK-14196
> URL: https://issues.apache.org/jira/browse/SPARK-14196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Chetan Bhat
>Priority: Minor
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> In Beeline when the SQL is executed using Spark the following error is 
> displayed.
> select count(*) from testolap where imei in ( select case when imei like 
> '%007%' then imei end from testolap);
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 
> 'query_alias_fix_conflicts_0._c0'; (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14195) Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 'select' 'MAC' 'from' in expression specification; line 1 pos 16 (state=,code=0)

2016-11-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon closed SPARK-14195.

Resolution: Cannot Reproduce

I am closing this as Cannot Reproduce as I can't reproduce it against the current master.

{code}
scala> spark.sql("select a.imei, (select MAC from testolap where imei like 
'%AA%'), MAC from testolap a limit 10").show()
+--++---+
|  imei|scalarsubquery()|MAC|
+--++---+
|12AA34| 123|123|
+--++---+
{code}

> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> 'select' 'MAC' 'from' in expression specification; line 1 pos 16 
> (state=,code=0)
> ---
>
> Key: SPARK-14195
> URL: https://issues.apache.org/jira/browse/SPARK-14195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: SUSE 11
>Reporter: Chetan Bhat
>Priority: Minor
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> In Beeline when the SQL is executed using Spark the following error is 
> displayed.
> select a.imei, (select MAC from testolap where imei like '%AA%' 1) MAC from 
> testolap a limit 10;
> Error: org.apache.spark.sql.AnalysisException: cannot recognize input near 
> 'select' 'MAC' 'from' in expression specification; line 1 pos 16 
> (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16808) History Server main page does not honor APPLICATION_WEB_PROXY_BASE

2016-11-02 Thread Vinayak Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630940#comment-15630940
 ] 

Vinayak Joshi commented on SPARK-16808:
---

Added a pull request with my take on this issue as well.

> History Server main page does not honor APPLICATION_WEB_PROXY_BASE
> --
>
> Key: SPARK-16808
> URL: https://issues.apache.org/jira/browse/SPARK-16808
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> The root of the history server is rendered dynamically with javascript, and 
> this doesn't honor APPLICATION_WEB_PROXY_BASE: 
> https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/historypage-template.html#L67
> Other links in the history server do honor it: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L146
> This means the links on the history server root page are broken when deployed 
> behind a proxy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16808) History Server main page does not honor APPLICATION_WEB_PROXY_BASE

2016-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630935#comment-15630935
 ] 

Apache Spark commented on SPARK-16808:
--

User 'vijoshi' has created a pull request for this issue:
https://github.com/apache/spark/pull/15742

> History Server main page does not honor APPLICATION_WEB_PROXY_BASE
> --
>
> Key: SPARK-16808
> URL: https://issues.apache.org/jira/browse/SPARK-16808
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Michael Gummelt
>
> The root of the history server is rendered dynamically with javascript, and 
> this doesn't honor APPLICATION_WEB_PROXY_BASE: 
> https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/historypage-template.html#L67
> Other links in the history server do honor it: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L146
> This means the links on the history server root page are broken when deployed 
> behind a proxy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll

2016-11-02 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-18235:
---
Description: 
For function parity with MatrixFactorizationModel, ALS model should support API:
recommendUsersForProducts
recommendProductsForUsers

There are two other APIs:
recommendProducts
recommendUsers

The function requirement comes from the mailing list: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-using-collaborative-filtering-in-MLlib-td19677.html
 


  was:
For function parity, ALS model should support API:



> ml.ALSModel function parity: ALSModel should support recommendforAll
> 
>
> Key: SPARK-18235
> URL: https://issues.apache.org/jira/browse/SPARK-18235
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> For function parity with MatrixFactorizationModel, ALS model should support 
> API:
> recommendUsersForProducts
> recommendProductsForUsers
> There are two other APIs:
> recommendProducts
> recommendUsers
> The function requirement comes from the mailing list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-using-collaborative-filtering-in-MLlib-td19677.html
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll

2016-11-02 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630878#comment-15630878
 ] 

yuhao yang commented on SPARK-18235:


I can work on the implementation. I'd appreciate any suggestions.
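For reference, a short usage sketch of the spark.mllib counterparts being matched; the two method names are those on MatrixFactorizationModel, while the wrapper itself is illustrative:

{code}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def topKForAll(model: MatrixFactorizationModel, k: Int): Unit = {
  // Top-k products for every user: RDD[(userId, Array[Rating])]
  val productsForUsers = model.recommendProductsForUsers(k)
  // Top-k users for every product: RDD[(productId, Array[Rating])]
  val usersForProducts = model.recommendUsersForProducts(k)
  println(s"${productsForUsers.count()} users, ${usersForProducts.count()} products")
}
{code}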

> ml.ALSModel function parity: ALSModel should support recommendforAll
> 
>
> Key: SPARK-18235
> URL: https://issues.apache.org/jira/browse/SPARK-18235
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> For function parity with MatrixFactorizationModel, ALS model should support 
> API:
> recommendUsersForProducts
> recommendProductsForUsers
> There are two other APIs:
> recommendProducts
> recommendUsers
> The function requirement comes from the mailing list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-using-collaborative-filtering-in-MLlib-td19677.html
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll

2016-11-02 Thread yuhao yang (JIRA)
yuhao yang created SPARK-18235:
--

 Summary: ml.ALSModel function parity: ALSModel should support 
recommendforAll
 Key: SPARK-18235
 URL: https://issues.apache.org/jira/browse/SPARK-18235
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: yuhao yang


For function parity, ALS model should support API:




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630864#comment-15630864
 ] 

Dongjoon Hyun commented on SPARK-18200:
---

The described scenario is also tested.
{code}
scala> import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

scala> val filepath = "/tmp/ca-HepTh.txt"

scala> val graph = GraphLoader.edgeListFile(sc, filepath, 
true).partitionBy(PartitionStrategy.RandomVertexCut)

scala> val triCounts = graph.triangleCount().vertices

scala> triCounts.toDF().show()
+-+---+
|   _1| _2|
+-+---+
|50130|  2|
|20484| 11|
|10598|190|
|31760| 29|
{code}

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6 and the query completes without any 
> problems: http://bit.ly/2fATO1M
> As well, running the GraphFrames version of this code runs as well (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stackoverflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0

2016-11-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630855#comment-15630855
 ] 

Reynold Xin commented on SPARK-18086:
-

Because the execution code no longer depends on Hive's internals. Only the 
catalog (metastore) depends on Hive internals, and I believe it doesn't use 
session variables.


> Regression: Hive variables no longer work in Spark 2.0
> --
>
> Key: SPARK-18086
> URL: https://issues.apache.org/jira/browse/SPARK-18086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ryan Blue
>
> The behavior of variables in the SQL shell has changed from 1.6 to 2.0. 
> Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer 
> work. Queries that worked correctly in 1.6 will either fail or produce 
> unexpected results in 2.0 so I think this is a regression that should be 
> addressed.
> Hive and Spark 1.6 work like this:
> 1. Command-line args --hiveconf and --hivevar can be used to set session 
> properties. --hiveconf properties are added to the Hadoop Configuration.
> 2. {{SET}} adds a Hive Configuration property, {{SET hivevar:=}} 
> adds a Hive var.
> 3. Hive vars can be substituted into queries by name, and Configuration 
> properties can be substituted using {{hiveconf:name}}.
> In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then 
> the value in SQLConf for the rest of the key is returned. SET adds properties 
> to the session config and (according to [a 
> comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28])
>  the Hadoop configuration "during I/O".
> {code:title=Hive and Spark 1.6.1 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> ${test.conf}
> spark-sql> select "${hivevar:test.var}";
> 2
> spark-sql> select "${test.var}";
> 2
> spark-sql> set test.set=3;
> SET test.set=3
> spark-sql> select "${test.set}"
> "${test.set}"
> spark-sql> select "${hivevar:test.set}"
> "${hivevar:test.set}"
> spark-sql> select "${hiveconf:test.set}"
> 3
> spark-sql> set hivevar:test.setvar=4;
> SET hivevar:test.setvar=4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> 4
> {code}
> {code:title=Spark 2.0.0 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> 1
> spark-sql> select "${hivevar:test.var}";
> ${hivevar:test.var}
> spark-sql> select "${test.var}";
> ${test.var}
> spark-sql> set test.set=3;
> test.set3
> spark-sql> select "${test.set}";
> 3
> spark-sql> set hivevar:test.setvar=4;
> hivevar:test.setvar  4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> ${test.setvar}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18024) Introduce an internal commit protocol API along with OutputCommitter implementation

2016-11-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18024:

Summary: Introduce an internal commit protocol API along with 
OutputCommitter implementation  (was: Introduce a commit protocol API along 
with OutputCommitter implementation)

> Introduce an internal commit protocol API along with OutputCommitter 
> implementation
> ---
>
> Key: SPARK-18024
> URL: https://issues.apache.org/jira/browse/SPARK-18024
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> This commit protocol API should wrap around Hadoop's output committer. Later 
> we can expand the API to cover streaming commits.
> The existing Hadoop output committer API is insufficient for streaming use 
> cases:
> 1. It has no way for tasks to pass information back to the driver.
> 2. It relies on the weird Hadoop hashmap to pass information from the driver 
> to the executors, largely because there is no support for language 
> integration and serialization in Hadoop MapReduce. Spark has more natural 
> support for passing information through automatic closure serialization.
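A hedged sketch of what such an API might look like; the names and signatures below are hypothetical, not the actual Spark interface. The point is that each task returns a serializable message that the driver collects at job commit, instead of going through the Hadoop configuration map.

{code}
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}

// Serializable message a task sends back to the driver at commit time.
case class TaskCommitMessage(payload: Any) extends Serializable

abstract class CommitProtocolSketch extends Serializable {
  def setupJob(jobContext: JobContext): Unit
  def setupTask(taskContext: TaskAttemptContext): Unit

  // Task side: commit and return information to the driver.
  def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage
  def abortTask(taskContext: TaskAttemptContext): Unit

  // Driver side: sees every task's message when committing the job.
  def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit
  def abortJob(jobContext: JobContext): Unit
}
{code}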



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0

2016-11-02 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630849#comment-15630849
 ] 

Ryan Blue commented on SPARK-18086:
---

What is the rationale for propagating configuration but not variables?

This also handles the case where there are collisions, though that is unlikely.

> Regression: Hive variables no longer work in Spark 2.0
> --
>
> Key: SPARK-18086
> URL: https://issues.apache.org/jira/browse/SPARK-18086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ryan Blue
>
> The behavior of variables in the SQL shell has changed from 1.6 to 2.0. 
> Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer 
> work. Queries that worked correctly in 1.6 will either fail or produce 
> unexpected results in 2.0 so I think this is a regression that should be 
> addressed.
> Hive and Spark 1.6 work like this:
> 1. Command-line args --hiveconf and --hivevar can be used to set session 
> properties. --hiveconf properties are added to the Hadoop Configuration.
> 2. {{SET}} adds a Hive Configuration property, {{SET hivevar:=}} 
> adds a Hive var.
> 3. Hive vars can be substituted into queries by name, and Configuration 
> properties can be substituted using {{hiveconf:name}}.
> In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then 
> the value in SQLConf for the rest of the key is returned. SET adds properties 
> to the session config and (according to [a 
> comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28])
>  the Hadoop configuration "during I/O".
> {code:title=Hive and Spark 1.6.1 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> ${test.conf}
> spark-sql> select "${hivevar:test.var}";
> 2
> spark-sql> select "${test.var}";
> 2
> spark-sql> set test.set=3;
> SET test.set=3
> spark-sql> select "${test.set}"
> "${test.set}"
> spark-sql> select "${hivevar:test.set}"
> "${hivevar:test.set}"
> spark-sql> select "${hiveconf:test.set}"
> 3
> spark-sql> set hivevar:test.setvar=4;
> SET hivevar:test.setvar=4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> 4
> {code}
> {code:title=Spark 2.0.0 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> 1
> spark-sql> select "${hivevar:test.var}";
> ${hivevar:test.var}
> spark-sql> select "${test.var}";
> ${test.var}
> spark-sql> set test.set=3;
> test.set3
> spark-sql> select "${test.set}";
> 3
> spark-sql> set hivevar:test.setvar=4;
> hivevar:test.setvar  4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> ${test.setvar}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18024) Introduce a commit protocol API along with OutputCommitter implementation

2016-11-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18024:

Description: 
This commit protocol API should wrap around Hadoop's output committer. Later we 
can expand the API to cover streaming commits.

The existing Hadoop output committer API is insufficient for streaming use 
cases:

1. It has no way for tasks to pass information back to the driver.

2. It relies on the weird Hadoop hashmap to pass information from the driver to 
the executors, largely because there is no support for language integration and 
serialization in Hadoop MapReduce. Spark has more natural support for passing 
information through automatic closure serialization.


  was:
This commit protocol API should wrap around Hadoop's output committer. Later we 
can expand the API to cover streaming commits.



> Introduce a commit protocol API along with OutputCommitter implementation
> -
>
> Key: SPARK-18024
> URL: https://issues.apache.org/jira/browse/SPARK-18024
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> This commit protocol API should wrap around Hadoop's output committer. Later 
> we can expand the API to cover streaming commits.
> The existing Hadoop output committer API is insufficient for streaming use 
> cases:
> 1. It has no way for tasks to pass information back to the driver.
> 2. It relies on the weird Hadoop hashmap to pass information from the driver 
> to the executors, largely because there is no support for language 
> integration and serialization in Hadoop MapReduce. Spark has more natural 
> support for passing information through automatic closure serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18200:

Target Version/s: 2.0.3, 2.1.0

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6 and the query completes without any 
> problems: http://bit.ly/2fATO1M
> As well, running the GraphFrames version of this code runs as well (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stackoverflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18214) Simplify RuntimeReplaceable type coercion

2016-11-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18214.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Simplify RuntimeReplaceable type coercion
> -
>
> Key: SPARK-18214
> URL: https://issues.apache.org/jira/browse/SPARK-18214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> RuntimeReplaceable is used to create aliases for expressions, but the way it 
> deals with type coercion is pretty weird (each expression is responsible for 
> how to handle type coercion, which does not obey the normal implicit type 
> cast rules).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630821#comment-15630821
 ] 

Dongjoon Hyun edited comment on SPARK-18200 at 11/2/16 10:58 PM:
-

Actually, there is a node that doesn't have any neighbors, so a `VertexSet` was 
requested with zero initial capacity.


was (Author: dongjoon):
Actually, there is a node whose don't have neighbor. So, it requested to create 
`VertexSet` with zero initial capacity.
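As an illustration of the precondition being tripped (a toy stand-in only, since OpenHashSet itself is private[spark] and this is not its actual code):

{code}
// Mirrors the check that rejects a zero initial capacity.
class ToyOpenHashSet(initialCapacity: Int) {
  require(initialCapacity > 0, s"Invalid initial capacity: $initialCapacity")
}

new ToyOpenHashSet(4)  // fine
new ToyOpenHashSet(0)  // java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity: 0
                       // i.e. what a vertex with no neighbors ends up requesting
{code}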

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6 and the query completes without any 
> problems: http://bit.ly/2fATO1M
> As well, running the GraphFrames version of this code runs as well (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stackoverflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630821#comment-15630821
 ] 

Dongjoon Hyun commented on SPARK-18200:
---

Actually, there is a node that doesn't have any neighbors, so a `VertexSet` was 
requested with zero initial capacity.

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6 and the query completes without any 
> problems: http://bit.ly/2fATO1M
> As well, running the GraphFrames version of this code runs as well (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stackoverflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18200:


Assignee: Apache Spark

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>Assignee: Apache Spark
>  Labels: graph, graphx
>
> Running GraphX triangle count on large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6 and the query completes without any 
> problems: http://bit.ly/2fATO1M
> As well, running the GraphFrames version of this code runs as well (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stackoverflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18200:


Assignee: (was: Apache Spark)

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6 and the query completes without any 
> problems: http://bit.ly/2fATO1M
> As well, running the GraphFrames version of this code runs as well (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stackoverflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630814#comment-15630814
 ] 

Apache Spark commented on SPARK-18200:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15741

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6 and the query completes without any 
> problems: http://bit.ly/2fATO1M
> As well, running the GraphFrames version of this code runs as well (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stackoverflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630798#comment-15630798
 ] 

Dongjoon Hyun commented on SPARK-18200:
---

Hi, [~dennyglee].
It's due to `OpenHashSet`. I'll make a PR for this.

> GraphX Invalid initial capacity when running triangleCount
> --
>
> Key: SPARK-18200
> URL: https://issues.apache.org/jira/browse/SPARK-18200
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: Databricks, Ubuntu 16.04, macOS Sierra
>Reporter: Denny Lee
>  Labels: graph, graphx
>
> Running GraphX triangle count on large-ish file results in the "Invalid 
> initial capacity" error when running on Spark 2.0 (tested on Spark 2.0, 
> 2.0.1, and 2.0.2).  You can see the results at: http://bit.ly/2eQKWDN
> Running the same code on Spark 1.6 and the query completes without any 
> problems: http://bit.ly/2fATO1M
> As well, running the GraphFrames version of this code runs as well (Spark 
> 2.0, GraphFrames 0.2): http://bit.ly/2fAS8W8
> Reference Stackoverflow question:
> Spark GraphX: requirement failed: Invalid initial capacity 
> (http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18232) Support Mesos CNI

2016-11-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630768#comment-15630768
 ] 

Apache Spark commented on SPARK-18232:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/15740

> Support Mesos CNI
> -
>
> Key: SPARK-18232
> URL: https://issues.apache.org/jira/browse/SPARK-18232
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Michael Gummelt
>
> Add the ability to launch containers attached to a CNI network: 
> http://mesos.apache.org/documentation/latest/cni/
> This allows for user-pluggable network isolation, including IP-per-container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18232) Support Mesos CNI

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18232:


Assignee: (was: Apache Spark)

> Support Mesos CNI
> -
>
> Key: SPARK-18232
> URL: https://issues.apache.org/jira/browse/SPARK-18232
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Michael Gummelt
>
> Add the ability to launch containers attached to a CNI network: 
> http://mesos.apache.org/documentation/latest/cni/
> This allows for user-pluggable network isolation, including IP-per-container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18232) Support Mesos CNI

2016-11-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18232:


Assignee: Apache Spark

> Support Mesos CNI
> -
>
> Key: SPARK-18232
> URL: https://issues.apache.org/jira/browse/SPARK-18232
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Michael Gummelt
>Assignee: Apache Spark
>
> Add the ability to launch containers attached to a CNI network: 
> http://mesos.apache.org/documentation/latest/cni/
> This allows for user-pluggable network isolation, including IP-per-container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18234) Update mode in structured streaming

2016-11-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18234:

Summary: Update mode in structured streaming  (was: Update mode)

> Update mode in structured streaming
> ---
>
> Key: SPARK-18234
> URL: https://issues.apache.org/jira/browse/SPARK-18234
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Priority: Critical
>
> We have this internally, but we should nail down the semantics and expose it to 
> users.  The idea of update mode is that any tuple that changes will be 
> emitted.  Open questions:
>  - do we need to reason about the {{keys}} for a given stream?  For things 
> like the {{foreach}} sink it's up to the user.  However, for more end-to-end 
> use cases such as a JDBC sink, we need to know which row downstream is being 
> updated.
>  - okay to not support files?
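A hedged sketch of the {{foreach}} sink case mentioned above, where deciding which columns identify the downstream row is left to the user; the upsert call is hypothetical:

{code}
import org.apache.spark.sql.{ForeachWriter, Row}

class UpsertWriter extends ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true

  def process(row: Row): Unit = {
    // The user decides which column(s) identify the downstream row, e.g. column 0 as the key.
    val key = row.get(0)
    // jdbcUpsert(key, row)  // hypothetical call into a JDBC sink
  }

  def close(errorOrNull: Throwable): Unit = ()
}
{code}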



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18234) Update mode

2016-11-02 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-18234:


 Summary: Update mode
 Key: SPARK-18234
 URL: https://issues.apache.org/jira/browse/SPARK-18234
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Reporter: Michael Armbrust
Priority: Critical


We have this internally, but we should nail down the semantics and expose it to 
users.  The idea of update mode is that any tuple that changes will be emitted. 
 Open questions:
 - do we need to reason about the {{keys}} for a given stream?  For things like 
the {{foreach}} sink it's up to the user.  However, for more end-to-end use 
cases such as a JDBC sink, we need to know which row downstream is being 
updated.
 - okay to not support files?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18230) MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist

2016-11-02 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630741#comment-15630741
 ] 

yuhao yang commented on SPARK-18230:


Perhaps we can use Double.NaN for this case, just to be consistent with 
spark.ml.als.
[~mikaelstaldal] Do you plan to send a PR for this?

> MatrixFactorizationModel.recommendProducts throws NoSuchElement exception 
> when the user does not exist
> --
>
> Key: SPARK-18230
> URL: https://issues.apache.org/jira/browse/SPARK-18230
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.1
>Reporter: Mikael Ståldal
>Priority: Minor
>
> When invoking {{MatrixFactorizationModel.recommendProducts(Int, Int)}} with a 
> non-existing user, a {{java.util.NoSuchElementException}} is thrown:
> {code}
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35)
>   at 
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169)
> {code}
> It would be nice if it returned the empty array, or throwed a more specific 
> exception, and that was documented in ScalaDoc for the method.
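A caller-side workaround sketch under the current behavior (the helper name is made up; it only guards the lookup before calling into the model):

{code}
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

def safeRecommendProducts(model: MatrixFactorizationModel, user: Int, num: Int): Array[Rating] = {
  // userFeatures is an RDD[(Int, Array[Double])]; lookup returns the matching entries.
  if (model.userFeatures.lookup(user).nonEmpty) {
    model.recommendProducts(user, num)
  } else {
    Array.empty[Rating]  // avoid the NoSuchElementException for unknown users
  }
}
{code}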



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0

2016-11-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630718#comment-15630718
 ] 

Reynold Xin commented on SPARK-18086:
-

The thing is that we don't really propagate Hive's session state to most 
places, except for the part that connects to the metastore. What we do propagate 
are the settings from SQLConf, which are sent over to the Hadoop Configuration 
and HiveConf.

It seems like implementing hiveconf parsing in the spark-sql shell would solve 
almost all of your problems.


> Regression: Hive variables no longer work in Spark 2.0
> --
>
> Key: SPARK-18086
> URL: https://issues.apache.org/jira/browse/SPARK-18086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ryan Blue
>
> The behavior of variables in the SQL shell has changed from 1.6 to 2.0. 
> Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer 
> work. Queries that worked correctly in 1.6 will either fail or produce 
> unexpected results in 2.0 so I think this is a regression that should be 
> addressed.
> Hive and Spark 1.6 work like this:
> 1. Command-line args --hiveconf and --hivevar can be used to set session 
> properties. --hiveconf properties are added to the Hadoop Configuration.
> 2. {{SET}} adds a Hive Configuration property, {{SET hivevar:=}} 
> adds a Hive var.
> 3. Hive vars can be substituted into queries by name, and Configuration 
> properties can be substituted using {{hiveconf:name}}.
> In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then 
> the value in SQLConf for the rest of the key is returned. SET adds properties 
> to the session config and (according to [a 
> comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28])
>  the Hadoop configuration "during I/O".
> {code:title=Hive and Spark 1.6.1 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> ${test.conf}
> spark-sql> select "${hivevar:test.var}";
> 2
> spark-sql> select "${test.var}";
> 2
> spark-sql> set test.set=3;
> SET test.set=3
> spark-sql> select "${test.set}"
> "${test.set}"
> spark-sql> select "${hivevar:test.set}"
> "${hivevar:test.set}"
> spark-sql> select "${hiveconf:test.set}"
> 3
> spark-sql> set hivevar:test.setvar=4;
> SET hivevar:test.setvar=4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> 4
> {code}
> {code:title=Spark 2.0.0 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> 1
> spark-sql> select "${hivevar:test.var}";
> ${hivevar:test.var}
> spark-sql> select "${test.var}";
> ${test.var}
> spark-sql> set test.set=3;
> test.set3
> spark-sql> select "${test.set}";
> 3
> spark-sql> set hivevar:test.setvar=4;
> hivevar:test.setvar  4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> ${test.setvar}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16726) Improve `Union/Intersect/Except` error messages on incompatible types

2016-11-02 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630657#comment-15630657
 ] 

Dongjoon Hyun commented on SPARK-16726:
---

You're welcome. Thank you, [~nchammas]!

> Improve `Union/Intersect/Except` error messages on incompatible types
> -
>
> Key: SPARK-16726
> URL: https://issues.apache.org/jira/browse/SPARK-16726
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, `UNION/INTERSECT/EXCEPT` query on incompatible types shows a 
> misleading error message like `unresolved operator Union`. We had better show 
> a more correct message. This will help users in the situation of 
> [SPARK-16704|https://issues.apache.org/jira/browse/SPARK-16704]
> h4. Before
> {code}
> scala> sql("select 1,2,3 union (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
> scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Intersect;
> scala> sql("select 1,2,3 except (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Except;
> {code}
> h4. After
> {code}
> scala> sql("select 1,2,3 union (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. The first table has `[IntegerType, 
> IntegerType, IntegerType]` and second table has `[IntegerType, 
> ArrayType(IntegerType,false), IntegerType]`. The 2th column is incompatible;
> scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: Intersect can only be performed on 
> tables with the compatible column types. The first table has `[IntegerType, 
> IntegerType, IntegerType]` and second table has `[IntegerType, 
> ArrayType(IntegerType,false), IntegerType]`. The 2th column is incompatible;
> scala> sql("select 1,2,3 except (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: Except can only be performed on 
> tables with the compatible column types. The first table has `[IntegerType, 
> IntegerType, IntegerType]` and second table has `[IntegerType, 
> ArrayType(IntegerType,false), IntegerType]`. The 2th column is incompatible;
> {code}
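> A minimal, illustrative sketch of one way a user can resolve the reported 
> mismatch: make the second columns the same type on both sides before the set 
> operation, e.g. by wrapping the scalar in {{array(...)}} or casting both sides 
> to a common type:
> {code}
> scala> sql("select 1, array(2), 3 union (select 1, array(3), 3)")  // second columns are now both array<int>, so the union resolves
> {code}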



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend

2016-11-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630625#comment-15630625
 ] 

Felix Cheung commented on SPARK-18131:
--

See https://issues.apache.org/jira/browse/SPARK-18226


> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector on the 
> backend (Scala) side. When we do collect(select(df, "probabilityCol")), the 
> backend returns the Java object handle (memory address). We need to implement 
> a method to convert a Vector/DenseVector column to an R vector that can be 
> read in SparkR. This is a follow-up to the JIRA that added `spark.logit`.
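> A possible interim workaround sketch on the Scala side (not the planned SparkR 
> fix; the DataFrame name `predictions` and the column name "probability" are 
> assumptions for illustration): expose the vector as a plain array<double> 
> column, which SparkR should then collect as ordinary lists of doubles rather 
> than object handles.
> {code}
> import org.apache.spark.ml.linalg.Vector
> import org.apache.spark.sql.functions.{col, udf}
> // Convert the ML vector column into an array<double> column.
> val vecToArray = udf((v: Vector) => v.toArray)
> val flat = predictions.withColumn("probabilityArr", vecToArray(col("probability")))
> // collect(select(flat, "probabilityArr")) in SparkR then returns plain numbers.
> {code}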



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18226) SparkR displaying vector columns in incorrect way

2016-11-02 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630623#comment-15630623
 ] 

Felix Cheung commented on SPARK-18226:
--

Thanks, this is actually the issue outlined in 
https://issues.apache.org/jira/browse/SPARK-18131
But it's good to have a concrete example.

> SparkR displaying vector columns in incorrect way
> -
>
> Key: SPARK-18226
> URL: https://issues.apache.org/jira/browse/SPARK-18226
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Trivial
>
> I have encountered a problem with SparkR's presentation of Spark vectors from 
> the org.apache.spark.mllib.linalg package:
> * `head(df)` shows "" in the vector column
> * casting to string does not work as expected; it shows: 
> "[1,null,null,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@79f50a91]"
> * `showDF(df)` works correctly
> To reproduce, start SparkR and paste the following code (example taken from 
> https://spark.apache.org/docs/latest/sparkr.html#naive-bayes-model):
> {code}
> # Fit a Bernoulli naive Bayes model with spark.naiveBayes
> titanic <- as.data.frame(Titanic)
> titanicDF <- createDataFrame(titanic[titanic$Freq > 0, -5])
> nbDF <- titanicDF
> nbTestDF <- titanicDF
> nbModel <- spark.naiveBayes(nbDF, Survived ~ Class + Sex + Age)
> # Model summary
> summary(nbModel)
> # Prediction
> nbPredictions <- predict(nbModel, nbTestDF)
> #
> # My modification to expose the problem #
> nbPredictions$rawPrediction_str <- cast(nbPredictions$rawPrediction, "string")
> head(nbPredictions)
> showDF(nbPredictions)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16726) Improve `Union/Intersect/Except` error messages on incompatible types

2016-11-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630597#comment-15630597
 ] 

Nicholas Chammas commented on SPARK-16726:
--

I just hit this error in 2.0.1 and it was this JIRA that helped me figure out 
what was going on. Thanks for addressing this issue [~dongjoon]!

> Improve `Union/Intersect/Except` error messages on incompatible types
> -
>
> Key: SPARK-16726
> URL: https://issues.apache.org/jira/browse/SPARK-16726
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, a `UNION/INTERSECT/EXCEPT` query on incompatible types shows a 
> misleading error message like `unresolved operator Union`. We should show a 
> clearer message instead. This will help users in situations like 
> [SPARK-16704|https://issues.apache.org/jira/browse/SPARK-16704].
> h4. Before
> {code}
> scala> sql("select 1,2,3 union (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
> scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Intersect;
> scala> sql("select 1,2,3 except (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Except;
> {code}
> h4. After
> {code}
> scala> sql("select 1,2,3 union (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. The first table has `[IntegerType, 
> IntegerType, IntegerType]` and second table has `[IntegerType, 
> ArrayType(IntegerType,false), IntegerType]`. The 2th column is incompatible;
> scala> sql("select 1,2,3 intersect (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: Intersect can only be performed on 
> tables with the compatible column types. The first table has `[IntegerType, 
> IntegerType, IntegerType]` and second table has `[IntegerType, 
> ArrayType(IntegerType,false), IntegerType]`. The 2th column is incompatible;
> scala> sql("select 1,2,3 except (select 1,array(2),3)")
> org.apache.spark.sql.AnalysisException: Except can only be performed on 
> tables with the compatible column types. The first table has `[IntegerType, 
> IntegerType, IntegerType]` and second table has `[IntegerType, 
> ArrayType(IntegerType,false), IntegerType]`. The 2th column is incompatible;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18233) Failed to deserialize the task

2016-11-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18233:
--

 Summary: Failed to deserialize the task
 Key: SPARK-18233
 URL: https://issues.apache.org/jira/browse/SPARK-18233
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


{code}

16/11/02 18:36:32 ERROR Executor: Exception in task 652.0 in stage 27.0 (TID 21101)
java.io.InvalidClassException: org.apache.spark.executor.TaskMet; serializable and externalizable flags conflict
    at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:698)
    at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:831)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1602)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18232) Support Mesos CNI

2016-11-02 Thread Michael Gummelt (JIRA)
Michael Gummelt created SPARK-18232:
---

 Summary: Support Mesos CNI
 Key: SPARK-18232
 URL: https://issues.apache.org/jira/browse/SPARK-18232
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Michael Gummelt


Add the ability to launch containers attached to a CNI network: 
http://mesos.apache.org/documentation/latest/cni/

This allows for user-pluggable network isolation, including IP-per-container.
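One possible shape for the user-facing knob, sketched here for illustration only 
(the configuration key name below is an assumption, not a committed API):

{code}
// Hypothetical sketch: attach the driver/executor containers to a named CNI network.
val conf = new org.apache.spark.SparkConf()
  .setMaster("mesos://leader.mesos:5050")
  .set("spark.mesos.network.name", "my-cni-network")  // assumed key; each container would get its own IP on this network
{code}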



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


