[jira] [Assigned] (SPARK-19390) Replace the unnecessary usages of hiveQlTable

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19390:


Assignee: Apache Spark  (was: Xiao Li)

> Replace the unnecessary usages of hiveQlTable
> -
>
> Key: SPARK-19390
> URL: https://issues.apache.org/jira/browse/SPARK-19390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> `catalogTable` is the native table metadata structure for Spark SQL. Thus, we 
> should avoid using Hive's table metadata structure `Table` in our code base 
> and remove these unnecessary usages.






[jira] [Commented] (SPARK-19390) Replace the unnecessary usages of hiveQlTable

2017-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843932#comment-15843932
 ] 

Apache Spark commented on SPARK-19390:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16726

> Replace the unnecessary usages of hiveQlTable
> -
>
> Key: SPARK-19390
> URL: https://issues.apache.org/jira/browse/SPARK-19390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> `catalogTable` is the native table metadata structure for Spark SQL. Thus, we 
> should avoid using Hive's table metadata structure `Table` in our code base 
> and remove these unnecessary usages.






[jira] [Assigned] (SPARK-19390) Replace the unnecessary usages of hiveQlTable

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19390:


Assignee: Xiao Li  (was: Apache Spark)

> Replace the unnecessary usages of hiveQlTable
> -
>
> Key: SPARK-19390
> URL: https://issues.apache.org/jira/browse/SPARK-19390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> `catalogTable` is the native table metadata structure for Spark SQL. Thus, we 
> should avoid using Hive's table metadata structure `Table` in our code base 
> and remove these unnecessary usages.






[jira] [Created] (SPARK-19390) Replace the unnecessary usages of hiveQlTable

2017-01-27 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19390:
---

 Summary: Replace the unnecessary usages of hiveQlTable
 Key: SPARK-19390
 URL: https://issues.apache.org/jira/browse/SPARK-19390
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


`catalogTable` is the native table metadata structure for Spark SQL. Thus, we 
should avoid using Hive's table metadata structure `Table` in our code base and 
remove these unnecessary usages.







[jira] [Commented] (SPARK-19377) Killed tasks should have the status as KILLED

2017-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843902#comment-15843902
 ] 

Apache Spark commented on SPARK-19377:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/16725

> Killed tasks should have the status as KILLED
> -
>
> Key: SPARK-19377
> URL: https://issues.apache.org/jira/browse/SPARK-19377
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Reporter: Devaraj K
>Priority: Minor
>
> |143  |10 |0  |SUCCESS|NODE_LOCAL |6 / x.xx.x.x
> stdout
> stderr |2017/01/25 07:49:27   |0 ms   |0.0 B / 0  |0.0 B 
> / 0  |TaskKilled (killed intentionally)|
> |156  |11 |0  |SUCCESS|NODE_LOCAL |5 / x.xx.x.x
> stdout
> stderr |2017/01/25 07:49:27   |0 ms   |0.0 B / 0  |0.0 B 
> / 0  |TaskKilled (killed intentionally)|
> Killed tasks show the task status as SUCCESS; I think we should show the 
> status as KILLED for killed tasks.






[jira] [Assigned] (SPARK-19377) Killed tasks should have the status as KILLED

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19377:


Assignee: (was: Apache Spark)

> Killed tasks should have the status as KILLED
> -
>
> Key: SPARK-19377
> URL: https://issues.apache.org/jira/browse/SPARK-19377
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Reporter: Devaraj K
>Priority: Minor
>
> |143  |10 |0  |SUCCESS|NODE_LOCAL |6 / x.xx.x.x
> stdout
> stderr |2017/01/25 07:49:27   |0 ms   |0.0 B / 0  |0.0 B 
> / 0  |TaskKilled (killed intentionally)|
> |156  |11 |0  |SUCCESS|NODE_LOCAL |5 / x.xx.x.x
> stdout
> stderr |2017/01/25 07:49:27   |0 ms   |0.0 B / 0  |0.0 B 
> / 0  |TaskKilled (killed intentionally)|
> Killed tasks show the task status as SUCCESS; I think we should show the 
> status as KILLED for killed tasks.






[jira] [Assigned] (SPARK-19377) Killed tasks should have the status as KILLED

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19377:


Assignee: Apache Spark

> Killed tasks should have the status as KILLED
> -
>
> Key: SPARK-19377
> URL: https://issues.apache.org/jira/browse/SPARK-19377
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Reporter: Devaraj K
>Assignee: Apache Spark
>Priority: Minor
>
> |143  |10 |0  |SUCCESS|NODE_LOCAL |6 / x.xx.x.x
> stdout
> stderr |2017/01/25 07:49:27   |0 ms   |0.0 B / 0  |0.0 B 
> / 0  |TaskKilled (killed intentionally)|
> |156  |11 |0  |SUCCESS|NODE_LOCAL |5 / x.xx.x.x
> stdout
> stderr |2017/01/25 07:49:27   |0 ms   |0.0 B / 0  |0.0 B 
> / 0  |TaskKilled (killed intentionally)|
> Killed tasks show the task status as SUCCESS; I think we should show the 
> status as KILLED for killed tasks.






[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-01-27 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843884#comment-15843884
 ] 

zhengruifeng commented on SPARK-19208:
--

[~josephkb] I have considered the analogy to Double column stats.
But there is a small difference: some temporary intermediate variables are 
shared by multiple metrics.

{code}
val results: DataFrame = df.select(VectorSummary.mean("features"), 
VectorSummary.variance("features"))
{code}

The {{currMean}} and {{weightSum}} are used in both {{VectorSummary.mean}} and 
{{VectorSummary.variance}}. So we may have to compute {{currMean}} and 
{{weightSum}} twice if we use two separate UDAFs.
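
For illustration only, a minimal one-pass sketch (hypothetical names, not an existing Spark API) of how mean and variance can share the same {{weightSum}}/{{currMean}} intermediates instead of recomputing them in two separate UDAFs:

{code}
// Weighted Welford-style update: a single state feeds both statistics.
case class SummaryState(weightSum: Double, mean: Array[Double], m2n: Array[Double])

def update(s: SummaryState, features: Array[Double], weight: Double): SummaryState = {
  val newWeightSum = s.weightSum + weight
  val newMean = s.mean.clone()
  val newM2n = s.m2n.clone()
  var i = 0
  while (i < features.length) {
    val diff = features(i) - s.mean(i)
    newMean(i) = s.mean(i) + (weight / newWeightSum) * diff
    newM2n(i) = s.m2n(i) + weight * diff * (features(i) - newMean(i))
    i += 1
  }
  SummaryState(newWeightSum, newMean, newM2n)
}

// mean(i) = state.mean(i); (biased) variance(i) = state.m2n(i) / state.weightSum --
// both derived from the same shared intermediates in a single pass.
{code}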

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.






[jira] [Commented] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-01-27 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843809#comment-15843809
 ] 

Takeshi Yamamuro commented on SPARK-19104:
--

I think this issue is related to SPARK-18891

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
>   at 

[jira] [Commented] (SPARK-16636) Missing documentation for CalendarIntervalType type in sql-programming-guide.md

2017-01-27 Thread Artem Stasiuk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843801#comment-15843801
 ] 

Artem Stasiuk commented on SPARK-16636:
---

I'm going to take a look at that one.

> Missing documentation for CalendarIntervalType type in 
> sql-programming-guide.md
> ---
>
> Key: SPARK-16636
> URL: https://issues.apache.org/jira/browse/SPARK-16636
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> I just noticed that actually there is {{CalendarIntervalType}} but this is 
> missing in 
> [sql-programming-guide.md|https://github.com/apache/spark/blob/1426a080528bdb470b5e81300d892af45dd188bf/docs/sql-programming-guide.md].
> This type was initially added in SPARK-8753 but renamed to 
> {{CalendarIntervalType}} in SPARK-9430.






[jira] [Resolved] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case

2017-01-27 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19359.
-
   Resolution: Fixed
 Assignee: Song Jun
Fix Version/s: 2.2.0

> partition path created by Hive should be deleted after rename a partition 
> with upper-case
> -
>
> Key: SPARK-19359
> URL: https://issues.apache.org/jira/browse/SPARK-19359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Song Jun
>Assignee: Song Jun
>Priority: Minor
> Fix For: 2.2.0
>
>
> The Hive metastore is not case preserving and keeps partition columns with 
> lower-case names.
> If Spark SQL creates a table with an upper-case partition name using 
> HiveExternalCatalog, then when we rename a partition it first calls the 
> HiveClient renamePartition, which creates a new lower-case partition path; 
> Spark SQL then renames the lower-case path to the upper-case one.
> However, if the renamed partition is more than one level deep, e.g. A=1/B=2, 
> Hive renamePartition changes it to a=1/b=2 and Spark SQL renames it back to 
> A=1/B=2, but the stale a=1 directory still exists in the filesystem; we 
> should delete it as well.
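
For illustration, a minimal sketch of the cleanup described above ({{lowerCasePath}}, {{expectedPath}} and {{hadoopConf}} are hypothetical variables; this is not the actual patch):

{code}
import org.apache.hadoop.fs.Path

// After moving the Hive-created lower-case path (a=1/b=2) to the expected
// upper-case path (A=1/B=2), also remove the stale lower-case parent (a=1)
// if it is now empty.
val fs = lowerCasePath.getFileSystem(hadoopConf)
fs.rename(lowerCasePath, expectedPath)
val staleParent = lowerCasePath.getParent
if (fs.exists(staleParent) && fs.listStatus(staleParent).isEmpty) {
  fs.delete(staleParent, true)
}
{code}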






[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets

2017-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843736#comment-15843736
 ] 

Apache Spark commented on SPARK-19352:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16724

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen when using a smaller sized dataset by making the time 
> span smaller (354MB, with the same number of columns).






[jira] [Assigned] (SPARK-19352) Sorting issues on relatively big datasets

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19352:


Assignee: Apache Spark

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>Assignee: Apache Spark
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen when using a smaller sized dataset by making the time 
> span smaller (354MB, with the same number of columns).






[jira] [Assigned] (SPARK-19352) Sorting issues on relatively big datasets

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19352:


Assignee: (was: Apache Spark)

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen when using a smaller sized dataset by making the time 
> span smaller (354MB, with the same number of columns).






[jira] [Commented] (SPARK-19389) Minor doc fixes, including Since tags in Python Params

2017-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843699#comment-15843699
 ] 

Apache Spark commented on SPARK-19389:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/16723

> Minor doc fixes, including Since tags in Python Params
> --
>
> Key: SPARK-19389
> URL: https://issues.apache.org/jira/browse/SPARK-19389
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> I spotted some doc issues, mainly in Python, when reviewing [SPARK-19336] 
> which were not related to that PR.  This PR fixes them.






[jira] [Assigned] (SPARK-19389) Minor doc fixes, including Since tags in Python Params

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19389:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> Minor doc fixes, including Since tags in Python Params
> --
>
> Key: SPARK-19389
> URL: https://issues.apache.org/jira/browse/SPARK-19389
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> I spotted some doc issues, mainly in Python, when reviewing [SPARK-19336] 
> which were not related to that PR.  This PR fixes them.






[jira] [Assigned] (SPARK-19389) Minor doc fixes, including Since tags in Python Params

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19389:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> Minor doc fixes, including Since tags in Python Params
> --
>
> Key: SPARK-19389
> URL: https://issues.apache.org/jira/browse/SPARK-19389
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> I spotted some doc issues, mainly in Python, when reviewing [SPARK-19336] 
> which were not related to that PR.  This PR fixes them.






[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19247:
--
Component/s: ML

> Improve ml word2vec save/load scalability
> -
>
> Key: SPARK-19247
> URL: https://issues.apache.org/jira/browse/SPARK-19247
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Asher Krim
>
> ml word2vec models can be somewhat large (~4gb is not uncommon). The current 
> save implementation saves the model as a single large datum, which can cause 
> rpc issues and fail to save the model.
> On the loading side, there are issues with loading this large datum as well. 
> This was already solved for mllib word2vec in 
> https://issues.apache.org/jira/browse/SPARK-11994, but the change was never 
> ported to the ml word2vec implementation.






[jira] [Updated] (SPARK-19294) improve LocalLDAModel save/load scaling for large models

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19294:
--
Component/s: ML

> improve LocalLDAModel save/load scaling for large models
> 
>
> Key: SPARK-19294
> URL: https://issues.apache.org/jira/browse/SPARK-19294
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Asher Krim
>
> The LDA model in ml has some of the same problems addressed by 
> https://issues.apache.org/jira/browse/SPARK-19247 for word2vec.
> An LDA model is on the order of `vocabSize` * `k`, which can easily reach 3 GB for 
> k=1000 and vocabSize=3m. It's currently saved as a single datum in 1 
> partition. 
> Instead, we should represent the matrix as a list, and use the logic from 
> https://issues.apache.org/jira/browse/SPARK-11994 to pick a reasonable number 
> of partitions.
> cc [~josephkb]
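
For illustration, a minimal sketch of that approach (variable names such as {{topicMatrix}}, {{k}}, {{vocabSize}} and {{dataPath}} are hypothetical, and the sizing heuristic is only an assumption, not the SPARK-11994 code):

{code}
// Write one row per topic instead of a single huge datum, and derive the
// partition count from the estimated data volume.
val approxSizeInBytes = vocabSize.toLong * k * 8L        // Doubles
val targetPartitionBytes = 64L * 1024 * 1024             // aim for roughly 64MB per partition
val numPartitions = math.max(1, (approxSizeInBytes / targetPartitionBytes).toInt)

val perTopicRows = (0 until k).map(j => (j, topicMatrix(j)))  // topicMatrix(j): Array[Double]
spark.createDataFrame(perTopicRows)
  .toDF("topic", "termWeights")
  .repartition(numPartitions)
  .write.parquet(dataPath)
{code}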






[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19247:
--
Summary: Improve ml word2vec save/load scalability  (was: improve ml 
word2vec save/load)

> Improve ml word2vec save/load scalability
> -
>
> Key: SPARK-19247
> URL: https://issues.apache.org/jira/browse/SPARK-19247
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Asher Krim
>
> ml word2vec models can be somewhat large (~4gb is not uncommon). The current 
> save implementation saves the model as a single large datum, which can cause 
> rpc issues and fail to save the model.
> On the loading side, there are issues with loading this large datum as well. 
> This was already solved for mllib word2vec in 
> https://issues.apache.org/jira/browse/SPARK-11994, but the change was never 
> ported to the ml word2vec implementation.






[jira] [Updated] (SPARK-19247) Improve ml word2vec save/load scalability

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19247:
--
Issue Type: Improvement  (was: Bug)

> Improve ml word2vec save/load scalability
> -
>
> Key: SPARK-19247
> URL: https://issues.apache.org/jira/browse/SPARK-19247
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Asher Krim
>
> ml word2vec models can be somewhat large (~4gb is not uncommon). The current 
> save implementation saves the model as a single large datum, which can cause 
> rpc issues and fail to save the model.
> On the loading side, there are issues with loading this large datum as well. 
> This was already solved for mllib word2vec in 
> https://issues.apache.org/jira/browse/SPARK-11994, but the change was never 
> ported to the ml word2vec implementation.






[jira] [Updated] (SPARK-19294) improve LocalLDAModel save/load scaling for large models

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19294:
--
Summary: improve LocalLDAModel save/load scaling for large models  (was: 
improve ml LDA save/load)

> improve LocalLDAModel save/load scaling for large models
> 
>
> Key: SPARK-19294
> URL: https://issues.apache.org/jira/browse/SPARK-19294
> Project: Spark
>  Issue Type: Bug
>Reporter: Asher Krim
>
> The LDA model in ml has some of the same problems addressed by 
> https://issues.apache.org/jira/browse/SPARK-19247 for word2vec.
> An LDA model is on the order of `vocabSize` * `k`, which can easily reach 3 GB for 
> k=1000 and vocabSize=3m. It's currently saved as a single datum in 1 
> partition. 
> Instead, we should represent the matrix as a list, and use the logic from 
> https://issues.apache.org/jira/browse/SPARK-11994 to pick a reasonable number 
> of partitions.
> cc [~josephkb]






[jira] [Updated] (SPARK-19294) improve LocalLDAModel save/load scaling for large models

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19294:
--
Issue Type: Improvement  (was: Bug)

> improve LocalLDAModel save/load scaling for large models
> 
>
> Key: SPARK-19294
> URL: https://issues.apache.org/jira/browse/SPARK-19294
> Project: Spark
>  Issue Type: Improvement
>Reporter: Asher Krim
>
> The LDA model in ml has some of the same problems addressed by 
> https://issues.apache.org/jira/browse/SPARK-19247 for word2vec.
> An LDA model is on the order of `vocabSize` * `k`, which can easily reach 3 GB for 
> k=1000 and vocabSize=3m. It's currently saved as a single datum in 1 
> partition. 
> Instead, we should represent the matrix as a list, and use the logic from 
> https://issues.apache.org/jira/browse/SPARK-11994 to pick a reasonable number 
> of partitions.
> cc [~josephkb]






[jira] [Created] (SPARK-19389) Minor doc fixes, including Since tags in Python Params

2017-01-27 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19389:
-

 Summary: Minor doc fixes, including Since tags in Python Params
 Key: SPARK-19389
 URL: https://issues.apache.org/jira/browse/SPARK-19389
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


I spotted some doc issues, mainly in Python, when reviewing [SPARK-19336] which 
were not related to that PR.  This PR fixes them.






[jira] [Resolved] (SPARK-19336) LinearSVC Python API

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19336.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16694
[https://github.com/apache/spark/pull/16694]

> LinearSVC Python API
> 
>
> Key: SPARK-19336
> URL: https://issues.apache.org/jira/browse/SPARK-19336
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> Create a Python wrapper for spark.ml.classification.LinearSVC






[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest

2017-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843675#comment-15843675
 ] 

Apache Spark commented on SPARK-9478:
-

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/16722

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 






[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-01-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843670#comment-15843670
 ] 

Joseph K. Bradley commented on SPARK-19208:
---

Thanks for writing out your ideas.  Here are my thoughts about the API:

*Reference API: Double column stats*
When working with Double columns (not Vectors), one would expect to write 
things like {{myDataFrame.select(min("x"), max("x"))}} to select 2 stats, min 
and max.  Here, min and max are functions provided by Spark SQL which return 
columns.

*Analogy*
We should probably provide an analogous API.  Here's what I imagine:
{code}
import org.apache.spark.ml.stat.VectorSummary
val df: DataFrame = ...

val results: DataFrame = df.select(VectorSummary.min("features"), 
VectorSummary.mean("features"))
val weightedResults: DataFrame = df.select(VectorSummary.min("features"), 
VectorSummary.mean("features", "weight"))
// Both of these result DataFrames contain 2 Vector columns.
{code}

I.e., we provide vectorized versions of stats functions.

If you want to put everything into a single function, then we could also have 
VectorSummary have a function "summary" which returns a struct type with every 
stat available:
{code}
val results = df.select(VectorSummary.summary("features", "weights"))
// results DataFrame contains 1 struct column, which has a Vector field for 
every statistic we provide.
{code}

Note: I removed "online" from the name since the user does not need to know 
that it does online aggregation.
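
To make the struct-column idea concrete, a hypothetical usage sketch (the names below are only the proposed API; nothing here exists yet):

{code}
val weightedResults = df.select(VectorSummary.summary("features", "weights").as("summary"))
// The single struct column exposes one Vector field per statistic.
val stats = weightedResults.select("summary.mean", "summary.min", "summary.max").first()
{code}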

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.






[jira] [Updated] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-01-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19208:
--
Summary: MultivariateOnlineSummarizer performance optimization  (was: 
MultivariateOnlineSummarizer perfermence optimization)

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.






[jira] [Resolved] (SPARK-19365) Optimize RequestMessage serialization

2017-01-27 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19365.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

> Optimize RequestMessage serialization
> -
>
> Key: SPARK-19365
> URL: https://issues.apache.org/jira/browse/SPARK-19365
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> Right now Netty RPC serializes RequestMessage using Java serialization, and 
> the size of a single message (e.g., RequestMessage(..., "hello!")) is about 
> 1kb.
> This PR optimizes it by serializing RequestMessage manually, and reduces the 
> above message size to 100+ bytes.
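
For illustration, a rough sketch of the kind of hand-rolled encoding this refers to (not the actual patch; the field layout here is an assumption):

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Writing only the fields avoids the class descriptors and metadata that
// Java serialization adds, which is what shrinks a ~1kb message to ~100 bytes.
val buf = new ByteArrayOutputStream()
val out = new DataOutputStream(buf)
out.writeUTF("sender-host"); out.writeInt(7077)      // sender RpcAddress
out.writeUTF("receiver-host"); out.writeInt(7077)    // receiver RpcAddress
out.writeUTF("endpoint-name")                        // receiver endpoint
out.writeUTF("hello!")                               // message payload
out.flush()
val bytes = buf.toByteArray                          // compact wire format
{code}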






[jira] [Closed] (SPARK-19388) Reading an empty folder as parquet causes an Analysis Exception

2017-01-27 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza closed SPARK-19388.
---
Resolution: Fixed

> Reading an empty folder as parquet causes an Analysis Exception
> ---
>
> Key: SPARK-19388
> URL: https://issues.apache.org/jira/browse/SPARK-19388
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Franklyn Dsouza
>Priority: Minor
>
> Reading an empty folder as parquet used to return an empty dataframe up to 
> 2.0.
> Now this causes an analysis exception like so:
> {code}
> In [1]: df = sc.sql.read.parquet("empty_dir/")
> ---
> AnalysisException Traceback (most recent call last)
> > 1 df = sqlCtx.read.parquet("empty_dir/")
> spark/99f3dfa6151e312379a7381b7e65637df0429941/python/pyspark/sql/readwriter.pyc
>  in parquet(self, *paths)
> 272 [('name', 'string'), ('year', 'int'), ('month', 'int'), 
> ('day', 'int')]
> 273 """
> --> 274 return 
> self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
> 275
> 276 @ignore_unicode_prefix
> spark/99f3dfa6151e312379a7381b7e65637df0429941/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> spark/99f3dfa6151e312379a7381b7e65637df0429941/python/pyspark/sql/utils.pyc 
> in deco(*a, **kw)
>  67  
> e.java_exception.getStackTrace()))
>  68 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 69 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
>  70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
>  71 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
> AnalysisException: u'Unable to infer schema for Parquet. It must be specified 
> manually.;'
> {code}






[jira] [Created] (SPARK-19388) Reading an empty folder as parquet causes an Analysis Exception

2017-01-27 Thread Franklyn Dsouza (JIRA)
Franklyn Dsouza created SPARK-19388:
---

 Summary: Reading an empty folder as parquet causes an Analysis 
Exception
 Key: SPARK-19388
 URL: https://issues.apache.org/jira/browse/SPARK-19388
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Franklyn Dsouza
Priority: Minor


Reading an empty folder as parquet used to return an empty dataframe up to 2.0.

Now this causes an analysis exception like so:

{code}
In [1]: df = sc.sql.read.parquet("empty_dir/")
---
AnalysisException Traceback (most recent call last)
> 1 df = sqlCtx.read.parquet("empty_dir/")

spark/99f3dfa6151e312379a7381b7e65637df0429941/python/pyspark/sql/readwriter.pyc
 in parquet(self, *paths)
272 [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 
'int')]
273 """
--> 274 return self._df(self._jreader.parquet(_to_seq(self._spark._sc, 
paths)))
275
276 @ignore_unicode_prefix

spark/99f3dfa6151e312379a7381b7e65637df0429941/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

spark/99f3dfa6151e312379a7381b7e65637df0429941/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 67  
e.java_exception.getStackTrace()))
 68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
 70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
 71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: u'Unable to infer schema for Parquet. It must be specified 
manually.;'
{code}
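
As a possible workaround (a sketch only, following the error message's suggestion to specify the schema manually), supplying an explicit schema skips inference, so an empty directory should come back as an empty DataFrame; in Scala, with {{spark}} being the usual SparkSession:

{code}
import org.apache.spark.sql.types._

// Illustrative schema; use the columns you actually expect in the directory.
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("value", IntegerType)))

val df = spark.read.schema(schema).parquet("empty_dir/")   // expected: empty DataFrame, no exception
{code}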







[jira] [Commented] (SPARK-12240) FileNotFoundException: (Too many open files) when using multiple groupby on DataFrames

2017-01-27 Thread Shawn Lavelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843549#comment-15843549
 ] 

Shawn Lavelle commented on SPARK-12240:
---

Hi [~Wisely Chen] ,
  Do you have any guidance on how high to set this limit?
  What are folks to do who don't have permission to change the ulimit?

> FileNotFoundException: (Too many open files) when using multiple groupby on 
> DataFrames
> --
>
> Key: SPARK-12240
> URL: https://issues.apache.org/jira/browse/SPARK-12240
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
> Environment: Debian 3.2.68-1+deb7u6 x86_64 GNU/Linux
>Reporter: Shubhanshu Mishra
>  Labels: dataframe, grouping, pyspark
>
> Whenever I try to do multiple groupings using data frames, my job crashes with 
> a FileNotFoundException whose message is "too many open files".
> I can do these groupings using RDD easily but when I use the DataFrame 
> operation I see these issues. 
> The code I am running:
> ```
> df_t = df.filter(df['max_cum_rank'] == 
> 0).select(['col1','col2']).groupby('col1').agg(F.min('col2')).groupby('min(col2)').agg(F.countDistinct('col1')).toPandas()
> ```
> In [151]: df_t = df.filter(df['max_cum_rank'] == 
> 0).select(['col1','col2']).groupby('col1').agg(F.min('col2')).groupby('min(col2)').agg(F.countDistinct('col1')).toPandas()
> [Stage 27:=>(415 + 1) / 
> 416]15/12/09 06:36:36 ERROR DiskBlockObjectWriter: Uncaught exception while 
> reverting partial writes to file 
> /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/22/temp_shuffle_1abbf917-842c-41ef-b113-ed60ee22e675
> java.io.FileNotFoundException: 
> /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/22/temp_shuffle_1abbf917-842c-41ef-b113-ed60ee22e675
>  (Too many open files)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:221)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:160)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:174)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:104)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 15/12/09 06:36:36 ERROR DiskBlockObjectWriter: Uncaught exception while 
> reverting partial writes to file 
> /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/29/temp_shuffle_e35e6e28-fdbf-4775-a32d-d0f5fd882e9e
> java.io.FileNotFoundException: 
> /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/29/temp_shuffle_e35e6e28-fdbf-4775-a32d-d0f5fd882e9e
>  (Too many open files)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:221)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:160)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:174)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.stop(SortShuffleWriter.scala:104)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 15/12/09 06:36:36 ERROR DiskBlockObjectWriter: Uncaught exception while 
> reverting partial writes to file 
> /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/18/temp_shuffle_2d26adcb-e3bb-4a01-8998-7428ebe5544d
> java.io.FileNotFoundException: 
> /path/tmp/blockmgr-fde0f618-e443-4841-96c4-54c5e5b8fa0f/18/temp_shuffle_2d26adcb-e3bb-4a01-8998-7428ebe5544d
>  (Too many open files)
> at java.io.FileOutputStream.open(Native Method)
> 

[jira] [Commented] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-01-27 Thread Brendan Dwyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843493#comment-15843493
 ] 

Brendan Dwyer commented on SPARK-19387:
---

Should this be under [SPARK-15799]?

> CRAN tests do not run with SparkR source package
> 
>
> Key: SPARK-19387
> URL: https://issues.apache.org/jira/browse/SPARK-19387
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> It looks like sparkR.session() is not installing Spark - as a result, running 
> R CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to 
> CRAN.






[jira] [Resolved] (SPARK-19324) JVM stdout output is dropped in SparkR

2017-01-27 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19324.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Resolved by https://github.com/apache/spark/pull/16670

> JVM stdout output is dropped in SparkR
> --
>
> Key: SPARK-19324
> URL: https://issues.apache.org/jira/browse/SPARK-19324
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> Whenever there are stdout outputs from Spark in JVM (typically when calling 
> println()) they are dropped by SparkR.
> For example, explain() for Column
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L






[jira] [Assigned] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19387:


Assignee: Felix Cheung  (was: Apache Spark)

> CRAN tests do not run with SparkR source package
> 
>
> Key: SPARK-19387
> URL: https://issues.apache.org/jira/browse/SPARK-19387
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> It looks like sparkR.session() is not installing Spark - as a result, running 
> R CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to 
> CRAN.






[jira] [Commented] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843347#comment-15843347
 ] 

Apache Spark commented on SPARK-19387:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16720

> CRAN tests do not run with SparkR source package
> 
>
> Key: SPARK-19387
> URL: https://issues.apache.org/jira/browse/SPARK-19387
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> It looks like sparkR.session() is not installing Spark - as a result, running 
> R CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to 
> CRAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19387:


Assignee: Apache Spark  (was: Felix Cheung)

> CRAN tests do not run with SparkR source package
> 
>
> Key: SPARK-19387
> URL: https://issues.apache.org/jira/browse/SPARK-19387
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>
> It looks like sparkR.session() is not installing Spark - as a result, running 
> R CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to 
> CRAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19380) YARN - Dynamic allocation should use configured number of executors as max number of executors

2017-01-27 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843340#comment-15843340
 ] 

Min Shen commented on SPARK-19380:
--

[~srowen],

What we want is to also be able to cap the number of executors when the user 
explicitly specifies the number of executors and dynamic allocation is enabled.
Instead of having the number of executors grow and shrink between 
minExecutors and maxExecutors, we want to restrict it to the range between 
minExecutors and the number of executors requested by the user.

We see a few benefits with this approach (see the configuration sketch after this list):
* When we start enabling dynamic allocation with this additional constraint, 
the Spark applications' resource demands do not increase significantly all of a 
sudden. If maxExecutors is set to 900 while most Spark applications set 
num-executors to 200-300, the default behavior could cause these Spark 
applications to request even more executors, increasing the resource contention 
in the cluster, especially if there are long-running stages in these Spark 
applications.
* Certain users expect their Spark application to be launched with a 
given number of executors, because they want to control how much data they 
cache on each executor, etc. The default behavior starts these users' Spark 
applications with the requested num-executors and requests more when tasks 
are backed up. This changes the users' expectations when we enable dynamic 
allocation.
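
To make the comparison concrete, here is a minimal configuration sketch of the 
two behaviors (the property names are standard Spark settings; the numbers are 
illustrative only):

{code}
import org.apache.spark.SparkConf

// Current behavior: with dynamic allocation enabled, the application can keep
// growing toward spark.dynamicAllocation.maxExecutors (900 here), even though
// the user explicitly asked for 300 executors (which only seeds the initial size).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "10")
  .set("spark.dynamicAllocation.maxExecutors", "900")  // cluster-wide ceiling
  .set("spark.executor.instances", "300")              // user's explicit request

// Proposed behavior: treat the user's explicit request (300) as the effective
// upper bound, so the executor count varies between minExecutors and 300.
{code}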

> YARN - Dynamic allocation should use configured number of executors as max 
> number of executors
> --
>
> Key: SPARK-19380
> URL: https://issues.apache.org/jira/browse/SPARK-19380
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.3
>Reporter: Zhe Zhang
>
>  SPARK-13723 only uses the user's number of executors as the initial number of 
> executors when dynamic allocation is turned on.
> If the configured max number of executors is larger than the number of 
> executors requested by the user, the user's application could continue to 
> request more executors, up to the configured max, if tasks are backed up. 
> This behavior is not very friendly to the cluster if we allow 
> every Spark application to reach the max number of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19387:
-
Description: 
It looks like sparkR.session() is not installing Spark - as a result, running R 
CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to CRAN.


> CRAN tests do not run with SparkR source package
> 
>
> Key: SPARK-19387
> URL: https://issues.apache.org/jira/browse/SPARK-19387
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> It looks like sparkR.session() is not installing Spark - as a result, running 
> R CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to 
> CRAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-01-27 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19387:


 Summary: CRAN tests do not run with SparkR source package
 Key: SPARK-19387
 URL: https://issues.apache.org/jira/browse/SPARK-19387
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung
Assignee: Felix Cheung






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19386) Bisecting k-means in SparkR documentation

2017-01-27 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19386:


 Summary: Bisecting k-means in SparkR documentation
 Key: SPARK-19386
 URL: https://issues.apache.org/jira/browse/SPARK-19386
 Project: Spark
  Issue Type: Bug
  Components: ML, SparkR
Affects Versions: 2.2.0
Reporter: Felix Cheung
Assignee: Miao Wang


We need updates to the programming guide, examples, and vignettes.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19333) Files out of compliance with ASF policy

2017-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19333:
-
Affects Version/s: 2.0.0
   2.1.0

> Files out of compliance with ASF policy
> ---
>
> Key: SPARK-19333
> URL: https://issues.apache.org/jira/browse/SPARK-19333
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: John D. Ament
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> ASF policy is that source files include our headers: 
> http://www.apache.org/legal/release-policy.html#license-headers
> However, there are a few files in Spark's release that are missing headers 
> (this list is not exhaustive):
> https://github.com/apache/spark/blob/master/R/pkg/DESCRIPTION
> https://github.com/apache/spark/blob/master/R/pkg/NAMESPACE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19333) Files out of compliance with ASF policy

2017-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19333:
-
Affects Version/s: 2.0.1
   2.0.2

> Files out of compliance with ASF policy
> ---
>
> Key: SPARK-19333
> URL: https://issues.apache.org/jira/browse/SPARK-19333
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: John D. Ament
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> ASF policy is that source files include our headers: 
> http://www.apache.org/legal/release-policy.html#license-headers
> However, there are a few files in Spark's release that are missing headers 
> (this list is not exhaustive):
> https://github.com/apache/spark/blob/master/R/pkg/DESCRIPTION
> https://github.com/apache/spark/blob/master/R/pkg/NAMESPACE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19333) Files out of compliance with ASF policy

2017-01-27 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19333.
--
  Resolution: Fixed
   Fix Version/s: 2.2.0
  2.1.1
  2.0.3
Target Version/s: 2.0.3, 2.1.1, 2.2.0

> Files out of compliance with ASF policy
> ---
>
> Key: SPARK-19333
> URL: https://issues.apache.org/jira/browse/SPARK-19333
> Project: Spark
>  Issue Type: Improvement
>Reporter: John D. Ament
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> ASF policy is that source files include our headers: 
> http://www.apache.org/legal/release-policy.html#license-headers
> However, there are a few files in Spark's release that are missing headers 
> (this list is not exhaustive):
> https://github.com/apache/spark/blob/master/R/pkg/DESCRIPTION
> https://github.com/apache/spark/blob/master/R/pkg/NAMESPACE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19220) SSL redirect handler only redirects the server's root

2017-01-27 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-19220:
---
Fix Version/s: 2.0.3

> SSL redirect handler only redirects the server's root
> -
>
> Key: SPARK-19220
> URL: https://issues.apache.org/jira/browse/SPARK-19220
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> The redirect handler that is started on the HTTP port when SSL is enabled 
> only redirects the root of the server. Requests served by additional handlers 
> do not go through the redirect, so if you follow a deep link to the non-HTTPS 
> server, you won't be redirected to the HTTPS port.
> I tested this with the history server, but it should be the same for the 
> normal UI; the fix should be the same for both too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19369) SparkConf not getting properly initialized in PySpark 2.1.0

2017-01-27 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15843199#comment-15843199
 ] 

Marcelo Vanzin commented on SPARK-19369:


Don't really know; but it doesn't take too long, generally, after a .0 release. 
After Spark Summit would be my guess.

> SparkConf not getting properly initialized in PySpark 2.1.0
> ---
>
> Key: SPARK-19369
> URL: https://issues.apache.org/jira/browse/SPARK-19369
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Windows/Linux
>Reporter: Sidney Feiner
>  Labels: configurations, context, pyspark
>
> Trying to migrate from Spark 1.6 to 2.1, I've stumbled upon a small problem - 
> my SparkContext doesn't get its configurations from the SparkConf object. 
> Before passing it on to the SparkContext constructor, I've made sure my 
> configurations are set.
> I've done some digging and this is what I've found:
> When I initialize the SparkContext, the following code is executed:
> def _do_init(self, master, appName, sparkHome, pyFiles, environment, 
> batchSize, serializer,
>  conf, jsc, profiler_cls):
> self.environment = environment or {}
> if conf is not None and conf._jconf is not None:
>self._conf = conf
> else:
> self._conf = SparkConf(_jvm=SparkContext._jvm)
> So I can see that the only way that my SparkConf will be used is if it also 
> has a _jvm object.
> I've used spark-submit to submit my job and printed the _jvm object but it is 
> null, which explains why my SparkConf object is ignored.
> I've tried running exactly the same on Spark 2.0.1 and it worked! My 
> SparkConf object had a valid _jvm object.
> Am I doing something wrong or is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19385) During canonicalization, `NOT(l, r)` should not expect such cases that l.hashcode > r.hashcode

2017-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842993#comment-15842993
 ] 

Apache Spark commented on SPARK-19385:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16719

> During canonicalization, `NOT(l, r)` should not expect such cases that 
> l.hashcode > r.hashcode
> --
>
> Key: SPARK-19385
> URL: https://issues.apache.org/jira/browse/SPARK-19385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Liwei Lin
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19385) During canonicalization, `NOT(l, r)` should not expect such cases that l.hashcode > r.hashcode

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19385:


Assignee: (was: Apache Spark)

> During canonicalization, `NOT(l, r)` should not expect such cases that 
> l.hashcode > r.hashcode
> --
>
> Key: SPARK-19385
> URL: https://issues.apache.org/jira/browse/SPARK-19385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Liwei Lin
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19385) During canonicalization, `NOT(l, r)` should not expect such cases that l.hashcode > r.hashcode

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19385:


Assignee: Apache Spark

> During canonicalization, `NOT(l, r)` should not expect such cases that 
> l.hashcode > r.hashcode
> --
>
> Key: SPARK-19385
> URL: https://issues.apache.org/jira/browse/SPARK-19385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Liwei Lin
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19385) During canonicalization, `NOT(l, r)` should not expect such cases that l.hashcode > r.hashcode

2017-01-27 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-19385:
-

 Summary: During canonicalization, `NOT(l, r)` should not expect 
such cases that l.hashcode > r.hashcode
 Key: SPARK-19385
 URL: https://issues.apache.org/jira/browse/SPARK-19385
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: Liwei Lin
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19383) Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option

2017-01-27 Thread Brent Dorsey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842893#comment-15842893
 ] 

Brent Dorsey commented on SPARK-19383:
--

I agree it's not Hive syntax. It's a powerful Cassandra CQL option, and I love 
Spark, so I was hoping you'd be interested in expanding your support of 
Cassandra with Spark SQL.

 

 


[jira] [Resolved] (SPARK-19087) Numpy types fail to be casted to any other types

2017-01-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19087.
---
Resolution: Duplicate

> Numpy types fail to be casted to any other types
> 
>
> Key: SPARK-19087
> URL: https://issues.apache.org/jira/browse/SPARK-19087
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Red Hat 6.6, Spark 1.6.0, Python 2.7.10 (anaconda)
>Reporter: Ivan SPM
>Priority: Minor
>  Labels: numpy, type-converter
>
> A UDF cannot return a numpy type; it has to be one of the basic Python 
> types. The error when a numpy type is returned is:
> TypeError: ufunc 'nan' not supported for the input types, and the inputs 
> could not be safely coerced to any supported types according to the casting 
> rule ''safe''



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19369) SparkConf not getting properly initialized in PySpark 2.1.0

2017-01-27 Thread Sidney Feiner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842726#comment-15842726
 ] 

Sidney Feiner commented on SPARK-19369:
---

Ok, thanks :) When is 2.1.1 estimated to be released?

> SparkConf not getting properly initialized in PySpark 2.1.0
> ---
>
> Key: SPARK-19369
> URL: https://issues.apache.org/jira/browse/SPARK-19369
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Windows/Linux
>Reporter: Sidney Feiner
>  Labels: configurations, context, pyspark
>
> Trying to migrate from Spark 1.6 to 2.1, I've stumbled upon a small problem - 
> my SparkContext doesn't get its configurations from the SparkConf object. 
> Before passing it on to the SparkContext constructor, I've made sure my 
> configurations are set.
> I've done some digging and this is what I've found:
> When I initialize the SparkContext, the following code is executed:
> def _do_init(self, master, appName, sparkHome, pyFiles, environment, 
> batchSize, serializer,
>  conf, jsc, profiler_cls):
> self.environment = environment or {}
> if conf is not None and conf._jconf is not None:
>self._conf = conf
> else:
> self._conf = SparkConf(_jvm=SparkContext._jvm)
> So I can see that the only way that my SparkConf will be used is if it also 
> has a _jvm object.
> I've used spark-submit to submit my job and printed the _jvm object but it is 
> null, which explains why my SparkConf object is ignored.
> I've tried running exactly the same on Spark 2.0.1 and it worked! My 
> SparkConf object had a valid _jvm object.
> Am I doing something wrong or is this a bug?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2017-01-27 Thread Gregor Moehler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842701#comment-15842701
 ] 

Gregor Moehler commented on SPARK-7768:
---

+1 for UdtRegistration. Can we make it public for the next Spark release?

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?

2017-01-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842630#comment-15842630
 ] 

Hyukjin Kwon commented on SPARK-12606:
--

Do you mind if I ask you to make the title and contents more detailed and 
clear, as a descriptive JIRA?

I thought this was a question when I first read it. I hope the JIRA focuses 
only on the issue, without irrelevant information (it seems there has been 
no interest in this issue for more than a year).

> Scala/Java compatibility issue Re: how to extend java transformer from Scala 
> UnaryTransformer ?
> ---
>
> Key: SPARK-12606
> URL: https://issues.apache.org/jira/browse/SPARK-12606
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2
> Environment: Java 8, Mac OS, Spark-1.5.2
>Reporter: Andrew Davidson
>  Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue. I can also 
> reproduce it, so could you file a JIRA to track this issue?
> Yanbo
> 2016-01-02 3:38 GMT+08:00 Andy Davidson :
> I am trying to write a trivial transformer I use in my pipeline. I am 
> using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala 
> class as an example. This should be very easy; however, I do not understand 
> Scala, and I am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol 
> does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>   at scala.Predef$.require(Predef.scala:233)
>   at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>   at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>   at 
> org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>   at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
> public class StemmerTest extends AbstractSparkTest {
> @Test
> public void test() {
> Stemmer stemmer = new Stemmer()
> .setInputCol("raw") //line 30
> .setOutputCol("filtered");
> }
> }
> /**
>  * @ see 
> spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @ see 
> https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @ see 
> http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  * 
>  * @author andrewdavidson
>  *
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, Stemmer> implements Serializable {
> static Logger logger = LoggerFactory.getLogger(Stemmer.class);
> private static final long serialVersionUID = 1L;
> private static final  ArrayType inputType = 
> DataTypes.createArrayType(DataTypes.StringType, true);
> private final String uid = Stemmer.class.getSimpleName() + "_" + 
> UUID.randomUUID().toString();
> @Override
> public String uid() {
> return uid;
> }
> /*
>override protected def validateInputType(inputType: DataType): Unit = {
> require(inputType == StringType, s"Input type must be string type but got 
> $inputType.")
>   }
>  */
> @Override
> public void validateInputType(DataType inputTypeArg) {
> String msg = "inputType must be " + inputType.simpleString() + " but 
> got " + inputTypeArg.simpleString();
> assert (inputType.equals(inputTypeArg)) : msg; 
> }
> 
> @Override
> public Function1<List<String>, List<String>> createTransformFunc() {
> // 
> http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
> Function1<List<String>, List<String>> f = new 
> AbstractFunction1<List<String>, List<String>>() {
> public List<String> apply(List<String> words) {
> for(String word : words) {
> logger.error("AEDWIP input word: {}", word);
> }
> return words;
> }
> };
> 
> return f;
> }
> @Override
> public DataType outputDataType() {
> return DataTypes.createArrayType(DataTypes.StringType, true);
> }
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19384) forget unpersist input dataset in IsotonicRegression

2017-01-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842629#comment-15842629
 ] 

Apache Spark commented on SPARK-19384:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/16718

> forget unpersist input dataset in IsotonicRegression
> 
>
> Key: SPARK-19384
> URL: https://issues.apache.org/jira/browse/SPARK-19384
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Trivial
>
> forget unpersist input dataset in IsotonicRegression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19384) forget unpersist input dataset in IsotonicRegression

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19384:


Assignee: Apache Spark

> forget unpersist input dataset in IsotonicRegression
> 
>
> Key: SPARK-19384
> URL: https://issues.apache.org/jira/browse/SPARK-19384
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Trivial
>
> forget unpersist input dataset in IsotonicRegression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19384) forget unpersist input dataset in IsotonicRegression

2017-01-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19384:


Assignee: (was: Apache Spark)

> forget unpersist input dataset in IsotonicRegression
> 
>
> Key: SPARK-19384
> URL: https://issues.apache.org/jira/browse/SPARK-19384
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Trivial
>
> forget unpersist input dataset in IsotonicRegression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19384) forget unpersist input dataset in IsotonicRegression

2017-01-27 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-19384:


 Summary: forget unpersist input dataset in IsotonicRegression
 Key: SPARK-19384
 URL: https://issues.apache.org/jira/browse/SPARK-19384
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng
Priority: Trivial


forget unpersist input dataset in IsotonicRegression
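
A hedged sketch of the usual Spark ML caching pattern this issue asks to 
complete; the helper name and signature below are illustrative, not the actual 
IsotonicRegression code:

{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Persist the input only if the caller has not already done so,
// and always release it once training finishes.
def withCachedInput[T, R](dataset: Dataset[T])(train: Dataset[T] => R): R = {
  val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
  if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
  try train(dataset)
  finally if (handlePersistence) dataset.unpersist()  // the unpersist this JIRA adds
}
{code}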



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19087) Numpy types fail to be casted to any other types

2017-01-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842619#comment-15842619
 ] 

Hyukjin Kwon commented on SPARK-19087:
--

Is this a duplicate of SPARK-12157?

> Numpy types fail to be casted to any other types
> 
>
> Key: SPARK-19087
> URL: https://issues.apache.org/jira/browse/SPARK-19087
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Red Hat 6.6, Spark 1.6.0, Python 2.7.10 (anaconda)
>Reporter: Ivan SPM
>Priority: Minor
>  Labels: numpy, type-converter
>
> A UDF cannot return a numpy type; it has to be one of the basic Python 
> types. The error when a numpy type is returned is:
> TypeError: ufunc 'nan' not supported for the input types, and the inputs 
> could not be safely coerced to any supported types according to the casting 
> rule ''safe''



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10908) ClassCastException in HadoopRDD.getJobConf

2017-01-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842610#comment-15842610
 ] 

Hyukjin Kwon commented on SPARK-10908:
--

Would it be possible for you to provide a self-contained reproducer? I am 
willing to help verify this.

> ClassCastException in HadoopRDD.getJobConf
> --
>
> Key: SPARK-10908
> URL: https://issues.apache.org/jira/browse/SPARK-10908
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Franco
>
> Whilst running a Spark SQL job (I can't provide an explain plan as many of 
> these are happening concurrently) the following exception is thrown:
> java.lang.ClassCastException: [B cannot be cast to 
> org.apache.spark.util.SerializableConfiguration
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:82)
> at 
> org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:78)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()

2017-01-27 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842581#comment-15842581
 ] 

Ohad Raviv commented on SPARK-19368:


Well, not with the same elegant code. The main problem is that SparseVector is 
very inefficient to manipulate. From Breeze's site:
{quote}
You should not be adding lots of values to a SparseVector if you want good 
speed. SparseVectors have to maintain the invariant that the index array is 
always sorted, which makes insertions expensive.
{quote}
They then suggest using VectorBuilder instead, but that is only good 
for SparseVector; with DenseVector the current implementation is better.
So if you want, I can just create two different functions for the Sparse/Dense 
cases (see the sketch below).
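
A minimal caller-side sketch of that idea, assuming a hypothetical density 
threshold (this is not the Spark implementation, just an illustration of 
routing very sparse matrices through the CoordinateMatrix path):

{code}
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, IndexedRowMatrix}

// Hypothetical helper: choose the conversion path based on how sparse the matrix is.
def toIndexedRowMatrixAdaptive(mat: BlockMatrix, densityThreshold: Double = 0.01): IndexedRowMatrix = {
  val nnz = mat.blocks.map { case (_, block) => block.numNonzeros.toDouble }.sum()
  val density = nnz / (mat.numRows() * mat.numCols())
  if (density < densityThreshold)
    mat.toCoordinateMatrix().toIndexedRowMatrix()  // cheap for very sparse data (the 1.6-style path)
  else
    mat.toIndexedRowMatrix()                       // Breeze-based path, faster for dense data
}
{code}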

> Very bad performance in BlockMatrix.toIndexedRowMatrix()
> 
>
> Key: SPARK-19368
> URL: https://issues.apache.org/jira/browse/SPARK-19368
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Ohad Raviv
>Priority: Minor
> Attachments: profiler snapshot.png
>
>
> In SPARK-12869, this function was optimized for the case of dense matrices 
> using Breeze. However, I have a case with very very sparse matrices which 
> suffers a great deal from this optimization. A process we have that took 
> about 20 mins now takes about 6.5 hours.
> Here is a sample code to see the difference:
> {quote}
> val n = 4
> val density = 0.0002
> val rnd = new Random(123)
> val rndEntryList = (for (i <- 0 until (n*n*density).toInt) yield 
> (rnd.nextInt\(n\), rnd.nextInt\(n\), rnd.nextDouble()))
>   .groupBy(t => (t._1,t._2)).map\(t => t._2.last).map\{ case 
> (i,j,d) => (i,(j,d)) }.toSeq
> val entries: RDD\[(Int, (Int, Double))] = sc.parallelize(rndEntryList, 10)
> val indexedRows = entries.groupByKey().map(e => IndexedRow(e._1, 
> Vectors.sparse(n, e._2.toSeq)))
> val mat = new IndexedRowMatrix(indexedRows, nRows = n, nCols = n)
> val t1 = System.nanoTime()
> 
> println(mat.toBlockMatrix(1,1).toCoordinateMatrix().toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t2 = System.nanoTime()
> println("took: " + (t2 - t1) / 1000 / 1000 + " ms")
> println("")
> 
> println(mat.toBlockMatrix(1,1).toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t3 = System.nanoTime()
> println("took: " + (t3 - t2) / 1000 / 1000 + " ms")
> println("")
> {quote}
> I get:
> {quote}
> took: 9404 ms
> 
> took: 57350 ms
> 
> {quote}
> Looking at it a little with a profiler, I see that the problem is with the 
> SliceVector.update() and SparseVector.apply.
> I currently work around this by doing:
> {quote}
> blockMatrix.toCoordinateMatrix().toIndexedRowMatrix()
> {quote}
> like it was in version 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19383) Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option

2017-01-27 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842524#comment-15842524
 ] 

Herman van Hovell commented on SPARK-19383:
---

This is definitely not Hive/Spark syntax and not something we are going to support. 
Also see [~rspitzer]'s answer on your Stack Overflow question 
(http://stackoverflow.com/questions/41887041/spark-cassandra-connector-per-partition-limit).
 Closing as a won't fix.
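
For reference, the "most recent row per partition key" goal from the 
description can be expressed in supported Spark SQL with a window function. A 
hedged sketch (item_uuid comes from the report; the table name and the 
updated_at ordering column are placeholder names, not the reporter's schema):

{code}
// Spark 2.x Scala sketch; "my_table" and "updated_at" are illustrative names.
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
val latestPerPartition = spark.sql(
  """SELECT * FROM (
    |  SELECT t.*, row_number() OVER (PARTITION BY item_uuid ORDER BY updated_at DESC) AS rn
    |  FROM my_table t
    |) ranked
    |WHERE rn = 1""".stripMargin)
{code}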

> Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option 
> 
>
> Key: SPARK-19383
> URL: https://issues.apache.org/jira/browse/SPARK-19383
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: PER PARTITION LIMIT Error documented in github and 
> reproducible by cloning: 
> [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
> Java 1.8
> Cassandra Version
> [cqlsh 5.0.1 | Cassandra 3.9.0 | CQL spec 3.4.2 | Native protocol v4]
> {code:title=POM.xml|borderStyle=solid}
> <dependency>
>     <groupId>com.datastax.spark</groupId>
>     <artifactId>spark-cassandra-connector_2.10</artifactId>
>     <version>2.0.0-M3</version>
> </dependency>
> <dependency>
>     <groupId>com.datastax.cassandra</groupId>
>     <artifactId>cassandra-driver-mapping</artifactId>
>     <version>3.1.2</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-common</artifactId>
>     <version>2.72</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-catalyst_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> {code}
>Reporter: Brent Dorsey
>Priority: Minor
>  Labels: Cassandra
>
> Attempting to use version 2.0.0-M3 of the datastax/spark-cassandra-connector 
> to select the most recent version of each partition key using the Cassandra 
> 3.6 and later PER PARTITION LIMIT option fails. I've tried using all the 
> Cassandra Java RDD's and Spark Sql with and without partition key equality 
> constraints. All attempts have failed due to syntax errors and/or start/end 
> bound restriction errors.
> The 
> [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
>  repo contains working code that demonstrates the error. Clone the repo, 
> create the keyspace and table locally and supply connection information then 
> run main.
> Spark Dataset .where & Spark Sql Errors:
> {code:title=errors|borderStyle=solid}
> ERROR [2017-01-27 06:35:19,919] (main) 
> org.per.partition.limit.test.spark.job.Main: 
> getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan failed.
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'PARTITION' expecting (line 1, pos 67)
> == SQL ==
> TOKEN(item_uuid) > TOKEN(6616b548-4fd1-4661-a938-0af3c77357f7) PER PARTITION 
> LIMIT 1
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:43)
>   at org.apache.spark.sql.Dataset.where(Dataset.scala:1153)
>   at 
> org.per.partition.limit.test.spark.job.Main.getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan(Main.java:349)
>   at org.per.partition.limit.test.spark.job.Main.run(Main.java:128)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>   at org.per.partition.limit.test.spark.job.Main.main(Main.java:72)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> ERROR [2017-01-27 06:35:20,238] (main) 
> org.per.partition.limit.test.spark.job.Main: 
> getSparkSqlDatasetPerPartitionLimitTest failed.
> org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input ''' expecting {'(', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 

[jira] [Comment Edited] (SPARK-19383) Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option

2017-01-27 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842524#comment-15842524
 ] 

Herman van Hovell edited comment on SPARK-19383 at 1/27/17 10:39 AM:
-

This is definitely not Hive/Spark syntax and not something we are going to support. 
Also see [~rspitzer]'s answer on your Stack Overflow question 
(http://stackoverflow.com/questions/41887041/spark-cassandra-connector-per-partition-limit).
 Closing as not a problem.


was (Author: hvanhovell):
This is definitely not Hive/Spark syntax and not something we are going to support. 
Also see [~rspitzer]'s answer on your Stack Overflow question 
(http://stackoverflow.com/questions/41887041/spark-cassandra-connector-per-partition-limit).
 Closing as a won't fix.

> Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option 
> 
>
> Key: SPARK-19383
> URL: https://issues.apache.org/jira/browse/SPARK-19383
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: PER PARTITION LIMIT Error documented in github and 
> reproducible by cloning: 
> [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
> Java 1.8
> Cassandra Version
> [cqlsh 5.0.1 | Cassandra 3.9.0 | CQL spec 3.4.2 | Native protocol v4]
> {code:title=POM.xml|borderStyle=solid}
> <dependency>
>     <groupId>com.datastax.spark</groupId>
>     <artifactId>spark-cassandra-connector_2.10</artifactId>
>     <version>2.0.0-M3</version>
> </dependency>
> <dependency>
>     <groupId>com.datastax.cassandra</groupId>
>     <artifactId>cassandra-driver-mapping</artifactId>
>     <version>3.1.2</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-common</artifactId>
>     <version>2.72</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-catalyst_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> {code}
>Reporter: Brent Dorsey
>Priority: Minor
>  Labels: Cassandra
>
> Attempting to use version 2.0.0-M3 of the datastax/spark-cassandra-connector 
> to select the most recent version of each partition key using the Cassandra 
> 3.6 and later PER PARTITION LIMIT option fails. I've tried using all the 
> Cassandra Java RDD's and Spark Sql with and without partition key equality 
> constraints. All attempts have failed due to syntax errors and/or start/end 
> bound restriction errors.
> The 
> [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
>  repo contains working code that demonstrates the error. Clone the repo, 
> create the keyspace and table locally and supply connection information then 
> run main.
> Spark Dataset .where & Spark Sql Errors:
> {code:title=errors|borderStyle=solid}
> ERROR [2017-01-27 06:35:19,919] (main) 
> org.per.partition.limit.test.spark.job.Main: 
> getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan failed.
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'PARTITION' expecting (line 1, pos 67)
> == SQL ==
> TOKEN(item_uuid) > TOKEN(6616b548-4fd1-4661-a938-0af3c77357f7) PER PARTITION 
> LIMIT 1
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:43)
>   at org.apache.spark.sql.Dataset.where(Dataset.scala:1153)
>   at 
> org.per.partition.limit.test.spark.job.Main.getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan(Main.java:349)
>   at org.per.partition.limit.test.spark.job.Main.run(Main.java:128)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>   at org.per.partition.limit.test.spark.job.Main.main(Main.java:72)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> ERROR [2017-01-27 06:35:20,238] (main) 
> org.per.partition.limit.test.spark.job.Main: 
> 

[jira] [Closed] (SPARK-19383) Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option

2017-01-27 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-19383.
-
Resolution: Not A Problem

> Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option 
> 
>
> Key: SPARK-19383
> URL: https://issues.apache.org/jira/browse/SPARK-19383
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: PER PARTITION LIMIT Error documented in github and 
> reproducible by cloning: 
> [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
> Java 1.8
> Cassandra Version
> [cqlsh 5.0.1 | Cassandra 3.9.0 | CQL spec 3.4.2 | Native protocol v4]
> {code:title=POM.xml|borderStyle=solid}
> <dependency>
>     <groupId>com.datastax.spark</groupId>
>     <artifactId>spark-cassandra-connector_2.10</artifactId>
>     <version>2.0.0-M3</version>
> </dependency>
> <dependency>
>     <groupId>com.datastax.cassandra</groupId>
>     <artifactId>cassandra-driver-mapping</artifactId>
>     <version>3.1.2</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-common</artifactId>
>     <version>2.72</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-catalyst_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> {code}
>Reporter: Brent Dorsey
>Priority: Minor
>  Labels: Cassandra
>
> Attempting to use version 2.0.0-M3 of the datastax/spark-cassandra-connector 
> to select the most recent version of each partition key using the Cassandra 
> 3.6 and later PER PARTITION LIMIT option fails. I've tried using all the 
> Cassandra Java RDD's and Spark Sql with and without partition key equality 
> constraints. All attempts have failed due to syntax errors and/or start/end 
> bound restriction errors.
> The 
> [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
>  repo contains working code that demonstrates the error. Clone the repo, 
> create the keyspace and table locally and supply connection information then 
> run main.
> Spark Dataset .where & Spark Sql Errors:
> {code:title=errors|borderStyle=solid}
> ERROR [2017-01-27 06:35:19,919] (main) 
> org.per.partition.limit.test.spark.job.Main: 
> getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan failed.
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'PARTITION' expecting (line 1, pos 67)
> == SQL ==
> TOKEN(item_uuid) > TOKEN(6616b548-4fd1-4661-a938-0af3c77357f7) PER PARTITION 
> LIMIT 1
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:43)
>   at org.apache.spark.sql.Dataset.where(Dataset.scala:1153)
>   at 
> org.per.partition.limit.test.spark.job.Main.getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan(Main.java:349)
>   at org.per.partition.limit.test.spark.job.Main.run(Main.java:128)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>   at org.per.partition.limit.test.spark.job.Main.main(Main.java:72)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> ERROR [2017-01-27 06:35:20,238] (main) 
> org.per.partition.limit.test.spark.job.Main: 
> getSparkSqlDatasetPerPartitionLimitTest failed.
> org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input ''' expecting {'(', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 

[jira] [Updated] (SPARK-19379) SparkAppHandle.getState not registering FAILED state upon Spark app failure in Local mode

2017-01-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19379:
--
Priority: Minor  (was: Blocker)

(Don't set Blocker)

> SparkAppHandle.getState not registering FAILED state upon Spark app failure 
> in Local mode
> -
>
> Key: SPARK-19379
> URL: https://issues.apache.org/jira/browse/SPARK-19379
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Kramer
>Priority: Minor
>
> LocalSchedulerBackend does not handle calling back to the Launcher upon a 
> TaskState change. It does call setState to FINISHED upon 
> stop(), so apps that FAILED are reported as FINISHED in SparkAppHandle.State.
> It looks like a case statement is needed in the statusUpdate() method in 
> LocalSchedulerBackend to call stop(state) or launcherBackend.setState(state) 
> with the appropriate SparkAppHandle.State for the TaskStates FAILED, LAUNCHING, 
> and, possibly, FINISHED.
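
A hedged sketch of the mapping the reporter describes (illustrative only, not 
the actual patch; TaskState is Spark-internal, so this assumes code living 
inside the org.apache.spark package, and the surrounding statusUpdate plumbing 
is omitted):

{code}
import org.apache.spark.TaskState
import org.apache.spark.launcher.SparkAppHandle

// Map terminal task states onto launcher handle states so a FAILED app
// is no longer reported to SparkAppHandle as FINISHED.
def launcherStateFor(state: TaskState.Value): Option[SparkAppHandle.State] = state match {
  case TaskState.FAILED   => Some(SparkAppHandle.State.FAILED)
  case TaskState.KILLED   => Some(SparkAppHandle.State.KILLED)
  case TaskState.FINISHED => Some(SparkAppHandle.State.FINISHED)
  case _                  => None  // LAUNCHING/RUNNING: leave the handle state unchanged
}
{code}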



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19380) YARN - Dynamic allocation should use configured number of executors as max number of executors

2017-01-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19380:
--
Target Version/s:   (was: 1.6.4)

(Don't set target version)
This doesn't sound like a problem. You're saying the number of executors can 
grow to the max number. Of course it can.

> YARN - Dynamic allocation should use configured number of executors as max 
> number of executors
> --
>
> Key: SPARK-19380
> URL: https://issues.apache.org/jira/browse/SPARK-19380
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.3
>Reporter: Zhe Zhang
>
>  SPARK-13723 only uses the user's number of executors as the initial number of 
> executors when dynamic allocation is turned on.
> If the configured max number of executors is larger than the number of 
> executors requested by the user, the user's application could continue to 
> request more executors, up to the configured max, if tasks are backed up. 
> This behavior is not very friendly to the cluster if we allow 
> every Spark application to reach the max number of executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19383) Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option

2017-01-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842488#comment-15842488
 ] 

Sean Owen commented on SPARK-19383:
---

This sounds like unsupported syntax. I'm not even sure "PER PARTITION LIMIT" 
exists in Hive?

> Spark Sql Fails with Cassandra 3.6 and later PER PARTITION LIMIT option 
> 
>
> Key: SPARK-19383
> URL: https://issues.apache.org/jira/browse/SPARK-19383
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: PER PARTITION LIMIT Error documented in github and 
> reproducible by cloning: 
> [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
> Java 1.8
> Cassandra Version
> [cqlsh 5.0.1 | Cassandra 3.9.0 | CQL spec 3.4.2 | Native protocol v4]
> {code:title=POM.xml|borderStyle=solid}
> <dependency>
>     <groupId>com.datastax.spark</groupId>
>     <artifactId>spark-cassandra-connector_2.10</artifactId>
>     <version>2.0.0-M3</version>
> </dependency>
> <dependency>
>     <groupId>com.datastax.cassandra</groupId>
>     <artifactId>cassandra-driver-mapping</artifactId>
>     <version>3.1.2</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-common</artifactId>
>     <version>2.72</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-catalyst_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>2.0.2</version>
>     <scope>compile</scope>
> </dependency>
> {code}
>Reporter: Brent Dorsey
>Priority: Minor
>  Labels: Cassandra
>
> Attempting to use version 2.0.0-M3 of the datastax/spark-cassandra-connector 
> to select the most recent version of each partition key using the Cassandra 
> 3.6 and later PER PARTITION LIMIT option fails. I've tried all the Cassandra 
> Java RDDs and Spark SQL, with and without partition key equality constraints. 
> All attempts have failed due to syntax errors and/or start/end bound 
> restriction errors.
> The 
> [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
>  repo contains working code that demonstrates the error. Clone the repo, 
> create the keyspace and table locally, supply connection information, and 
> then run main.
> Spark Dataset .where & Spark SQL errors:
> {code:title=errors|borderStyle=solid}
> ERROR [2017-01-27 06:35:19,919] (main) 
> org.per.partition.limit.test.spark.job.Main: 
> getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan failed.
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'PARTITION' expecting (line 1, pos 67)
> == SQL ==
> TOKEN(item_uuid) > TOKEN(6616b548-4fd1-4661-a938-0af3c77357f7) PER PARTITION 
> LIMIT 1
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:43)
>   at org.apache.spark.sql.Dataset.where(Dataset.scala:1153)
>   at 
> org.per.partition.limit.test.spark.job.Main.getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan(Main.java:349)
>   at org.per.partition.limit.test.spark.job.Main.run(Main.java:128)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>   at org.per.partition.limit.test.spark.job.Main.main(Main.java:72)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> ERROR [2017-01-27 06:35:20,238] (main) 
> org.per.partition.limit.test.spark.job.Main: 
> getSparkSqlDatasetPerPartitionLimitTest failed.
> org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input ''' expecting {'(', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 
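The parser errors above come from Spark SQL rejecting the CQL-only PER PARTITION LIMIT syntax. One possible workaround sketch, not taken from the reporter's repo (host, keyspace, and table names below are placeholders), is to send the statement directly to Cassandra through the connector's native session:

{code:title=Workaround sketch (placeholder names)|borderStyle=solid}
import scala.collection.JavaConverters._

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

object PerPartitionLimitWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
    // Spark SQL's parser rejects PER PARTITION LIMIT, but Cassandra 3.6+ accepts
    // it as plain CQL, so run the statement through the connector's session.
    CassandraConnector(conf).withSessionDo { session =>
      val rows = session.execute(
        "SELECT * FROM test_keyspace.per_partition_limit_test PER PARTITION LIMIT 1")
      rows.all().asScala.foreach(println)
    }
  }
}
{code}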

[jira] [Updated] (SPARK-12970) Error in documentation on creating rows with schemas defined by structs

2017-01-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12970:
--
Assignee: Hyukjin Kwon

> Error in documentation on creating rows with schemas defined by structs
> ---
>
> Key: SPARK-12970
> URL: https://issues.apache.org/jira/browse/SPARK-12970
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Haidar Hadi
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: documentation
> Fix For: 2.2.0
>
>
> The provided example in this doc 
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/types/StructType.html
>  for creating a Row from a struct is wrong:
>  // Create a Row with the schema defined by struct
>  val row = Row(Row(1, 2, true))
>  // row: Row = {@link 1,2,true}
>  
> The above example does not create a Row object with a schema.
> This error is in the Scala docs too. 
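For reference, a sketch of what the docs presumably intend, assuming an existing SparkSession named spark: a Row by itself carries no schema; the schema is applied when the row is used to build a DataFrame.

{code:title=Sketch (assumes a SparkSession named spark)|borderStyle=solid}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{BooleanType, IntegerType, StructField, StructType}

val struct = StructType(
  StructField("a", IntegerType, nullable = false) ::
  StructField("b", IntegerType, nullable = false) ::
  StructField("c", BooleanType, nullable = false) :: Nil)

// Row(1, 2, true) holds the values only; Row(Row(1, 2, true)) would nest a struct.
val row = Row(1, 2, true)

// The schema is attached here, not on the Row itself.
val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(row)), struct)
df.printSchema()
{code}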



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12970) Error in documentation on creating rows with schemas defined by structs

2017-01-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12970.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16703
[https://github.com/apache/spark/pull/16703]

> Error in documentation on creating rows with schemas defined by structs
> ---
>
> Key: SPARK-12970
> URL: https://issues.apache.org/jira/browse/SPARK-12970
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Haidar Hadi
>Priority: Minor
>  Labels: documentation
> Fix For: 2.2.0
>
>
> The provided example in this doc 
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/types/StructType.html
>  for creating a Row from a struct is wrong:
>  // Create a Row with the schema defined by struct
>  val row = Row(Row(1, 2, true))
>  // row: Row = {@link 1,2,true}
>  
> The above example does not create a Row object with a schema.
> This error is in the Scala docs too. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19304) Kinesis checkpoint recovery is 10x slow

2017-01-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19304:
--
Issue Type: Improvement  (was: Bug)

Sounds reasonable. Please go ahead with a PR.

> Kinesis checkpoint recovery is 10x slow
> ---
>
> Key: SPARK-19304
> URL: https://issues.apache.org/jira/browse/SPARK-19304
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: using s3 for checkpoints using 1 executor, with 19g mem 
> & 3 cores per executor
>Reporter: Gaurav Shah
>  Labels: kinesis
>
> The application runs fine initially, running batches of 1 hour, and the 
> processing time is less than 30 minutes on average. Suppose the application 
> crashes for some reason and we try to restart from the checkpoint. The 
> processing now takes forever and does not move forward. We tried the same 
> thing at a batch interval of 1 minute: processing runs fine and takes about 
> 1.2 minutes per batch to finish, but when we recover from the checkpoint it 
> takes about 15 minutes for each batch. After the recovery the batches again 
> process at normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details: 
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-01-27 Thread Nils Grabbert (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nils Grabbert updated SPARK-19104:
--
Component/s: Optimizer

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) 
>   at