[jira] [Commented] (SPARK-18039) ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling

2016-10-20 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594254#comment-15594254
 ] 

Liwei Lin commented on SPARK-18039:
---

hi [~astralidea], if I understand correctly, configuring 
`spark.scheduler.minRegisteredResourcesRatio` may help.
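For reference, here is a minimal sketch of that suggestion, assuming the application is launched with a SparkConf you control; the values are illustrative, not recommendations. Raising the ratio makes the scheduler wait until the requested executors have registered (up to the configured waiting time) before any job, including the receiver-scheduling dummy job, is started.
{code}
import org.apache.spark.SparkConf

// Hedged sketch: wait for all requested executors, or at most 60s, before
// scheduling begins, so receivers can be spread across registered executors.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")
{code}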

> ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling
> -
>
> Key: SPARK-18039
> URL: https://issues.apache.org/jira/browse/SPARK-18039
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: astralidea
>Priority: Minor
>
> Receiver scheduling balance is important for me.
> For instance, if I have 2 executors and each executor has 1 receiver, the 
> calculation time is 0.1s per batch.
> But if I have 2 executors where one executor has 2 receivers and the other has 
> 0 receivers, the calculation time increases by 3s per batch.
> In my cluster executor initialization is slow and I need to wait about 30s, 
> but the dummy job only waits about 4s. I added the 
> spark.scheduler.maxRegisteredResourcesWaitingTime setting but it does not work.






[jira] [Updated] (SPARK-18040) Improve R handling or messaging of JVM exception

2016-10-20 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18040:
-
Description: 
Similar to SPARK-17838, there are a few cases where an exception can be thrown 
from the JVM side when an action is performed (head, count, collect).

For example, any planner error can happen, and can only happen, at that point. 
We need error handling for those cases so that the error is presented more 
clearly in R instead of as a long Java stacktrace.


> Improve R handling or messaging of JVM exception
> 
>
> Key: SPARK-18040
> URL: https://issues.apache.org/jira/browse/SPARK-18040
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Similar to SPARK-17838, there are a few cases where an exception can be 
> thrown from the JVM side when an action is performed (head, count, collect).
> For example, any planner error can happen, and can only happen, at that point. 
> We need error handling for those cases so that the error is presented more 
> clearly in R instead of as a long Java stacktrace.






[jira] [Updated] (SPARK-17254) Filter operator should have “stop if false” semantics for sorted data

2016-10-20 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17254:

Attachment: (was: stop-after-physical-plan.pdf)

> Filter operator should have “stop if false” semantics for sorted data
> -
>
> Key: SPARK-17254
> URL: https://issues.apache.org/jira/browse/SPARK-17254
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tejas Patil
> Attachments: stop-after-physical-plan.pdf
>
>
> From 
> https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:
> Filter on sorted data
> If the data is sorted by a key, filters on the key could stop as soon as the 
> data is out of range. For example, WHERE ticker_id < “F” should stop as soon 
> as the first row starting with “F” is seen. This can be done by adding a Filter 
> operator that has “stop if false” semantics. This is generally useful.
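As a minimal illustration of the “stop if false” idea (not the proposed physical operator), consider an iterator of rows already sorted by the filter key; the Row shape and field name below are hypothetical:
{code}
// Hypothetical row shape, for illustration only.
case class Row(tickerId: String, price: Double)

// For data sorted by tickerId, a predicate like ticker_id < "F" can stop the
// scan at the first out-of-range key instead of testing every remaining row.
def stopAfter(sortedRows: Iterator[Row]): Iterator[Row] =
  sortedRows.takeWhile(_.tickerId < "F")
{code}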






[jira] [Updated] (SPARK-17254) Filter operator should have “stop if false” semantics for sorted data

2016-10-20 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17254:

Attachment: stop-after-physical-plan.pdf

> Filter operator should have “stop if false” semantics for sorted data
> -
>
> Key: SPARK-17254
> URL: https://issues.apache.org/jira/browse/SPARK-17254
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tejas Patil
> Attachments: stop-after-physical-plan.pdf
>
>
> From 
> https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:
> Filter on sorted data
> If the data is sorted by a key, filters on the key could stop as soon as the 
> data is out of range. For example, WHERE ticker_id < “F” should stop as soon 
> as the first row starting with “F” is seen. This can be done by adding a Filter 
> operator that has “stop if false” semantics. This is generally useful.






[jira] [Created] (SPARK-18041) activedrivers section in http:sparkMasterurl/json is missing Main class information

2016-10-20 Thread sudheesh k s (JIRA)
sudheesh k s created SPARK-18041:


 Summary: activedrivers section in http:sparkMasterurl/json is 
missing Main class information
 Key: SPARK-18041
 URL: https://issues.apache.org/jira/browse/SPARK-18041
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.6.2
Reporter: sudheesh k s
Priority: Minor


http:sparkMaster_Url/json gives the status of running applications as well as 
drivers, but it is missing information such as the driver's main class.

To identify which driver is running, the driver's main class information is needed.
e.g. the current output:
  "activedrivers" : [ {
"id" : "driver-20161020173528-0032",
"starttime" : "1476965128734",
"state" : "RUNNING",
"cores" : 1,
"memory" : 1024
  } ],






[jira] [Created] (SPARK-18040) Improve R handling or messaging of JVM exception

2016-10-20 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18040:


 Summary: Improve R handling or messaging of JVM exception
 Key: SPARK-18040
 URL: https://issues.apache.org/jira/browse/SPARK-18040
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.2, 2.1.0
Reporter: Felix Cheung
Priority: Minor









[jira] [Commented] (SPARK-17275) Flaky test: org.apache.spark.deploy.RPackageUtilsSuite.jars that don't exist are skipped and print warning

2016-10-20 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594080#comment-15594080
 ] 

Felix Cheung commented on SPARK-17275:
--

is this still a problem?

> Flaky test: org.apache.spark.deploy.RPackageUtilsSuite.jars that don't exist 
> are skipped and print warning
> --
>
> Key: SPARK-17275
> URL: https://issues.apache.org/jira/browse/SPARK-17275
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Yin Huai
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1623/testReport/junit/org.apache.spark.deploy/RPackageUtilsSuite/jars_that_don_t_exist_are_skipped_and_print_warning/
> {code}
> Error Message
> java.io.IOException: Unable to delete directory 
> /home/jenkins/.ivy2/cache/a/mylib.
> Stacktrace
> sbt.ForkMain$ForkError: java.io.IOException: Unable to delete directory 
> /home/jenkins/.ivy2/cache/a/mylib.
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1541)
>   at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
>   at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>   at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
>   at 
> org.apache.spark.deploy.IvyTestUtils$.purgeLocalIvyCache(IvyTestUtils.scala:394)
>   at 
> org.apache.spark.deploy.IvyTestUtils$.withRepository(IvyTestUtils.scala:384)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply$mcV$sp(RPackageUtilsSuite.scala:103)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply(RPackageUtilsSuite.scala:100)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite$$anonfun$3.apply(RPackageUtilsSuite.scala:100)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(RPackageUtilsSuite.scala:38)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.deploy.RPackageUtilsSuite.runTest(RPackageUtilsSuite.scala:38)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
>   at 
> o

[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594046#comment-15594046
 ] 

Felix Cheung commented on SPARK-17916:
--

So here's what happens.

First, R's read.csv clearly documents that it treats an empty/blank string the 
same as NA under the following condition: "Blank fields are also considered to be 
missing values in logical, integer, numeric and complex fields."

Second, in this example in R, the 2nd column is turned into "logical" instead 
of "character" (i.e. string) as expected:
{code}
> d <- "col1,col2
+ 1,\"-\"
+ 2,\"\""
> df <- read.csv(text=d, quote="\"", na.strings=c("-"))
> df
  col1 col2
1    1   NA
2    2   NA
> str(df)
'data.frame':   2 obs. of  2 variables:
 $ col1: int  1 2
 $ col2: logi  NA NA
{code}

And that is why the blank string is turned into NA.

Whereas if the data.frame has a character/factor column instead, the blank field 
is retained as blank:
{code}
> d <- "col1,col2
+ 1,\"###\"
+ 2,\"\"
+ 3,\"this is a string\""
> df <- read.csv(text=d, quote="\"", na.strings=c("###"))
> df
  col1             col2
1    1             <NA>
2    2                 
3    3 this is a string
> str(df)
'data.frame':   3 obs. of  2 variables:
 $ col1: int  1 2 3
 $ col2: Factor w/ 2 levels "","this is a string": NA 1 2
{code}

IMO this behavior makes sense.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.
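For reference, a minimal Scala sketch of the reported reproduction, assuming the CSV file contains the header and the two data rows shown above (the path is hypothetical); per the report, both the "-" value and the empty string come back as null:
{code}
// Hedged repro sketch against the Spark 2.0.x CSV data source.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")
  .load("/path/to/data.csv")

df.show()   // the report says col2 is null in both rows
{code}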






[jira] [Resolved] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation

2016-10-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18029.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15569
[https://github.com/apache/spark/pull/15569]

> PruneFileSourcePartitions should not change the output of LogicalRelation
> -
>
> Key: SPARK-18029
> URL: https://issues.apache.org/jira/browse/SPARK-18029
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>







[jira] [Comment Edited] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593935#comment-15593935
 ] 

Maciej Bryński edited comment on SPARK-18022 at 10/21/16 4:24 AM:
--

I think the problem is in this PR.
https://github.com/apache/spark/commit/811a2cef03647c5be29fef522c423921c79b1bc3

CC: [~davies]


was (Author: maver1ck):
I think the problem is in this PR.
https://github.com/apache/spark/commit/811a2cef03647c5be29fef522c423921c79b1bc3

> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a DataFrame to MySQL, I'm unable to 
> see it.
> Instead, I'm getting the following stacktrace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key, etc.
> This makes it very difficult to debug apps.
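As background on the stacktrace above: java.lang.Throwable.addSuppressed throws exactly this NullPointerException ("Cannot suppress a null exception.") when it is passed a null argument, and that NPE then replaces the original failure. The following is only a minimal illustration of the masking effect, not the actual JdbcUtils code:
{code}
// Hedged sketch: if a catch/cleanup path calls addSuppressed(null), the NPE
// thrown by addSuppressed hides the real error (e.g. a duplicate key failure).
def withSuppressed(body: => Unit)(pendingError: Throwable): Unit = {
  try {
    body
  } catch {
    case e: Throwable =>
      e.addSuppressed(pendingError)   // throws NPE when pendingError is null
      throw e
  }
}

// withSuppressed { throw new RuntimeException("duplicate primary key") } (null)
// => java.lang.NullPointerException: Cannot suppress a null exception.
{code}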






[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593935#comment-15593935
 ] 

Maciej Bryński commented on SPARK-18022:


I think the problem is in this PR.
https://github.com/apache/spark/commit/811a2cef03647c5be29fef522c423921c79b1bc3

> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a DataFrame to MySQL, I'm unable to 
> see it.
> Instead, I'm getting the following stacktrace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key, etc.
> This makes it very difficult to debug apps.






[jira] [Created] (SPARK-18039) ReceiverTracker runs dummy job too fast, causing unbalanced receiver scheduling

2016-10-20 Thread astralidea (JIRA)
astralidea created SPARK-18039:
--

 Summary: ReceiverTracker runs dummy job too fast, causing unbalanced 
receiver scheduling
 Key: SPARK-18039
 URL: https://issues.apache.org/jira/browse/SPARK-18039
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.0.1
Reporter: astralidea
Priority: Minor


Receiver scheduling balance is important for me.
For instance, if I have 2 executors and each executor has 1 receiver, the 
calculation time is 0.1s per batch.
But if I have 2 executors where one executor has 2 receivers and the other has 
0 receivers, the calculation time increases by 3s per batch.
In my cluster executor initialization is slow and I need to wait about 30s, 
but the dummy job only waits about 4s. I added the 
spark.scheduler.maxRegisteredResourcesWaitingTime setting but it does not work.







[jira] [Comment Edited] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593846#comment-15593846
 ] 

Maciej Bryński edited comment on SPARK-18022 at 10/21/16 3:56 AM:
--

Only an improvement in error handling.
Because right now I'm getting only an NPE and I have to guess what the real 
reason for the error is.



was (Author: maver1ck):
Only improvement in error handling.
Because right now I'm getting only NPE and have to guess whats the real reason 
of error.


> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a DataFrame to MySQL, I'm unable to 
> see it.
> Instead, I'm getting the following stacktrace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key, etc.
> This makes it very difficult to debug apps.






[jira] [Comment Edited] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593846#comment-15593846
 ] 

Maciej Bryński edited comment on SPARK-18022 at 10/21/16 3:39 AM:
--

Only an improvement in error handling.
Because right now I'm getting only an NPE and have to guess what the real reason 
for the error is.



was (Author: maver1ck):
Only improvement in error handling.



> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a DataFrame to MySQL, I'm unable to 
> see it.
> Instead, I'm getting the following stacktrace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key, etc.
> This makes it very difficult to debug apps.






[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593846#comment-15593846
 ] 

Maciej Bryński commented on SPARK-18022:


Only an improvement in error handling.



> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a DataFrame to MySQL, I'm unable to 
> see it.
> Instead, I'm getting the following stacktrace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key, etc.
> This makes it very difficult to debug apps.






[jira] [Comment Edited] (SPARK-15765) Make continuous Parquet writes consistent with non-continuous Parquet writes

2016-10-20 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593782#comment-15593782
 ] 

Liwei Lin edited comment on SPARK-15765 at 10/21/16 3:07 AM:
-

I'm closing this in favor of SPARK-17924


was (Author: lwlin):
I'm closing this in favor of SPARK-18025

> Make continuous Parquet writes consistent with non-continuous Parquet writes
> 
>
> Key: SPARK-15765
> URL: https://issues.apache.org/jira/browse/SPARK-15765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin(Inactive)
>
> Currently there is some duplicated code between continuous Parquet writes (as in 
> Structured Streaming) and non-continuous writes; see 
> [ParquetFileFormat#prepareWrite()|https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68]
>  and 
> [ParquetFileFormat#ParquetOutputWriterFactory|https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414].
> This may lead to inconsistent behavior when we only change one piece of code 
> but not the other.






[jira] [Commented] (SPARK-15765) Make continuous Parquet writes consistent with non-continuous Parquet writes

2016-10-20 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593782#comment-15593782
 ] 

Liwei Lin commented on SPARK-15765:
---

I'm closing this in favor of SPARK-18025

> Make continuous Parquet writes consistent with non-continuous Parquet writes
> 
>
> Key: SPARK-15765
> URL: https://issues.apache.org/jira/browse/SPARK-15765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin(Inactive)
>
> Currently there is some duplicated code between continuous Parquet writes (as in 
> Structured Streaming) and non-continuous writes; see 
> [ParquetFileFormat#prepareWrite()|https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68]
>  and 
> [ParquetFileFormat#ParquetOutputWriterFactory|https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414].
> This may lead to inconsistent behavior when we only change one piece of code 
> but not the other.






[jira] [Closed] (SPARK-15765) Make continuous Parquet writes consistent with non-continuous Parquet writes

2016-10-20 Thread Liwei Lin(Inactive) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin(Inactive) closed SPARK-15765.
---
Resolution: Duplicate

> Make continuous Parquet writes consistent with non-continuous Parquet writes
> 
>
> Key: SPARK-15765
> URL: https://issues.apache.org/jira/browse/SPARK-15765
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin(Inactive)
>
> Currently there is some duplicated code between continuous Parquet writes (as in 
> Structured Streaming) and non-continuous writes; see 
> [ParquetFileFormat#prepareWrite()|https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68]
>  and 
> [ParquetFileFormat#ParquetOutputWriterFactory|https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414].
> This may lead to inconsistent behavior when we only change one piece of code 
> but not the other.






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-20 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593760#comment-15593760
 ] 

Liwei Lin commented on SPARK-16845:
---

Oh thanks for the feedback; it's helpful!

The branch you're testing against is one way to fix this, and there's also an 
alternative way; we're still discussing which would be better. I think this 
will likely get merged after Spark Summit Europe. Thanks!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs: 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Commented] (SPARK-17829) Stable format for offset log

2016-10-20 Thread Tyson Condie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593706#comment-15593706
 ] 

Tyson Condie commented on SPARK-17829:
--

Had a conversation with Michael about how to handle offset serialization. When 
considering deserialization, the following three options seem possible:
1. Ask the source to deserialize the string into an offset (object).
2. Follow a formatting convention, e.g., the first line identifies an offset 
implementation class that accepts a string constructor argument, and the string 
that is passed to the constructor comes from the second line (sketched below).
3. Get rid of the Offset trait entirely and only deal with strings. This seems 
reasonable since we do not need to compare two offsets; we only care about the 
source's understanding of the offset, which it can interpret from whatever it 
embeds in the string, e.g., as in option 2.
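A minimal sketch of option 2's convention, assuming a two-line log entry (class name first, then the offset string); the helper name below is hypothetical and not part of any existing API:
{code}
// Hedged sketch: line 1 names the Offset implementation, line 2 carries the
// string passed to its single-String constructor.
def deserializeOffset(entry: String): AnyRef = {
  val Array(className, offsetString) = entry.split("\n", 2)
  Class.forName(className)
    .getConstructor(classOf[String])
    .newInstance(offsetString)
    .asInstanceOf[AnyRef]
}
{code}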



> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.






[jira] [Commented] (SPARK-17891) SQL-based three column join loses first column

2016-10-20 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593685#comment-15593685
 ] 

Yuming Wang commented on SPARK-17891:
-

*Workaround:*
# Disable BroadcastHashJoin by setting 
{{spark.sql.autoBroadcastJoinThreshold=-1}} (see the sketch below)
# Convert the join keys to StringType
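A minimal sketch of the first workaround, assuming an existing SparkSession named {{spark}}; setting the threshold to -1 disables broadcast hash joins for the session, so the query falls back to a sort-merge join:
{code}
// Hedged sketch of the session-level workaround described above.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}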

> SQL-based three column join loses first column
> --
>
> Key: SPARK-17891
> URL: https://issues.apache.org/jira/browse/SPARK-17891
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Eli Miller
> Attachments: test.tgz
>
>
> Hi all,
> I hope that this is not a known issue (I haven't had any luck finding 
> anything similar in Jira or the mailing lists but I could be searching with 
> the wrong terms). I just started to experiment with Spark SQL and am seeing 
> what appears to be a bug. When using Spark SQL to join two tables with a 
> three column inner join, the first column join is ignored. The example code 
> that I have starts with two tables *T1*:
> {noformat}
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  1|  2|  3|  4|
> +---+---+---+---+
> {noformat}
> and *T2*:
> {noformat}
> +---+---+---+---+
> |  b|  c|  d|  e|
> +---+---+---+---+
> |  2|  3|  4|  5|
> | -2|  3|  4|  6|
> |  2| -3|  4|  7|
> +---+---+---+---+
> {noformat}
> Joining *T1* to *T2* on *b*, *c* and *d* (in that order):
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.b = t2.b AND t1.c = t2.c AND t1.d = t2.d
> {code}
> results in the following (note that *T1.b* != *T2.b* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2| -2|  3|  3|  4|  4|  6|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Switching the predicate order to *c*, *b* and *d*:
> {code:sql}
> SELECT t1.a, t1.b, t2.b, t1.c,t2.c, t1.d, t2.d, t2.e
> FROM t1, t2
> WHERE t1.c = t2.c AND t1.b = t2.b AND t1.d = t2.d
> {code}
> yields different results (now *T1.c* != *T2.c* in the first row):
> {noformat}
> +---+---+---+---+---+---+---+---+
> |  a|  b|  b|  c|  c|  d|  d|  e|
> +---+---+---+---+---+---+---+---+
> |  1|  2|  2|  3| -3|  4|  4|  7|
> |  1|  2|  2|  3|  3|  4|  4|  5|
> +---+---+---+---+---+---+---+---+
> {noformat}
> Is this expected?
> I started to research this a bit and one thing that jumped out at me was the 
> ordering of the HashedRelationBroadcastMode concatenation in the plan (this 
> is from the *b*, *c*, *d* ordering):
> {noformat}
> ...
> *Project [a#0, b#1, b#9, c#2, c#10, d#3, d#11, e#12]
> +- *BroadcastHashJoin [b#1, c#2, d#3], [b#9, c#10, d#11], Inner, BuildRight
>:- *Project [a#0, b#1, c#2, d#3]
>:  +- *Filter ((isnotnull(b#1) && isnotnull(c#2)) && isnotnull(d#3))
>: +- *Scan csv [a#0,b#1,c#2,d#3] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t1.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(b), IsNotNull(c), IsNotNull(d)], ReadSchema: 
> struct
>+- BroadcastExchange 
> HashedRelationBroadcastMode(List((shiftleft((shiftleft(cast(input[0, int, 
> true] as bigint), 32) | (cast(input[1, int, true] as bigint) & 4294967295)), 
> 32) | (cast(input[2, int, true] as bigint) & 4294967295
>   +- *Project [b#9, c#10, d#11, e#12]
>  +- *Filter ((isnotnull(c#10) && isnotnull(b#9)) && isnotnull(d#11))
> +- *Scan csv [b#9,c#10,d#11,e#12] Format: CSV, InputPaths: 
> file:/home/eli/git/IENG/what/target/classes/t2.csv, PartitionFilters: [], 
> PushedFilters: [IsNotNull(c), IsNotNull(b), IsNotNull(d)], ReadSchema: 
> struct]
> {noformat}
> If this concatenated byte array is ever truncated to 64 bits in a comparison, 
> the leading column will be lost and could result in this behavior.
> I will attach my example code and data. Please let me know if I can provide 
> any other details.
> Best regards,
> Eli
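To make the truncation hypothesis above concrete, here is a small self-contained arithmetic sketch (not Spark code) of what the HashedRelationBroadcastMode expression in the plan computes when three 32-bit keys are packed into one 64-bit long: the leading key is shifted out entirely, so rows that differ only in the first key collide.
{code}
// Hedged illustration: packs (b, c, d) the way the plan's shiftleft/OR chain does.
def pack(b: Int, c: Int, d: Int): Long =
  (((b.toLong << 32) | (c.toLong & 0xFFFFFFFFL)) << 32) | (d.toLong & 0xFFFFFFFFL)

println(pack(2, 3, 4) == pack(-2, 3, 4))   // true: b = 2 and b = -2 collide
{code}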






[jira] [Comment Edited] (SPARK-882) Have link for feedback/suggestions in docs

2016-10-20 Thread Deron Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593573#comment-15593573
 ] 

Deron Eriksson edited comment on SPARK-882 at 10/21/16 1:37 AM:


I don't see any activity, so mind if I take a crack at this for the Spark 
documentation (link to open a pre-populated minor doc JIRA)?

cc [~pwendell] [~srowen]



was (Author: deron):
I don't see any activity, so mind if I take a crack at this for the Spark 
documentation (link to open a pre-populated minor doc JIRA)?

I think this is a great idea so I just implemented it for SystemML 
(http://apache.github.io/incubator-systemml/ under Issue Tracking on the top 
nav).

cc [~pwendell] [~srowen]


> Have link for feedback/suggestions in docs
> --
>
> Key: SPARK-882
> URL: https://issues.apache.org/jira/browse/SPARK-882
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Patrick Cogan
>
> It would be cool to have a link at the top of the docs for 
> feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from 
> that and it could be a good way to crowdsource correctness checking, since a 
> lot of us that write them never have to use them.
> Something to the right of the main top nav might be good. [~andyk] [~matei] - 
> what do you guys think?






[jira] [Commented] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children

2016-10-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593583#comment-15593583
 ] 

Reynold Xin commented on SPARK-18038:
-

It definitely does.


> Move output partitioning definition from UnaryNodeExec to its children
> --
>
> Key: SPARK-18038
> URL: https://issues.apache.org/jira/browse/SPARK-18038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> This was a suggestion by [~rxin] in one of the dev list discussions: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html
> {noformat}
> I think this is very risky because preserving output partitioning should not 
> be a property of UnaryNodeExec (e.g. exchange).
> It would be better (safer) to move the output partitioning definition into 
> each of the operator and remove it from UnaryExecNode.
> {noformat}






[jira] [Commented] (SPARK-882) Have link for feedback/suggestions in docs

2016-10-20 Thread Deron Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593573#comment-15593573
 ] 

Deron Eriksson commented on SPARK-882:
--

I don't see any activity, so mind if I take a crack at this for the Spark 
documentation (link to open a pre-populated minor doc JIRA)?

I think this is a great idea so I just implemented it for SystemML 
(http://apache.github.io/incubator-systemml/ under Issue Tracking on the top 
nav).

cc [~pwendell] [~srowen]


> Have link for feedback/suggestions in docs
> --
>
> Key: SPARK-882
> URL: https://issues.apache.org/jira/browse/SPARK-882
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Patrick Cogan
>
> It would be cool to have a link at the top of the docs for 
> feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from 
> that and it could be a good way to crowdsource correctness checking, since a 
> lot of us that write them never have to use them.
> Something to the right of the main top nav might be good. [~andyk] [~matei] - 
> what do you guys think?






[jira] [Comment Edited] (SPARK-7146) Should ML sharedParams be a public API?

2016-10-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593539#comment-15593539
 ] 

Joseph K. Bradley edited comment on SPARK-7146 at 10/21/16 12:53 AM:
-

Update: We may need to make Java interfaces for these, rather than expecting 
users to depend upon the Scala traits.  There's a Java binary compatibility 
issue which surfaced in the MiMa upgrade here: 
[https://github.com/apache/spark/pull/15571]

We could probably also expose the corresponding Scala traits since they should 
be safe for Scala users to use outside of Spark.


was (Author: josephkb):
Update: We may need to make Java interfaces for these, rather than expecting 
users to depend upon the Scala traits.  There's a Java binary compatibility 
issue which surfaced in the Mi

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)






[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2016-10-20 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593539#comment-15593539
 ] 

Joseph K. Bradley commented on SPARK-7146:
--

Update: We may need to make Java interfaces for these, rather than expecting 
users to depend upon the Scala traits.  There's a Java binary compatibility 
issue which surfaced in the Mi

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Proposal: Make most of the Param traits in sharedParams.scala public.  Mark 
> them as DeveloperApi.
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> h3. UPDATED proposal
> * Some Params are clearly safe to make public.  We will do so.
> * Some Params could be made public but may require caveats in the trait doc.
> * Some Params have turned out not to be shared in practice.  We can move 
> those Params to the classes which use them.
> *Public shared params*:
> * I/O column params
> ** HasFeaturesCol
> ** HasInputCol
> ** HasInputCols
> ** HasLabelCol
> ** HasOutputCol
> ** HasPredictionCol
> ** HasProbabilityCol
> ** HasRawPredictionCol
> ** HasVarianceCol
> ** HasWeightCol
> * Algorithm settings
> ** HasCheckpointInterval
> ** HasElasticNetParam
> ** HasFitIntercept
> ** HasMaxIter
> ** HasRegParam
> ** HasSeed
> ** HasStandardization (less common)
> ** HasStepSize
> ** HasTol
> *Questionable params*:
> * HasHandleInvalid (only used in StringIndexer, but might be more widely used 
> later on)
> * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but 
> same meaning as Optimizer in LDA)
> *Params to be removed from sharedParams*:
> * HasThreshold (only used in LogisticRegression)
> * HasThresholds (only used in ProbabilisticClassifier)






[jira] [Assigned] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18030:


Assignee: Apache Spark  (was: Tathagata Das)

> Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite 
> -
>
> Key: SPARK-18030
> URL: https://issues.apache.org/jira/browse/SPARK-18030
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data






[jira] [Commented] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite

2016-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593493#comment-15593493
 ] 

Apache Spark commented on SPARK-18030:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15577

> Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite 
> -
>
> Key: SPARK-18030
> URL: https://issues.apache.org/jira/browse/SPARK-18030
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Davies Liu
>Assignee: Tathagata Das
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data






[jira] [Assigned] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18030:


Assignee: Tathagata Das  (was: Apache Spark)

> Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite 
> -
>
> Key: SPARK-18030
> URL: https://issues.apache.org/jira/browse/SPARK-18030
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Davies Liu
>Assignee: Tathagata Das
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data






[jira] [Commented] (SPARK-17674) Warnings from SparkR tests being ignored without redirecting to errors

2016-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593484#comment-15593484
 ] 

Apache Spark commented on SPARK-17674:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/15576

> Warnings from SparkR tests being ignored without redirecting to errors
> --
>
> Key: SPARK-17674
> URL: https://issues.apache.org/jira/browse/SPARK-17674
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Reporter: Hyukjin Kwon
>
> For example, _currently_ we are having warnings as below:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65905/consoleFull
> {code}
> Warnings 
> ---
> 1. spark.mlp (@test_mllib.R#400) - is.na() applied to non-(list or vector) of 
> type 'NULL'
> 2. spark.mlp (@test_mllib.R#401) - is.na() applied to non-(list or vector) of 
> type 'NULL'
> {code}
> These should be errors, as specified in 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L22 
> However, the tests seem to pass fine.
> This seems related to the behaviour of the `testthat` library. We should 
> investigate and fix it. This was also discussed in 
> https://github.com/apache/spark/pull/15232






[jira] [Assigned] (SPARK-17674) Warnings from SparkR tests being ignored without redirecting to errors

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17674:


Assignee: Apache Spark

> Warnings from SparkR tests being ignored without redirecting to errors
> --
>
> Key: SPARK-17674
> URL: https://issues.apache.org/jira/browse/SPARK-17674
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> For example, we are _currently_ getting warnings like the ones below:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65905/consoleFull
> {code}
> Warnings 
> ---
> 1. spark.mlp (@test_mllib.R#400) - is.na() applied to non-(list or vector) of 
> type 'NULL'
> 2. spark.mlp (@test_mllib.R#401) - is.na() applied to non-(list or vector) of 
> type 'NULL'
> {code}
> These should be errors, as specified in 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L22 
> However, the tests still seem to pass.
> This seems related to the behaviour of the `testthat` library. We should 
> investigate and fix it. This was also discussed in 
> https://github.com/apache/spark/pull/15232



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17674) Warnings from SparkR tests being ignored without redirecting to errors

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17674:


Assignee: (was: Apache Spark)

> Warnings from SparkR tests being ignored without redirecting to errors
> --
>
> Key: SPARK-17674
> URL: https://issues.apache.org/jira/browse/SPARK-17674
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Reporter: Hyukjin Kwon
>
> For example, we are _currently_ getting warnings like the ones below:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65905/consoleFull
> {code}
> Warnings 
> ---
> 1. spark.mlp (@test_mllib.R#400) - is.na() applied to non-(list or vector) of 
> type 'NULL'
> 2. spark.mlp (@test_mllib.R#401) - is.na() applied to non-(list or vector) of 
> type 'NULL'
> {code}
> These should be errors, as specified in 
> https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L22 
> However, the tests still seem to pass.
> This seems related to the behaviour of the `testthat` library. We should 
> investigate and fix it. This was also discussed in 
> https://github.com/apache/spark/pull/15232



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18038:


Assignee: (was: Apache Spark)

> Move output partitioning definition from UnaryNodeExec to its children
> --
>
> Key: SPARK-18038
> URL: https://issues.apache.org/jira/browse/SPARK-18038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> This was a suggestion by [~rxin] in one of the dev list discussions: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html
> {noformat}
> I think this is very risky because preserving output partitioning should not 
> be a property of UnaryNodeExec (e.g. exchange).
> It would be better (safer) to move the output partitioning definition into 
> each of the operator and remove it from UnaryExecNode.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children

2016-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593479#comment-15593479
 ] 

Apache Spark commented on SPARK-18038:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/15575

> Move output partitioning definition from UnaryNodeExec to its children
> --
>
> Key: SPARK-18038
> URL: https://issues.apache.org/jira/browse/SPARK-18038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> This was a suggestion by [~rxin] in one of the dev list discussions: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html
> {noformat}
> I think this is very risky because preserving output partitioning should not 
> be a property of UnaryNodeExec (e.g. exchange).
> It would be better (safer) to move the output partitioning definition into 
> each of the operator and remove it from UnaryExecNode.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18038:


Assignee: Apache Spark

> Move output partitioning definition from UnaryNodeExec to its children
> --
>
> Key: SPARK-18038
> URL: https://issues.apache.org/jira/browse/SPARK-18038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Trivial
>
> This was a suggestion by [~rxin] in one of the dev list discussions: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html
> {noformat}
> I think this is very risky because preserving output partitioning should not 
> be a property of UnaryNodeExec (e.g. exchange).
> It would be better (safer) to move the output partitioning definition into 
> each of the operator and remove it from UnaryExecNode.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children

2016-10-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593478#comment-15593478
 ] 

Tejas Patil commented on SPARK-18038:
-

Not sure if this deserves a JIRA, but I created one. This is a small 
refactoring of the code.

> Move output partitioning definition from UnaryNodeExec to its children
> --
>
> Key: SPARK-18038
> URL: https://issues.apache.org/jira/browse/SPARK-18038
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> This was a suggestion by [~rxin] in one of the dev list discussions: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html
> {noformat}
> I think this is very risky because preserving output partitioning should not 
> be a property of UnaryNodeExec (e.g. exchange).
> It would be better (safer) to move the output partitioning definition into 
> each of the operator and remove it from UnaryExecNode.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18038) Move output partitioning definition from UnaryNodeExec to its children

2016-10-20 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-18038:
---

 Summary: Move output partitioning definition from UnaryNodeExec to 
its children
 Key: SPARK-18038
 URL: https://issues.apache.org/jira/browse/SPARK-18038
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1
Reporter: Tejas Patil
Priority: Trivial


This was a suggestion by [~rxin] in one of the dev list discussions: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html

{noformat}
I think this is very risky because preserving output partitioning should not be 
a property of UnaryNodeExec (e.g. exchange).
It would be better (safer) to move the output partitioning definition into each 
of the operator and remove it from UnaryExecNode.
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13955) Spark in yarn mode fails

2016-10-20 Thread Tzach Zohar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593453#comment-15593453
 ] 

Tzach Zohar commented on SPARK-13955:
-

[~saisai_shao] can you clarify regarding option #1: when you say

bq. You need to zip all the jars and specify spark.yarn.archive with the path 
of zipped jars

What exactly should that archive look like? 
We're upgrading from 1.6.2 and we keep getting the same error mentioned above:

bq. Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher

We've tried using {{spark.yarn.archive}} with:
 - The Spark binary downloaded from the download page (e.g. 
http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.6.tgz)
 - Creating a {{.zip}} file with the contents of the {{jars/}} folder from the 
downloaded binary
 - Creating a {{.tgz}} file with the contents of the {{jars/}} folder from the 
downloaded binary  
 - All of these options while placing file either on HDFS or locally on driver 
machine

None of these resolve the issue. The only option that actually worked for us 
was the third one you mentioned - setting neither {{spark.yarn.jars}} nor 
{{spark.yarn.archive}} and making sure the right jars exist in 
{{SPARK_HOME/jars}} on each node - but since we run several applications with 
different spark versions and want to simplify our provisioning - this isn't 
convenient for us.

Any clarification would be greatly appreciated, 
Thanks!

> Spark in yarn mode fails
> 
>
> Key: SPARK-13955
> URL: https://issues.apache.org/jira/browse/SPARK-13955
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
>
> I ran spark-shell in yarn-client mode, but from the logs it seems the spark 
> assembly jar is not uploaded to HDFS. This may be a known issue from the 
> SPARK-11157 work; creating this ticket to track it. [~vanzin]
> {noformat}
> 16/03/17 17:57:48 INFO Client: Will allocate AM container, with 896 MB memory 
> including 384 MB overhead
> 16/03/17 17:57:48 INFO Client: Setting up container launch context for our AM
> 16/03/17 17:57:48 INFO Client: Setting up the launch environment for our AM 
> container
> 16/03/17 17:57:48 INFO Client: Preparing resources for our AM container
> 16/03/17 17:57:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> 16/03/17 17:57:48 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.10.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.10.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/Users/jzhang/github/spark/lib/apache-rat-0.11.jar -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/apache-rat-0.11.jar
> 16/03/17 17:57:49 INFO Client: Uploading resource 
> file:/private/var/folders/dp/hmchg5dd3vbcvds26q91spdwgp/T/spark-abed04bf-6ac2-448b-91a9-dcc1c401a18f/__spark_conf__4163776487351314654.zip
>  -> 
> hdfs://localhost:9000/user/jzhang/.sparkStaging/application_1458187008455_0006/__spark_conf__4163776487351314654.zip
> 16/03/17 17:57:49 INFO SecurityManager: Changing view acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: Changing modify acls to: jzhang
> 16/03/17 17:57:49 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jzhang); users 
> with modify permissions: Set(jzhang)
> 16/03/17 17:57:49 INFO Client: Submitting application 6 to ResourceManager
> {noformat}
> message in AM container
> {noformat}
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ExecutorLauncher
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18037) Event listener should be aware of multiple tries of same stage

2016-10-20 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593347#comment-15593347
 ] 

Josh Rosen commented on SPARK-18037:


Ahhh, I remember there being other JIRAs related to a negative number of active 
tasks but AFAIK we were never able to reproduce that issue. Thanks for getting 
to the bottom of this! I'll search JIRA and link those older issues here.

> Event listener should be aware of multiple tries of same stage
> --
>
> Key: SPARK-18037
> URL: https://issues.apache.org/jira/browse/SPARK-18037
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>
> A stage could be resubmitted before all the tasks from the previous attempt 
> have finished; the event listener then mixes them up, causing a confusing 
> (negative) number of active tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18037) Event listener should be aware of multiple tries of same stage

2016-10-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18037:
--

 Summary: Event listener should be aware of multiple tries of same 
stage
 Key: SPARK-18037
 URL: https://issues.apache.org/jira/browse/SPARK-18037
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu


A stage could be resubmitted before all the tasks from the previous attempt 
have finished; the event listener then mixes them up, causing a confusing 
(negative) number of active tasks.
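
A minimal sketch of the direction a fix could take (hypothetical names, not the 
actual listener code): key the bookkeeping by both stage id and attempt id, so 
task events from an old attempt only touch that attempt's counters.

{code}
import scala.collection.mutable

// Active-task counters keyed by (stageId, attemptId).
val activeTasksPerAttempt = mutable.Map.empty[(Int, Int), Int].withDefaultValue(0)

def onTaskStart(stageId: Int, attemptId: Int): Unit =
  activeTasksPerAttempt((stageId, attemptId)) += 1

def onTaskEnd(stageId: Int, attemptId: Int): Unit =
  activeTasksPerAttempt((stageId, attemptId)) -= 1

// Late completions from a superseded attempt decrement that attempt's counter,
// not the resubmitted attempt's, so the per-stage total stays consistent.
def activeTasks(stageId: Int): Int =
  activeTasksPerAttempt.collect { case ((s, _), n) if s == stageId => n }.sum
{code}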



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18019) Log instrumentation in GBTs

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18019:


Assignee: (was: Apache Spark)

> Log instrumentation in GBTs
> ---
>
> Key: SPARK-18019
> URL: https://issues.apache.org/jira/browse/SPARK-18019
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Sub-task for adding instrumentation to GBTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18019) Log instrumentation in GBTs

2016-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593306#comment-15593306
 ] 

Apache Spark commented on SPARK-18019:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/15574

> Log instrumentation in GBTs
> ---
>
> Key: SPARK-18019
> URL: https://issues.apache.org/jira/browse/SPARK-18019
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Sub-task for adding instrumentation to GBTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18019) Log instrumentation in GBTs

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18019:


Assignee: Apache Spark

> Log instrumentation in GBTs
> ---
>
> Key: SPARK-18019
> URL: https://issues.apache.org/jira/browse/SPARK-18019
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> Sub-task for adding instrumentation to GBTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593300#comment-15593300
 ] 

Hyukjin Kwon commented on SPARK-17916:
--

Could I please ask what you think? cc [~felixcheung]

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593271#comment-15593271
 ] 

Hyukjin Kwon commented on SPARK-17916:
--

Oh, yes, sure. I just thought the root problem is differentiating {{""}}. Once 
we can distinguish it, we can easily transform it. Another point I wanted to 
make is that we already have a great reference in R, but it does not seem to 
handle this case.


> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18036) Decision Trees do not handle edge cases

2016-10-20 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18036:


 Summary: Decision Trees do not handle edge cases
 Key: SPARK-18036
 URL: https://issues.apache.org/jira/browse/SPARK-18036
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Seth Hendrickson
Priority: Minor


Decision trees/GBT/RF do not handle edge cases such as constant features or 
empty features. For example:

{code}
val dt = new DecisionTreeRegressor()
val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF()
dt.fit(data)

java.lang.UnsupportedOperationException: empty.max
  at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
  at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
  at 
org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207)
  at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
  at 
org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93)
  at 
org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
  ... 52 elided

{code}

as well as 

{code}
val dt = new DecisionTreeRegressor()
val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF()
dt.fit(data)

java.lang.UnsupportedOperationException: empty.maxBy
at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
at scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
at 
org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846)
{code}
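
Until the tree implementations validate their input up front, a caller can 
guard against both cases before calling {{fit}}. A minimal sketch of such a 
guard (the sample data is made up; assumes a spark-shell style session where 
{{spark}} and {{spark.implicits._}} are available):

{code}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.DecisionTreeRegressor
import spark.implicits._  // provided by spark-shell

// Made-up input that includes a row with an empty feature vector.
val raw = Seq(
  LabeledPoint(1.0, Vectors.dense(Array.empty[Double])),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 0.0)))

// Guard 1: drop rows whose feature vector is empty.
val nonEmpty = raw.filter(_.features.size > 0)

// Guard 2: require at least one non-constant feature (the constant-features
// case reported above is what trips empty.maxBy).
val hasVariation = nonEmpty.map(_.features.toArray.toSeq).distinct.size > 1
require(nonEmpty.nonEmpty && hasVariation,
  "need non-empty feature vectors and at least one non-constant feature")

val model = new DecisionTreeRegressor().fit(nonEmpty.toDF())
{code}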



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15777) Catalog federation

2016-10-20 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593229#comment-15593229
 ] 

Yan commented on SPARK-15777:
-

One approach could be first tagging a subtree as specific to a data source, and 
then only applying the custom rules from that data source to the subtree so 
tagged. There could be other feasible approaches, and it is considered one of 
the details left open for future discussions. Thanks.

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three-level identifier for tables (catalog.database.table). A possible 
> direction is to support arbitrary-level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18035:


Assignee: Apache Spark

> Unwrapping java maps in HiveInspectors allocates unnecessary buffer
> ---
>
> Key: SPARK-18035
> URL: https://issues.apache.org/jira/browse/SPARK-18035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Minor
>
> In HiveInspectors, I saw that converting Java map to Spark's 
> `ArrayBasedMapData` spent quite some time in buffer copying: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658
> The reason being `map.toSeq` allocates a new buffer and copies the map 
> entries to it: 
> https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323
> This copy is not needed as we get rid of it once we extract the key and value 
> arrays.
> Here is the call trace:
> {noformat}
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
> scala.collection.AbstractMap.toSeq(Map.scala:59)
> scala.collection.MapLike$class.toSeq(MapLike.scala:323)
> scala.collection.AbstractMap.toBuffer(Map.scala:59)
> scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
> scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
> scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer

2016-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593190#comment-15593190
 ] 

Apache Spark commented on SPARK-18035:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/15573

> Unwrapping java maps in HiveInspectors allocates unnecessary buffer
> ---
>
> Key: SPARK-18035
> URL: https://issues.apache.org/jira/browse/SPARK-18035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Minor
>
> In HiveInspectors, I saw that converting Java map to Spark's 
> `ArrayBasedMapData` spent quite some time in buffer copying: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658
> The reason being `map.toSeq` allocates a new buffer and copies the map 
> entries to it: 
> https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323
> This copy is not needed as we get rid of it once we extract the key and value 
> arrays.
> Here is the call trace:
> {noformat}
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
> scala.collection.AbstractMap.toSeq(Map.scala:59)
> scala.collection.MapLike$class.toSeq(MapLike.scala:323)
> scala.collection.AbstractMap.toBuffer(Map.scala:59)
> scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
> scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
> scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18035:


Assignee: (was: Apache Spark)

> Unwrapping java maps in HiveInspectors allocates unnecessary buffer
> ---
>
> Key: SPARK-18035
> URL: https://issues.apache.org/jira/browse/SPARK-18035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Minor
>
> In HiveInspectors, I saw that converting Java map to Spark's 
> `ArrayBasedMapData` spent quite some time in buffer copying: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658
> The reason being `map.toSeq` allocates a new buffer and copies the map 
> entries to it: 
> https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323
> This copy is not needed as we get rid of it once we extract the key and value 
> arrays.
> Here is the call trace:
> {noformat}
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
> scala.collection.AbstractMap.toSeq(Map.scala:59)
> scala.collection.MapLike$class.toSeq(MapLike.scala:323)
> scala.collection.AbstractMap.toBuffer(Map.scala:59)
> scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
> scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
> scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer

2016-10-20 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-18035:
---

 Summary: Unwrapping java maps in HiveInspectors allocates 
unnecessary buffer
 Key: SPARK-18035
 URL: https://issues.apache.org/jira/browse/SPARK-18035
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Tejas Patil
Priority: Minor


In HiveInspectors, I saw that converting Java map to Spark's 
`ArrayBasedMapData` spent quite some time in buffer copying: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658

The reason being `map.toSeq` allocates a new buffer and copies the map entries 
to it: 
https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323

This copy is not needed as we get rid of it once we extract the key and value 
arrays.

Here is the call trace:

```
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
scala.collection.AbstractMap.toSeq(Map.scala:59)
scala.collection.MapLike$class.toSeq(MapLike.scala:323)
scala.collection.AbstractMap.toBuffer(Map.scala:59)
scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
```
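
For illustration only (a sketch of the idea, not the actual patch): the entry 
set can be walked once into pre-sized key/value arrays, so no intermediate 
buffer is allocated. The unwrap functions stand in for whatever per-element 
conversion the inspectors perform.

{code}
// Sketch: one pass over the java.util.Map, filling pre-sized arrays directly
// instead of going through map.toSeq (which copies everything into an
// intermediate ArrayBuffer first).
def toKeyValueArrays(
    m: java.util.Map[AnyRef, AnyRef],
    unwrapKey: AnyRef => Any,
    unwrapValue: AnyRef => Any): (Array[Any], Array[Any]) = {
  val size = m.size()
  val keys = new Array[Any](size)
  val values = new Array[Any](size)
  val it = m.entrySet().iterator()
  var i = 0
  while (it.hasNext) {
    val entry = it.next()
    keys(i) = unwrapKey(entry.getKey)
    values(i) = unwrapValue(entry.getValue)
    i += 1
  }
  (keys, values)
}
{code}

The two arrays can then back the map data Spark builds, skipping the buffer 
copy shown in the trace above.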



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18035) Unwrapping java maps in HiveInspectors allocates unnecessary buffer

2016-10-20 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated SPARK-18035:

Description: 
In HiveInspectors, I saw that converting Java map to Spark's 
`ArrayBasedMapData` spent quite some time in buffer copying: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658

The reason being `map.toSeq` allocates a new buffer and copies the map entries 
to it: 
https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323

This copy is not needed as we get rid of it once we extract the key and value 
arrays.

Here is the call trace:

{noformat}
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
scala.collection.AbstractMap.toSeq(Map.scala:59)
scala.collection.MapLike$class.toSeq(MapLike.scala:323)
scala.collection.AbstractMap.toBuffer(Map.scala:59)
scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
{noformat}

  was:
In HiveInspectors, I saw that converting Java map to Spark's 
`ArrayBasedMapData` spent quite some time in buffer copying: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658

The reason being `map.toSeq` allocates a new buffer and copies the map entries 
to it: 
https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323

This copy is not needed as we get rid of it once we extract the key and value 
arrays.

Here is the call trace:

```
org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
scala.collection.AbstractMap.toSeq(Map.scala:59)
scala.collection.MapLike$class.toSeq(MapLike.scala:323)
scala.collection.AbstractMap.toBuffer(Map.scala:59)
scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
```


> Unwrapping java maps in HiveInspectors allocates unnecessary buffer
> ---
>
> Key: SPARK-18035
> URL: https://issues.apache.org/jira/browse/SPARK-18035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Tejas Patil
>Priority: Minor
>
> In HiveInspectors, I saw that converting Java map to Spark's 
> `ArrayBasedMapData` spent quite some time in buffer copying: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658
> The reason being `map.toSeq` allocates a new buffer and copies the map 
> entries to it: 
> https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323
> This copy is not needed as we get rid of it once we extract the key and value 
> arrays.
> Here is the call trace:
> {noformat}
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664)
> scala.collection.AbstractMap.toSeq(Map.scala:59)
> scala.collection.MapLike$class.toSeq(MapLike.scala:323)
> scala.collection.AbstractMap.toBuffer(Map.scala:59)
> scala.collection.MapLike$class.toBuffer(MapLike.scala:326)
> scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104)
> scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275)

[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593119#comment-15593119
 ] 

Suresh Thalamati commented on SPARK-17916:
--

Thank you for trying out the different scenarios. I think the output you are 
getting after setting the quote character to empty is not what is expected in 
this case. You want "" to be recognized as an empty string, not as literal 
quote characters in the output.

Example (Before my changes on 2.0.1 branch):

input:
col1,col2
1,"-"
2,""
3,
4,"A,B"

val df = spark.read.format("csv").option("nullValue", "\"-\"").option("quote", 
"").option("header", true).load("/Users/suresht/sparktests/emptystring.csv")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> df.selectExpr("length(col2)").show
+------------+
|length(col2)|
+------------+
|        null|
|           2|
|        null|
|           2|
+------------+


scala> df.show
+----+----+
|col1|col2|
+----+----+
|   1|null|
|   2|  ""|
|   3|null|
|   4|  "A|
+----+----+





> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593008#comment-15593008
 ] 

Dongjoon Hyun commented on SPARK-18022:
---

Or is what you want just a general improvement to the error handling in 
`JdbcUtils.scala:256`?

> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a dataframe to MySQL, I'm unable to 
> see it.
> Instead, I'm getting the following stacktrace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key.
> This makes it very difficult to debug apps.
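
For context on the message itself: {{Throwable.addSuppressed}} throws a 
NullPointerException when it is handed a null exception, and that NPE is what 
propagates instead of the original failure. A minimal standalone sketch of the 
mechanism (the "real" error below is made up, this is not the JdbcUtils code 
itself):

{code}
// Throwable.addSuppressed(null) throws
// NullPointerException("Cannot suppress a null exception."), so the real
// error is discarded before it ever reaches the logs.
val cause: Throwable = null  // e.g. nothing was recorded for the failed statement

try {
  try {
    throw new RuntimeException("Duplicate entry for PRIMARY key")  // hypothetical real error
  } catch {
    case e: Throwable =>
      e.addSuppressed(cause)  // blows up here, discarding e
      throw e
  }
} catch {
  case t: Throwable => println(t)  // java.lang.NullPointerException: Cannot suppress a null exception.
}
{code}

Guarding the call site with something like {{if (cause != null) 
e.addSuppressed(cause)}} (or reporting the original exception directly) would 
let the real error surface.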



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17829) Stable format for offset log

2016-10-20 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593000#comment-15593000
 ] 

Cody Koeninger commented on SPARK-17829:


At least with regard to kafka offsets, it might be good to keep this the same 
format as in SPARK-17812

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18022) java.lang.NullPointerException instead of real exception when saving DF to MySQL

2016-10-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593005#comment-15593005
 ] 

Dongjoon Hyun commented on SPARK-18022:
---

Hi, [~maver1ck].
Could you give us more information to reproduce this?
Is the table created by Spark? Spark does not create INDEX or CONSTRAINT 
(primary key), does it?

> java.lang.NullPointerException instead of real exception when saving DF to 
> MySQL
> 
>
> Key: SPARK-18022
> URL: https://issues.apache.org/jira/browse/SPARK-18022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>Priority: Minor
>
> Hi,
> I have found the following issue.
> When there is an exception while saving a dataframe to MySQL, I'm unable to 
> see it.
> Instead, I'm getting the following stacktrace.
> {code}
> 16/10/20 06:00:35 WARN TaskSetManager: Lost task 56.0 in stage 10.0 (TID 
> 3753, dwh-hn28.adpilot.co): java.lang.NullPointerException: Cannot suppress a 
> null exception.
> at java.lang.Throwable.addSuppressed(Throwable.java:1046)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:314)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:313)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The real exception could be, for example, a duplicate primary key.
> This makes it very difficult to debug apps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17829) Stable format for offset log

2016-10-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17829:
-
Assignee: Tyson Condie

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-20 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592862#comment-15592862
 ] 

Don Drake commented on SPARK-16845:
---

I compiled your branch and ran my large job and it finished successfully.  

Sorry for the confusion, I wasn't watching the PR, just this JIRA and wasn't 
aware of the changes you were making.

Can this get merged as well as backported to 2.0.x?

Thanks so much.

-Don


> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on 
> all columns, the fatal error below occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18034) Upgrade to MiMa 0.1.11

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18034:


Assignee: Josh Rosen  (was: Apache Spark)

> Upgrade to MiMa 0.1.11
> --
>
> Key: SPARK-18034
> URL: https://issues.apache.org/jira/browse/SPARK-18034
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> We should upgrade to the latest release of MiMa (0.1.11) in order to include 
> my fix for a bug which led to flakiness in the MiMa checks 
> (https://github.com/typesafehub/migration-manager/issues/115)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18034) Upgrade to MiMa 0.1.11

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18034:


Assignee: Apache Spark  (was: Josh Rosen)

> Upgrade to MiMa 0.1.11
> --
>
> Key: SPARK-18034
> URL: https://issues.apache.org/jira/browse/SPARK-18034
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> We should upgrade to the latest release of MiMa (0.1.11) in order to include 
> my fix for a bug which led to flakiness in the MiMa checks 
> (https://github.com/typesafehub/migration-manager/issues/115)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18034) Upgrade to MiMa 0.1.11

2016-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592845#comment-15592845
 ] 

Apache Spark commented on SPARK-18034:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15571

> Upgrade to MiMa 0.1.11
> --
>
> Key: SPARK-18034
> URL: https://issues.apache.org/jira/browse/SPARK-18034
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> We should upgrade to the latest release of MiMa (0.1.11) in order to include 
> my fix for a bug which led to flakiness in the MiMa checks 
> (https://github.com/typesafehub/migration-manager/issues/115)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592839#comment-15592839
 ] 

Reynold Xin commented on SPARK-10915:
-

The current implementation of collect_list isn't going to work very well for 
you. I do think we should create a version of collect_list that spills.


Alternatively, you can do df.repartition().sortWithinPartitions() -- which will 
give you the same thing.
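
A minimal sketch of that alternative in Scala (the {{events}} data and column 
names are made up; the PySpark calls are analogous):

{code}
import org.apache.spark.sql.functions.col
import spark.implicits._  // assumes a spark-shell style session

// Made-up per-key event data.
val events = Seq(
  ("u1", 3L, "c"), ("u1", 1L, "a"), ("u2", 2L, "b"), ("u1", 2L, "b")
).toDF("user_id", "ts", "payload")

// Co-locate each key in one partition and sort inside the partition, so a
// downstream per-partition pass sees each key's rows contiguously and in ts
// order, without collecting a whole group into memory at once.
val ordered = events
  .repartition(col("user_id"))
  .sortWithinPartitions(col("user_id"), col("ts"))

ordered.show()
{code}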

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18034) Upgrade to MiMa 0.1.11

2016-10-20 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-18034:
--

 Summary: Upgrade to MiMa 0.1.11
 Key: SPARK-18034
 URL: https://issues.apache.org/jira/browse/SPARK-18034
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Josh Rosen
Assignee: Josh Rosen


We should upgrade to the latest release of MiMa (0.1.11) in order to include my 
fix for a bug which led to flakiness in the MiMa checks 
(https://github.com/typesafehub/migration-manager/issues/115)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-20 Thread Jason White (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592831#comment-15592831
 ] 

Jason White commented on SPARK-10915:
-

At the moment, we use .repartitionAndSortWithinPartitions to give us a strictly 
ordered iterable that we can process one at a time. We don't have a Python list 
sitting in memory, instead we rely on ExternalSort to order in a memory-safe 
way.

I don't yet have enough experience with DataFrames to know if we will have the 
same or similar problems there. It's possible that collect_list will perform 
better - I'll give that a try when we get there and report back on this ticket 
if it's a suitable approach for our use case.

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2629) Improved state management for Spark Streaming (mapWithState)

2016-10-20 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2629:
-
Summary: Improved state management for Spark Streaming (mapWithState)  
(was: Improved state management for Spark Streaming)

> Improved state management for Spark Streaming (mapWithState)
> 
>
> Key: SPARK-2629
> URL: https://issues.apache.org/jira/browse/SPARK-2629
> Project: Spark
>  Issue Type: Epic
>  Components: Streaming
>Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.6.0
>
>
>  Current updateStateByKey provides stateful processing in Spark Streaming. It 
> allows the user to maintain per-key state and manage that state using an 
> updateFunction. The updateFunction is called for each key, and it uses new 
> data and the existing state of the key to generate an updated state. However, 
> based on community feedback, we have learnt the following lessons.
> - Need for more optimized state management that does not scan every key
> - Need to make it easier to implement common use cases - (a) timeout of idle 
> data, (b) returning items other than state
> The high level idea that I am proposing is 
> - Introduce a new API -trackStateByKey- *mapWithState* that allows the user 
> to update per-key state and emit arbitrary records. The new API is necessary 
> as it will have significantly different semantics from the existing 
> updateStateByKey API. This API will have direct support for timeouts.
> - Internally, the system will keep the state data as a map/list within the 
> partitions of the state RDDs. The new data RDDs will be partitioned 
> appropriately, and for all the key-value data, it will look up the map/list in 
> the state RDD partition and create a new list/map of updated state data. The 
> new state RDD partition will be created based on the update data and, if 
> necessary, the old data. 
> Here is the detailed design doc (*outdated, to be updated*). Please take a 
> look and provide feedback as comments.
> https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
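
For readers of the archive, a minimal Scala sketch of the mapWithState API described
above, in the style of the usual streaming word count; the counting logic and the
30-minute timeout are illustrative choices, not part of the ticket.

{code}
import org.apache.spark.streaming.{Minutes, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Mapping function: (key, new value, state) => emitted record.
def updateCount(word: String, one: Option[Int], state: State[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + one.getOrElse(0)
  if (!state.isTimingOut()) {
    state.update(newCount)   // updating a timing-out state would throw
  }
  (word, newCount)
}

// Each batch emits (word, runningCount); idle keys are dropped after 30 minutes.
def runningCounts(words: DStream[(String, Int)]): DStream[(String, Long)] =
  words.mapWithState(StateSpec.function(updateCount _).timeout(Minutes(30)))
{code}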



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18021) Refactor file name specification for data sources

2016-10-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18021.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Refactor file name specification for data sources
> -
>
> Key: SPARK-18021
> URL: https://issues.apache.org/jira/browse/SPARK-18021
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> Currently each data source OutputWriter is responsible for specifying the 
> entire file name for each output file. This, however, does not make sense 
> because we rely on file names for certain behaviors in Spark SQL, e.g. the 
> bucket id. The current approach allows individual data sources to break the 
> implementation of bucketing.
> We don't want to move file naming entirely out of the data sources, 
> because different data sources do want to specify different extensions.
> A good compromise is for the OutputWriter to take in the prefix for a file 
> and add its own suffix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18033) Deprecate TaskContext.partitionId

2016-10-20 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-18033:
--

 Summary: Deprecate TaskContext.partitionId
 Key: SPARK-18033
 URL: https://issues.apache.org/jira/browse/SPARK-18033
 Project: Spark
  Issue Type: Improvement
Reporter: Cody Koeninger


Mark TaskContext.partitionId as deprecated, because it doesn't always reflect 
the physical index at the time the RDD is created.  Add a 
foreachPartitionWithIndex method to mirror the existing mapPartitionsWithIndex 
method.

For background, see

http://apache-spark-developers-list.1001551.n3.nabble.com/PSA-TaskContext-partitionId-the-actual-logical-partition-index-td19524.html
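
As a rough illustration of the proposal (the name and mechanics below are mine, not
the eventual API), the existing mapPartitionsWithIndex can already emulate a
foreach-with-index action:

{code}
import org.apache.spark.rdd.RDD

// Run a side-effecting function per partition, together with the partition index.
// The dummy element and count() only exist to force evaluation, since
// mapPartitionsWithIndex is lazy.
def foreachPartitionWithIndex[T](rdd: RDD[T])(f: (Int, Iterator[T]) => Unit): Unit =
  rdd.mapPartitionsWithIndex { (index, iter) =>
    f(index, iter)
    Iterator(())
  }.count()
{code}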





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-20 Thread Jason White (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592534#comment-15592534
 ] 

Jason White commented on SPARK-10915:
-

That's unfortunate. Materializing a list somewhere is exactly what we're trying 
to avoid. The lists can get unpredictably long for a small number of keys, 
and this approach tends to blow past our memory ceiling, at least 
when using RDDs. It's why we don't use .groupByKey unless absolutely necessary.

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support Python-defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592544#comment-15592544
 ] 

Reynold Xin commented on SPARK-10915:
-

But if you need strict ordering guarantees, materializing them would be 
necessary, since sorting is a blocking operator.


> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support Python-defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-20 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592514#comment-15592514
 ] 

Davies Liu commented on SPARK-10915:


[~jason.white] When an aggregate function is applied, the order of the input rows is 
not defined (even if you have an order by before the aggregate). In cases where the 
order matters, you will have to use collect_list and a UDF.
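
A minimal Scala sketch of that collect_list-plus-UDF pattern, assuming hypothetical
columns "key", "ts" and "value"; note the collected array is still fully materialized
per key, so each group has to fit in memory.

{code}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, collect_list, sort_array, struct, udf}

// Join each key's values into one string in timestamp order.
val traceUdf = udf { (events: Seq[Row]) =>
  events.map(_.getAs[String]("value")).mkString("->")
}

def orderedAggregate(df: DataFrame): DataFrame =
  df.groupBy(col("key"))
    // collect (ts, value) structs per key, then sort the array by ts
    .agg(sort_array(collect_list(struct(col("ts"), col("value")))).as("events"))
    .select(col("key"), traceUdf(col("events")).as("trace"))
{code}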

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support Python-defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD

2016-10-20 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17999:
-
Assignee: Saisai Shao

> Add getPreferredLocations for KafkaSourceRDD
> 
>
> Key: SPARK-17999
> URL: https://issues.apache.org/jira/browse/SPARK-17999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> The newly implemented Structured Streaming KafkaSource does calculate the 
> preferred locations for each topic partition, but doesn't offer this 
> information through RDD's {{getPreferredLocations}} method. So here we propose 
> to add this method to {{KafkaSourceRDD}}.
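
For context, this is roughly what surfacing locality from an RDD looks like; the
partition class and host field below are illustrative only, not the actual
KafkaSourceRDD code.

{code}
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

case class HostAwarePartition(index: Int, preferredHost: String, records: Seq[String])
  extends Partition

class HostAwareRDD(sc: SparkContext, parts: Array[HostAwarePartition])
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] = parts.toArray[Partition]

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    split.asInstanceOf[HostAwarePartition].records.iterator

  // The scheduler consults this when placing tasks, preferring the returned hosts.
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HostAwarePartition].preferredHost)
}
{code}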



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17999) Add getPreferredLocations for KafkaSourceRDD

2016-10-20 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17999.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> Add getPreferredLocations for KafkaSourceRDD
> 
>
> Key: SPARK-17999
> URL: https://issues.apache.org/jira/browse/SPARK-17999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Reporter: Saisai Shao
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> The newly implemented Structured Streaming KafkaSource does calculate the 
> preferred locations for each topic partition, but doesn't offer this 
> information through RDD's {{getPreferredLocations}} method. So here we propose 
> to add this method to {{KafkaSourceRDD}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18032) Spark test failed as OOM in jenkins

2016-10-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18032:
--

 Summary: Spark test failed as OOM in jenkins
 Key: SPARK-18032
 URL: https://issues.apache.org/jira/browse/SPARK-18032
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Josh Rosen


I saw some tests fail with OOM recently, for example, 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/1998/console#l10n-footer

Maybe we should increase the heap size, since we continue to add more stuff 
to Spark/tests.
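
A hedged sketch of the kind of change this would mean in an sbt build definition;
the exact setting, value, and location used by Spark's own build are not shown here
and may differ.

{code}
// Fork the test JVMs and give them a larger heap.
fork in Test := true
javaOptions in Test += "-Xmx4g"
{code}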



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality

2016-10-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18031:
--

 Summary: Flaky test: 
org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic 
functionality
 Key: SPARK-18031
 URL: https://issues.apache.org/jira/browse/SPARK-18031
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu


https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite&test_name=basic+functionality



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite

2016-10-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18030:
--

 Summary: Flaky test: 
org.apache.spark.sql.streaming.FileStreamSourceSuite 
 Key: SPARK-18030
 URL: https://issues.apache.org/jira/browse/SPARK-18030
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Davies Liu
Assignee: Tathagata Das


https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15687) Columnar execution engine

2016-10-20 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592429#comment-15592429
 ] 

Evan Chan commented on SPARK-15687:
---

[~kiszk] thanks for the PR... would you mind pointing me to the ColumnarBatch 
Trait/API? I'd like to review that piece of it, but the code review is 
really, really long :) Thanks

> Columnar execution engine
> -
>
> Key: SPARK-15687
> URL: https://issues.apache.org/jira/browse/SPARK-15687
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially 
> in the context of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet 
> reading (via a vectorized Parquet decoder) and low cardinality aggregation. 
> Other parts of the engine are already using whole-stage code generation, 
> which is in many ways more efficient than a columnar execution engine for 
> flat data types.
> The goal here is to figure out a story to work towards making column batch 
> the common data exchange format between operators outside whole-stage code 
> generation, as well as with external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective: 
> - What is the end state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and 
> how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd party tools?
> - Can we have a spec for a fixed version of the column batch format that can 
> be externalized and use that in data source API v2?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15780) Support mapValues on KeyValueGroupedDataset

2016-10-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15780.
-
   Resolution: Fixed
 Assignee: Koert Kuipers
Fix Version/s: 2.1.0

> Support mapValues on KeyValueGroupedDataset
> ---
>
> Key: SPARK-15780
> URL: https://issues.apache.org/jira/browse/SPARK-15780
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, when doing groupByKey on a Dataset, the key ends up in the values, 
> which can be clumsy:
> {noformat}
> val ds: Dataset[(K, V)] = ...
> val grouped: KeyValueGroupedDataset[(K, (K, V))] = ds.groupByKey(_._1)
> {noformat}
> With mapValues one can create something more similar to PairRDDFunctions[K, 
> V]:
> {noformat}
> val ds: Dataset[(K, V)] = ...
> val grouped: KeyValueGroupedDataset[(K, V)] = 
> ds.groupByKey(_._1).mapValues(_._2)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17698) Join predicates should not contain filter clauses

2016-10-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17698.
-
   Resolution: Fixed
 Assignee: Tejas Patil
Fix Version/s: 2.1.0

> Join predicates should not contain filter clauses
> -
>
> Key: SPARK-17698
> URL: https://issues.apache.org/jira/browse/SPARK-17698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.1.0
>
>
> `ExtractEquiJoinKeys` is incorrectly using filter predicates as the join 
> condition for joins. While this does not lead to incorrect results, in the 
> case of bucketed + sorted tables we might miss out on avoiding an unnecessary 
> shuffle + sort, e.g.:
> {code}
> val df = (1 until 10).toDF("id").coalesce(1)
> hc.sql("DROP TABLE IF EXISTS table1").collect
> df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1")
> hc.sql("DROP TABLE IF EXISTS table2").collect
> df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2")
> sqlContext.sql("""
>   SELECT a.id, b.id
>   FROM table1 a
>   FULL OUTER JOIN table2 b
>   ON a.id = b.id AND a.id='1' AND b.id='1'
> """).explain(true)
> {code}
> This is doing a shuffle + sort over the table scan outputs, which is not needed as 
> both tables are bucketed and sorted on the same columns and have the same number 
> of buckets. This should be a single-stage job.
> {code}
> SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as 
> double)], FullOuter
> :- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 
> ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200)
> : +- *FileScan parquet default.table1[id#38] Batched: true, Format: 
> ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> +- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) 
> ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200)
>   +- *FileScan parquet default.table2[id#39] Batched: true, Format: 
> ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17904) Add a wrapper function to install R packages on each executors.

2016-10-20 Thread Piotr Smolinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592272#comment-15592272
 ] 

Piotr Smolinski commented on SPARK-17904:
-

Would it work at all? I have been looking recently at the SparkR implementation. 
ATM, on the executor side all dapply/gapply/spark.lapply calls are single-shot 
operations. The executor JVM either forks a preallocated small daemon process or 
launches a new R runtime (on Windows, or when the daemon is explicitly disabled) 
only for the duration of the call. This process is disposed of immediately once 
the task is done. That means there is no R runtime that can be preinitialized. 
Check: https://github.com/apache/spark/blob/master/R/pkg/inst/worker/worker.R

> Add a wrapper function to install R packages on each executors.
> ---
>
> Key: SPARK-17904
> URL: https://issues.apache.org/jira/browse/SPARK-17904
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Yanbo Liang
>
> SparkR provides {{spark.lapply}} to run local R functions in a distributed 
> environment, and {{dapply}} to run UDFs on a SparkDataFrame.
> If users use third-party libraries inside the function passed into 
> {{spark.lapply}} or {{dapply}}, they should install the required R packages 
> on each executor in advance.
> To install dependent R packages on each executor and check that the install 
> succeeded, we can run code similar to the following:
> (Note: the code is just an example, not the prototype of this proposal. The 
> detailed implementation should be discussed.)
> {code}
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), 
> function(x) { install.packages("Matrix") })
> test <- function(x) { "Matrix" %in% rownames(installed.packages()) }
> rdd <- SparkR:::lapplyPartition(SparkR:::parallelize(sc, 1:2, 2L), test)
> collectRDD(rdd)
> {code}
> It's cumbersome to run this code snippet each time you need a third-party 
> library: since SparkR is an interactive analytics tool, users may call lots 
> of libraries during an analytics session. In native R, users can run 
> {{install.packages()}} and {{library()}} across the interactive session.
> Should we provide one API to wrap the work mentioned above, so that users can 
> install dependent R packages on each executor easily? 
> I propose the following API:
> {{spark.installPackages(pkgs, repos)}}
> * pkgs: the names of the packages. If repos = NULL, this can be set to a 
> local/HDFS path, and SparkR will install packages from local package archives.
> * repos: the base URL(s) of the repositories to use. It can be NULL to 
> install from local directories.
> Since SparkR has its own library directories in which to install the packages on 
> each executor, I think it will not pollute the native R environment. I'd 
> like to know whether this makes sense, and feel free to correct me if there is a 
> misunderstanding.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17048) ML model read for custom transformers in a pipeline does not work

2016-10-20 Thread Nicolas Long (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592141#comment-15592141
 ] 

Nicolas Long edited comment on SPARK-17048 at 10/20/16 3:32 PM:


I hit this today too. The Scala workaround is simply to create an object of the 
same name that extends DefaultParamsReadable. E.g.

{code:java}
class HtmlRemover(val uid: String) extends StringUnaryTransformer[String, 
HtmlRemover] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("htmlremover"))

  def createTransformFunc: String => String = s => {
Jsoup.parse(s).body().text()
  }
}

object HtmlRemover extends DefaultParamsReadable[HtmlRemover]
{code}

But it would be nice not to have to create the singleton object, and to be able to 
simply add the trait to the transformer itself.

Note that StringUnaryTransformer is a simple custom wrapper trait here.


was (Author: nicl):
I hit this today too. The Scala workaround is simply to create an object of the 
same name that extends DefaultParamsReadable. E.g.

{code:java}
class HtmlRemover(val uid: String) extends StringUnaryTransformer[String, 
HtmlRemover] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("htmlremover"))

  def createTransformFunc: String => String = s => {
Jsoup.parse(s).body().text()
  }
}

object HtmlRemover extends DefaultParamsReadable[HtmlRemover]
{code}

Note that StringUnaryTransformer is a simple custom wrapper trait here.

> ML model read for custom transformers in a pipeline does not work 
> --
>
> Key: SPARK-17048
> URL: https://issues.apache.org/jira/browse/SPARK-17048
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
> Java API
>Reporter: Taras Matyashovskyy
>  Labels: easyfix, features
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> 0. Use Java API :( 
> 1. Create any custom ML transformer
> 2. Make it MLReadable and MLWritable
> 3. Add to pipeline
> 4. Evaluate model, e.g. CrossValidationModel, and save results to disk
> 5. For custom transformer you can use DefaultParamsReader and 
> DefaultParamsWriter, for instance 
> 6. Load model from saved directory
> 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, 
> Evaluator, etc.
> 8. Your custom transformer will fail with NPE
> Reason:
> ReadWrite.scala:447
> cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
> In Java this only works for static methods.
> As we are implementing MLReadable or MLWritable, this call should be an 
> instance method call. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17048) ML model read for custom transformers in a pipeline does not work

2016-10-20 Thread Nicolas Long (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592141#comment-15592141
 ] 

Nicolas Long commented on SPARK-17048:
--

I hit this today too. The Scala workaround is simply to create an object of the 
same name that extends DefaultParamsReadable. E.g.

{code:java}
class HtmlRemover(val uid: String) extends StringUnaryTransformer[String, 
HtmlRemover] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("htmlremover"))

  def createTransformFunc: String => String = s => {
Jsoup.parse(s).body().text()
  }
}

object HtmlRemover extends DefaultParamsReadable[HtmlRemover]
{code}

Note that StringUnaryTransformer is a simple custom wrapper trait here.

> ML model read for custom transformers in a pipeline does not work 
> --
>
> Key: SPARK-17048
> URL: https://issues.apache.org/jira/browse/SPARK-17048
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
> Environment: Spark 2.0.0
> Java API
>Reporter: Taras Matyashovskyy
>  Labels: easyfix, features
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> 0. Use Java API :( 
> 1. Create any custom ML transformer
> 2. Make it MLReadable and MLWritable
> 3. Add to pipeline
> 4. Evaluate model, e.g. CrossValidationModel, and save results to disk
> 5. For custom transformer you can use DefaultParamsReader and 
> DefaultParamsWriter, for instance 
> 6. Load model from saved directory
> 7. All out-of-the-box objects are loaded successfully, e.g. Pipeline, 
> Evaluator, etc.
> 8. Your custom transformer will fail with NPE
> Reason:
> ReadWrite.scala:447
> cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
> In Java this only works for static methods.
> As we are implementing MLReadable or MLWritable, this call should be an 
> instance method call. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15777) Catalog federation

2016-10-20 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592092#comment-15592092
 ] 

Nattavut Sutyanyong commented on SPARK-15777:
-

How do we test that a rule added in one data source implementation will not 
interfere with such a SQL statement referencing objects from both data sources?

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three-level identifier for tables (catalog.database.table). A possible 
> direction is to support arbitrary-level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.
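
To make the multi-level-identifier idea concrete, a purely hypothetical Scala sketch
follows; this is not an existing Spark API, just an illustration of the data structure
being discussed.

{code}
// A catalog-qualified name just adds one optional layer in front of database.table.
case class QualifiedTableName(
    catalog: Option[String],    // e.g. Some("hive") for a mounted external catalog
    database: Option[String],
    table: String) {

  override def toString: String =
    (catalog.toSeq ++ database.toSeq :+ table).mkString(".")
}

// QualifiedTableName(Some("hive"), Some("sales"), "orders").toString == "hive.sales.orders"
{code}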



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18029:


Assignee: Apache Spark  (was: Wenchen Fan)

> PruneFileSourcePartitions should not change the output of LogicalRelation
> -
>
> Key: SPARK-18029
> URL: https://issues.apache.org/jira/browse/SPARK-18029
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation

2016-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592067#comment-15592067
 ] 

Apache Spark commented on SPARK-18029:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15569

> PruneFileSourcePartitions should not change the output of LogicalRelation
> -
>
> Key: SPARK-18029
> URL: https://issues.apache.org/jira/browse/SPARK-18029
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18029:


Assignee: Wenchen Fan  (was: Apache Spark)

> PruneFileSourcePartitions should not change the output of LogicalRelation
> -
>
> Key: SPARK-18029
> URL: https://issues.apache.org/jira/browse/SPARK-18029
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18029) PruneFileSourcePartitions should not change the output of LogicalRelation

2016-10-20 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18029:
---

 Summary: PruneFileSourcePartitions should not change the output of 
LogicalRelation
 Key: SPARK-18029
 URL: https://issues.apache.org/jira/browse/SPARK-18029
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD

2016-10-20 Thread Nick Orka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592030#comment-15592030
 ] 

Nick Orka commented on SPARK-9219:
--

I've made a CLONE for the JIRA ticket here 
https://issues.apache.org/jira/browse/SPARK-18015

> ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
> ---
>
> Key: SPARK-9219
> URL: https://issues.apache.org/jira/browse/SPARK-9219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Mohsen Zainalpour
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scal

[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD

2016-10-20 Thread Nick Orka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592041#comment-15592041
 ] 

Nick Orka commented on SPARK-9219:
--

I'm using IntelliJ IDEA. Here is the whole dependency tree (.iml file):
{code:xml}
<!-- The XML content of the .iml file was stripped from the archived message and is not recoverable. -->
{code}


> ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
> ---
>
> Key: SPARK-9219
> URL: https://issues.apache.org/jira/browse/SPARK-9219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Mohsen Zainalpour
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.

[jira] [Issue Comment Deleted] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD

2016-10-20 Thread Nick Orka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Orka updated SPARK-9219:
-
Comment: was deleted

(was: I've made a CLONE for the JIRA ticket here 
https://issues.apache.org/jira/browse/SPARK-18015)

> ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
> ---
>
> Key: SPARK-9219
> URL: https://issues.apache.org/jira/browse/SPARK-9219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Mohsen Zainalpour
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.ref

[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2016-10-20 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson updated SPARK-18016:

Description: 
When attempting to encode collections of large Java objects to Datasets having 
very wide or deeply nested schemas, code generation can fail, yielding:

{code}
Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
 has grown past JVM limit of 0x
at 
org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
at 
org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
at 
org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
at 
org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
at 
org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
at 
org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
at 
org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
at 
org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:905)
... 35 more
{code}

During generation of the code for Spec

[jira] [Comment Edited] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)

2016-10-20 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592000#comment-15592000
 ] 

Aleksander Eskilson edited comment on SPARK-17131 at 10/20/16 2:45 PM:
---

Yeah, that makes sense. So far, what I documented and this one seem to have 
been the only JIRAs that exhibit specifically the Constant Pool limit error. 
I'm trying to dig deeper into it to see if it really marks its own class of 
error, but given that SPARK-17702 didn't resolve the error case I posted (even 
though it splits up sections of large generated code), I do suspect they are 
quite related but ultimately different issues. I think the splitExpressions 
technique that was used in SPARK-17702 and that also appears to be 
employed in SPARK-16845 could be useful for the range of different classes that 
can generate too many lines of code. Seeing the issues linked together is 
definitely useful.

To that end, I'll leave mine resolved as a duplicate of SPARK-16845 for now 
until I can make use of the patch it develops, so we can see more conclusively 
if they're related issues, or truly duplicates. And I'll link the two "0x" 
issues together as related.


was (Author: aeskilson):
Yeah, that makes sense. So far, what I documented and this one seem to have 
been the only JIRAs that exhibit specifically the Constant Pool limit error. 
I'm trying to dig deeper into it to see if it really marks its own class of 
error, but given that SPARK-17702 didn't resolve the error case I posted (even 
though it splits up sections of large generated code), I do suspect they are, 
quite related, but ultimately different issues. I think the spliExpressions 
technique that was used in SPARK-17702 and that also appears to be being 
employed in SPARK-16845 could be useful for the range of different classes that 
can generate too many lines of code. Seeing the issues linked together is 
definitely useful.

To that end, I'll leave mine resolved as a duplicate of SPARK-16845 for now 
until I can make use of the patch it develops, so we can see more conclusively 
if they're related issues, or truly duplicates. And I'll link the two "0x" 
issues together as related.

> Code generation fails when running SQL expressions against a wide dataset 
> (thousands of columns)
> 
>
> Key: SPARK-17131
> URL: https://issues.apache.org/jira/browse/SPARK-17131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Iaroslav Zeigerman
> Attachments: 
> _SPARK_17131__add_a_test_case_with_1000_column_DF_where_describe___fails.patch
>
>
> When reading a CSV file that contains 1776 columns, Spark and Janino fail to 
> generate the code with the message:
> {noformat}
> Constant pool has grown past JVM limit of 0x
> {noformat}
> When running a common select with all columns it's fine:
> {code}
>   val allCols = df.columns.map(c => col(c).as(c + "_alias"))
>   val newDf = df.select(allCols: _*)
>   newDf.show()
> {code}
> But when I invoke the describe method:
> {code}
> newDf.describe(allCols: _*)
> {code}
> it fails with the following stack trace:
> {noformat}
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   ... 30 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has 
> grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300)
>   at 
> org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307)
>   at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346)
>   at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265)
>   at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at org.codehaus.janino.UnitCompiler.fakeCompile(Unit

[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)

2016-10-20 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592000#comment-15592000
 ] 

Aleksander Eskilson commented on SPARK-17131:
-

Yeah, that makes sense. So far, what I documented and this one seem to have 
been the only JIRAs that exhibit specifically the Constant Pool limit error. 
I'm trying to dig deeper into it to see if it really marks its own class of 
error, but given that SPARK-17702 didn't resolve the error case I posted (even 
though it splits up sections of large generated code), I do suspect they are 
quite related but ultimately different issues. I think the splitExpressions 
technique that was used in SPARK-17702 and that also appears to be 
employed in SPARK-16845 could be useful for the range of different classes that 
can generate too many lines of code. Seeing the issues linked together is 
definitely useful.

To that end, I'll leave mine resolved as a duplicate of SPARK-16845 for now 
until I can make use of the patch it develops, so we can see more conclusively 
if they're related issues, or truly duplicates. And I'll link the two "0x" 
issues together as related.

> Code generation fails when running SQL expressions against a wide dataset 
> (thousands of columns)
> 
>
> Key: SPARK-17131
> URL: https://issues.apache.org/jira/browse/SPARK-17131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Iaroslav Zeigerman
> Attachments: 
> _SPARK_17131__add_a_test_case_with_1000_column_DF_where_describe___fails.patch
>
>
> When reading a CSV file that contains 1776 columns, Spark and Janino fail to 
> generate the code with the message:
> {noformat}
> Constant pool has grown past JVM limit of 0x
> {noformat}
> When running a common select with all columns it's fine:
> {code}
>   val allCols = df.columns.map(c => col(c).as(c + "_alias"))
>   val newDf = df.select(allCols: _*)
>   newDf.show()
> {code}
> But when I invoke the describe method:
> {code}
> newDf.describe(allCols: _*)
> {code}
> it fails with the following stack trace:
> {noformat}
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   ... 30 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has 
> grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300)
>   at 
> org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307)
>   at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346)
>   at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265)
>   at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975)
>   at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662)
>   at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
>   at org.codehaus.janino.UnitCompiler.compile2(Un

[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)

2016-10-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591953#comment-15591953
 ] 

Sean Owen commented on SPARK-17131:
---

OK, well, I think it's fine to leave one copy of the "0xFFFF" issue open if you 
have reasonable grounds to suspect it's different, and just link the JIRAs. I 
suppose I was mostly saying this could simply be reopened; separately, there 
are a lot of real duplicates of similar issues out there too, which makes it 
hard to figure out what the underlying unique issues are.

> Code generation fails when running SQL expressions against a wide dataset 
> (thousands of columns)
> 
>
> Key: SPARK-17131
> URL: https://issues.apache.org/jira/browse/SPARK-17131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Iaroslav Zeigerman
> Attachments: 
> _SPARK_17131__add_a_test_case_with_1000_column_DF_where_describe___fails.patch
>
>
> When reading a CSV file that contains 1776 columns, Spark and Janino fail to 
> generate the code, with the message:
> {noformat}
> Constant pool has grown past JVM limit of 0xFFFF
> {noformat}
> When running a plain select with all columns, it's fine:
> {code}
>   val allCols = df.columns.map(c => col(c).as(c + "_alias"))
>   val newDf = df.select(allCols: _*)
>   newDf.show()
> {code}
> But when I invoke the describe method:
> {code}
> newDf.describe(allCols: _*)
> {code}
> it fails with the following stack trace:
> {noformat}
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   ... 30 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has 
> grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300)
>   at 
> org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307)
>   at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346)
>   at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265)
>   at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975)
>   at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662)
>   at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643)
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2016-10-20 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson updated SPARK-18016:

Description: 
When attempting to encode collections of large Java objects to Datasets having 
very wide or deeply nested schemas, code generation can fail, yielding:

{code}
Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
 has grown past JVM limit of 0xFFFF
at 
org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
at 
org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
at 
org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
at 
org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
at 
org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
at 
org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
at 
org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
at 
org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
at 
org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:905)
... 35 more
{code}

During generation of the code for Spec
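
To make the failing scenario easier to picture, here is a minimal sketch of the 
kind of nested case-class encoding the description refers to. All names are 
hypothetical, and at this small scale the encoding succeeds; the report is that 
sufficiently wide or deeply nested schemas push the generated projection past 
the constant-pool limit.

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical classes, only to illustrate a nested object graph.
case class Leaf(a: String, b: String, c: String, d: String)
case class Branch(l1: Leaf, l2: Leaf, l3: Leaf, l4: Leaf)
case class Tree(b1: Branch, b2: Branch, b3: Branch, b4: Branch)

object NestedEncodingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-encoding-sketch")
      .getOrCreate()
    import spark.implicits._

    val leaf   = Leaf("a", "b", "c", "d")
    val branch = Branch(leaf, leaf, leaf, leaf)
    val trees  = Seq(Tree(branch, branch, branch, branch))

    // Encoding works at this size; widening or deepening the object graph far
    // enough is what reportedly drives the generated SpecificUnsafeProjection
    // past the 0xFFFF constant-pool limit.
    trees.toDS().show(truncate = false)

    spark.stop()
  }
}
{code}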

[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2016-10-20 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson updated SPARK-18016:

Summary: Code Generation: Constant Pool Past Limit for Wide/Nested Dataset  
(was: Code Generation Fails When Encoding Large Object to Wide Dataset)

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(

[jira] [Resolved] (SPARK-18016) Code Generation Fails When Encoding Large Object to Wide Dataset

2016-10-20 Thread Aleksander Eskilson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksander Eskilson resolved SPARK-18016.
-
Resolution: Duplicate

> Code Generation Fails When Encoding Large Object to Wide Dataset
> 
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>

[jira] [Commented] (SPARK-18016) Code Generation Fails When Encoding Large Object to Wide Dataset

2016-10-20 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591814#comment-15591814
 ] 

Aleksander Eskilson commented on SPARK-18016:
-

Per the discussion in SPARK-17131, I'm marking this issue as a potential 
duplicate of SPARK-16845 so we can see whether its resolution solves the same 
problem and can track how these bugs may be related. We can reopen if 
necessary.

> Code Generation Fails When Encoding Large Object to Wide Dataset
> 
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>  

[jira] [Commented] (SPARK-17131) Code generation fails when running SQL expressions against a wide dataset (thousands of columns)

2016-10-20 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591810#comment-15591810
 ] 

Aleksander Eskilson commented on SPARK-17131:
-

Sure, I apologize for that. I'll also mark it as a duplicate of SPARK-16845 and 
monitor its pull request to see whether it resolves the issue I opened.

> Code generation fails when running SQL expressions against a wide dataset 
> (thousands of columns)
> 
>
> Key: SPARK-17131
> URL: https://issues.apache.org/jira/browse/SPARK-17131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Iaroslav Zeigerman
> Attachments: 
> _SPARK_17131__add_a_test_case_with_1000_column_DF_where_describe___fails.patch
>
>
> When reading a CSV file that contains 1776 columns, Spark and Janino fail to 
> generate the code, with the message:
> {noformat}
> Constant pool has grown past JVM limit of 0xFFFF
> {noformat}
> When running a plain select with all columns, it's fine:
> {code}
>   val allCols = df.columns.map(c => col(c).as(c + "_alias"))
>   val newDf = df.select(allCols: _*)
>   newDf.show()
> {code}
> But when I invoke the describe method:
> {code}
> newDf.describe(allCols: _*)
> {code}
> it fails with the following stack trace:
> {noformat}
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   ... 30 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool has 
> grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:402)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantIntegerInfo(ClassFile.java:300)
>   at 
> org.codehaus.janino.UnitCompiler.addConstantIntegerInfo(UnitCompiler.java:10307)
>   at org.codehaus.janino.UnitCompiler.pushConstant(UnitCompiler.java:8868)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4346)
>   at org.codehaus.janino.UnitCompiler.access$7100(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitIntegerLiteral(UnitCompiler.java:3265)
>   at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at org.codehaus.janino.UnitCompiler.fakeCompile(UnitCompiler.java:2605)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4362)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3975)
>   at org.codehaus.janino.UnitCompiler.access$6900(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitMethodInvocation(UnitCompiler.java:3263)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2662)
>   at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$7.visitMethodInvocation(UnitCompiler.java:2627)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:3974)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643)
> 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18028) simplify TableFileCatalog

2016-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18028:


Assignee: Apache Spark  (was: Wenchen Fan)

> simplify TableFileCatalog
> -
>
> Key: SPARK-18028
> URL: https://issues.apache.org/jira/browse/SPARK-18028
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18028) simplify TableFileCatalog

2016-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591794#comment-15591794
 ] 

Apache Spark commented on SPARK-18028:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15568

> simplify TableFileCatalog
> -
>
> Key: SPARK-18028
> URL: https://issues.apache.org/jira/browse/SPARK-18028
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


