[jira] [Commented] (SPARK-20310) Dependency convergence error for scala-xml

2017-04-12 Thread Samik R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967146#comment-15967146
 ] 

Samik R commented on SPARK-20310:
-

Hi Sean,

Thanks for your comments. You are probably assuming that I am using Scala 
2.11.0 based on the line ["+-org.scala-lang:scalap:2.11.0"], but that entry 
comes from the json4s-jackson dependency of spark-core. I am actually on Scala 
2.11.8 on this box and have that Scala version as a library dependency in the 
pom file as well. 

I also agree that this may not actually cause a problem (which is why I filed 
it as a minor bug). I was just planning to pin the dependency to the latest 
version and hope things work fine. 

Thanks.
-Samik

> Dependency convergence error for scala-xml
> --
>
> Key: SPARK-20310
> URL: https://issues.apache.org/jira/browse/SPARK-20310
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Samik R
>Priority: Minor
>
> Hi,
> I am trying to compile a package (apache tinkerpop) which has spark-core as 
> one of the dependencies. I am trying to compile with v2.1.0. But when I run 
> maven build through a dependency checker, it is showing a dependency error 
> within the spark-core itself for scala-xml package, as below:
> Dependency convergence error for org.scala-lang.modules:scala-xml_2.11:1.0.1 
> paths to dependency are:
> +-org.apache.tinkerpop:spark-gremlin:3.2.3
>   +-org.apache.spark:spark-core_2.11:2.1.0
> +-org.json4s:json4s-jackson_2.11:3.2.11
>   +-org.json4s:json4s-core_2.11:3.2.11
> +-org.scala-lang:scalap:2.11.0
>   +-org.scala-lang:scala-compiler:2.11.0
> +-org.scala-lang.modules:scala-xml_2.11:1.0.1
> and
> +-org.apache.tinkerpop:spark-gremlin:3.2.3
>   +-org.apache.spark:spark-core_2.11:2.1.0
> +-org.apache.spark:spark-tags_2.11:2.1.0
>   +-org.scalatest:scalatest_2.11:2.2.6
> +-org.scala-lang.modules:scala-xml_2.11:1.0.2
> Can this be fixed?
> Thanks.
> -Samik






[jira] [Commented] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax

2017-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967145#comment-15967145
 ] 

Apache Spark commented on SPARK-20316:
--

User 'ouyangxiaochen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17628

> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
> -
>
> Key: SPARK-20316
> URL: https://issues.apache.org/jira/browse/SPARK-20316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark2.1.0
>Reporter: Xiaochen Ouyang
>
> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax.
>   private var prompt = "spark-sql"
>   private var continuedPrompt = "".padTo(prompt.length, ' ')
> If a variable is never reassigned, it should be declared with 'val'; 
> otherwise use 'var'.






[jira] [Assigned] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax

2017-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20316:


Assignee: Apache Spark

> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
> -
>
> Key: SPARK-20316
> URL: https://issues.apache.org/jira/browse/SPARK-20316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark2.1.0
>Reporter: Xiaochen Ouyang
>Assignee: Apache Spark
>
> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax.
>   private var prompt = "spark-sql"
>   private var continuedPrompt = "".padTo(prompt.length, ' ')
> If a variable is never reassigned, it should be declared with 'val'; 
> otherwise use 'var'.






[jira] [Assigned] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax

2017-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20316:


Assignee: (was: Apache Spark)

> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax
> -
>
> Key: SPARK-20316
> URL: https://issues.apache.org/jira/browse/SPARK-20316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark2.1.0
>Reporter: Xiaochen Ouyang
>
> In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax.
>   private var prompt = "spark-sql"
>   private var continuedPrompt = "".padTo(prompt.length, ' ')
> If a variable is never reassigned, it should be declared with 'val'; 
> otherwise use 'var'.






[jira] [Created] (SPARK-20316) In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax

2017-04-12 Thread Xiaochen Ouyang (JIRA)
Xiaochen Ouyang created SPARK-20316:
---

 Summary: In SparkSQLCLIDriver, val and var should strictly follow 
the Scala syntax
 Key: SPARK-20316
 URL: https://issues.apache.org/jira/browse/SPARK-20316
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
 Environment: Spark2.1.0
Reporter: Xiaochen Ouyang


In SparkSQLCLIDriver, val and var should strictly follow the Scala syntax.
  private var prompt = "spark-sql"
  private var continuedPrompt = "".padTo(prompt.length, ' ')

If a variable is never reassigned, it should be declared with 'val'; otherwise 
use 'var'.
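
A minimal sketch of the suggested change, assuming the two fields are never 
reassigned anywhere else in SparkSQLCLIDriver (if they are, 'var' has to stay):

{code}
// Declared with 'val' because the values never change after initialization.
private val prompt = "spark-sql"
private val continuedPrompt = "".padTo(prompt.length, ' ')
{code}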







[jira] [Commented] (SPARK-19924) Handle InvocationTargetException for all Hive Shim

2017-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967142#comment-15967142
 ] 

Apache Spark commented on SPARK-19924:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17627

> Handle InvocationTargetException for all Hive Shim
> --
>
> Key: SPARK-19924
> URL: https://issues.apache.org/jira/browse/SPARK-19924
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Since we are using a shim for most Hive metastore APIs, the exceptions thrown 
> by the underlying methods called via Method.invoke() are wrapped in 
> `InvocationTargetException`. Instead of unwrapping them one by one, we should 
> handle all of them in `withClient`. If any of them is missed, the error 
> message can look unfriendly.
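
A hedged sketch of the unwrapping pattern described above (the helper shape is 
illustrative, not the actual HiveClientImpl code):

{code}
import java.lang.reflect.InvocationTargetException

// Run a reflective shim call and rethrow the real cause, so callers see the
// underlying Hive error instead of the reflection wrapper.
def withClient[T](body: => T): T =
  try body
  catch {
    case e: InvocationTargetException => throw e.getCause
  }
{code}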






[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-12 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966973#comment-15966973
 ] 

Stephane Maarek commented on SPARK-20287:
-

[~c...@koeninger.org] 
How about using the subscribe pattern?
https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

```
public void subscribe(Collection<String> topics)
Subscribe to the given list of topics to get dynamically assigned partitions. 
Topic subscriptions are not incremental. This list will replace the current 
assignment (if there is one). It is not possible to combine topic subscription 
with group management with manual partition assignment through 
assign(Collection). If the given list of topics is empty, it is treated the 
same as unsubscribe().
```

Then you let Kafka handle the partition assignments? As all the consumers share 
the same group.id, the data will be effectively distributed between every Spark 
instance?

But then I guess you may have already explored that option and it goes against 
the Spark DirectStream API? (not a Spark expert, just trying to understand the 
limitations. I believe you when you say you did it the most straightforward way)
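
For reference, a minimal sketch of the group-management style described above, 
against the Kafka 0.10 consumer API (the broker address and topic names are 
placeholders):

{code}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
props.put("group.id", "shared-spark-group")        // every instance shares this group.id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
// Subscribe to whole topics; the broker's group coordinator assigns and
// rebalances partitions across all consumers in the group.
consumer.subscribe(Seq("topicA", "topicB").asJava)
{code}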

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly 
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that it 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure about how that translates to the spark underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss to understand the rationale






[jira] [Resolved] (SPARK-20131) Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly

2017-04-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20131.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.2.0
   2.1.1

> Flaky Test: o.a.s.streaming.StreamingContextSuite.SPARK-18560 Receiver data 
> should be deserialized properly
> ---
>
> Key: SPARK-20131
> URL: https://issues.apache.org/jira/browse/SPARK-20131
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Assignee: Shixiong Zhu
>Priority: Minor
>  Labels: flaky-test
> Fix For: 2.1.1, 2.2.0
>
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/2861/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/SPARK_18560_Receiver_data_should_be_deserialized_properly_/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.StreamingContextSuite_name=SPARK-18560+Receiver+data+should+be+deserialized+properly.
> Error Message
> {code}
> latch.await(60L, SECONDS) was false
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: latch.await(60L, SECONDS) was 
> false
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply$mcV$sp(StreamingContextSuite.scala:837)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$43.apply(StreamingContextSuite.scala:810)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.streaming.StreamingContextSuite.org$scalatest$BeforeAndAfter$$super$runTest(StreamingContextSuite.scala:44)
>   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200)
>   at 
> org.apache.spark.streaming.StreamingContextSuite.runTest(StreamingContextSuite.scala:44)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at 
> 

[jira] [Commented] (SPARK-20036) impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0

2017-04-12 Thread Daniel Nuriyev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966939#comment-15966939
 ] 

Daniel Nuriyev commented on SPARK-20036:


Thank you, Cody. I will do as you say and report what happens.

> impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0 
> 
>
> Key: SPARK-20036
> URL: https://issues.apache.org/jira/browse/SPARK-20036
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: Main.java, pom.xml
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read the whole topic using:
> kafkaParams.put("auto.offset.reset", "earliest");
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> Each batch returns empty.
> When I debugged the code, I noticed that KafkaUtils.fixKafkaParams is called, 
> which overrides "earliest" with "none".
> Whether this is related or not, when I used kafka 0.8 on the client with 
> kafka 0.10.1 on the server, I could read the whole topic.






[jira] [Assigned] (SPARK-20315) Set ScalaUDF's deterministic to true

2017-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20315:


Assignee: Apache Spark  (was: Xiao Li)

> Set ScalaUDF's deterministic to true
> 
>
> Key: SPARK-20315
> URL: https://issues.apache.org/jira/browse/SPARK-20315
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> ScalaUDF is always assumed to be deterministic. However, the current master 
> still sets it based on the children's deterministic values. 






[jira] [Assigned] (SPARK-20315) Set ScalaUDF's deterministic to true

2017-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20315:


Assignee: Xiao Li  (was: Apache Spark)

> Set ScalaUDF's deterministic to true
> 
>
> Key: SPARK-20315
> URL: https://issues.apache.org/jira/browse/SPARK-20315
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> ScalaUDF is always assumed to be deterministic. However, the current master 
> still sets it based on the children's deterministic values. 






[jira] [Commented] (SPARK-20315) Set ScalaUDF's deterministic to true

2017-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966938#comment-15966938
 ] 

Apache Spark commented on SPARK-20315:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17626

> Set ScalaUDF's deterministic to true
> 
>
> Key: SPARK-20315
> URL: https://issues.apache.org/jira/browse/SPARK-20315
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> ScalaUDF is always assumed to be deterministic. However, the current master 
> still sets it based on the children's deterministic values. 






[jira] [Created] (SPARK-20315) Set ScalaUDF's deterministic to true

2017-04-12 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20315:
---

 Summary: Set ScalaUDF's deterministic to true
 Key: SPARK-20315
 URL: https://issues.apache.org/jira/browse/SPARK-20315
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2
Reporter: Xiao Li
Assignee: Xiao Li


ScalaUDF is always assumed to be deterministic. However, the current master 
still sets it based on the children's deterministic values. 
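
A hedged illustration of the behavior described above (a sketch only; `spark` is 
assumed to be an active SparkSession):

{code}
import org.apache.spark.sql.functions.{rand, udf}

// The UDF body itself is deterministic; Catalyst cannot inspect the closure.
val plusOne = udf((x: Double) => x + 1.0)

// Under the current behavior described in this issue, plusOne(rand()) is treated
// as non-deterministic because its child expression rand() is non-deterministic,
// not because of anything about the UDF itself.
val df = spark.range(5).select(plusOne(rand()))
{code}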






[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter

2017-04-12 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966926#comment-15966926
 ] 

Yan Facai (颜发才) commented on SPARK-20199:
-

It's not hard, and I can work on it.

However, there are two possible solutions:

1. Add a `setFeatureSubsetStrategy` method to DecisionTree, so GBT can create a 
DecisionTree through that method, with code like 
`val dt = new DecisionTreeRegressor().setFeatureSubsetStrategy(xxx)`.

2. Add a `featureSubsetStrategy` param to the `train` method of DecisionTree, 
which requires minimal changes.

Which one is better? I prefer the first.

> GradientBoostedTreesModel doesn't have  Column Sampling Rate Paramenter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark's GradientBoostedTreesModel doesn't have a column sampling rate 
> parameter. This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai: 
> gbmParams._col_sample_rate
> Please provide the parameter. 






[jira] [Created] (SPARK-20314) Inconsistent error handling in JSON parsing SQL functions

2017-04-12 Thread Eric Wasserman (JIRA)
Eric Wasserman created SPARK-20314:
--

 Summary: Inconsistent error handling in JSON parsing SQL functions
 Key: SPARK-20314
 URL: https://issues.apache.org/jira/browse/SPARK-20314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Wasserman


Most parse errors in the JSON parsing SQL functions (e.g. json_tuple, 
get_json_object) return null(s) if the JSON is badly formed. However, if 
Jackson determines that the string includes invalid characters, it throws an 
exception (java.io.CharConversionException: Invalid UTF-32 character) that 
Spark does not catch. This creates a robustness problem: these functions 
cannot be used at all when there may be dirty data, since the exceptions will 
kill the jobs.
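
A minimal sketch of the usage pattern in question (assumes an active 
SparkSession with its implicits imported; the malformed string is only a 
placeholder):

{code}
import spark.implicits._
import org.apache.spark.sql.functions.get_json_object

val df = Seq("""{"a": 1}""", "not json").toDF("js")

// Badly formed JSON comes back as null rather than failing the job...
df.select(get_json_object($"js", "$.a")).show()

// ...but, per this report, a string Jackson rejects outright (e.g. one containing
// invalid UTF-32 characters) throws java.io.CharConversionException instead.
{code}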






[jira] [Comment Edited] (SPARK-1809) Mesos backend doesn't respect HADOOP_CONF_DIR

2017-04-12 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966782#comment-15966782
 ] 

Andrew Ash edited comment on SPARK-1809 at 4/12/17 11:00 PM:
-

I'm not using Mesos anymore, so closing


was (Author: aash):
Not using Mesos anymore, so closing

> Mesos backend doesn't respect HADOOP_CONF_DIR
> -
>
> Key: SPARK-1809
> URL: https://issues.apache.org/jira/browse/SPARK-1809
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>
> In order to use HDFS paths without the server component, standalone mode 
> reads spark-env.sh and scans the HADOOP_CONF_DIR to open core-site.xml and 
> get the fs.default.name parameter.
> This lets you use HDFS paths like:
> - hdfs:///tmp/myfile.txt
> instead of
> - hdfs://myserver.mydomain.com:8020/tmp/myfile.txt
> However as of recent 1.0.0 pre-release (hash 756c96) I had to specify HDFS 
> paths with the full server even though I have HADOOP_CONF_DIR still set in 
> spark-env.sh.  The HDFS, Spark, and Mesos nodes are all co-located and 
> non-domain HDFS paths work fine when using the standalone mode.






[jira] [Closed] (SPARK-1809) Mesos backend doesn't respect HADOOP_CONF_DIR

2017-04-12 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash closed SPARK-1809.
-
Resolution: Unresolved

Not using Mesos anymore, so closing

> Mesos backend doesn't respect HADOOP_CONF_DIR
> -
>
> Key: SPARK-1809
> URL: https://issues.apache.org/jira/browse/SPARK-1809
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>
> In order to use HDFS paths without the server component, standalone mode 
> reads spark-env.sh and scans the HADOOP_CONF_DIR to open core-site.xml and 
> get the fs.default.name parameter.
> This lets you use HDFS paths like:
> - hdfs:///tmp/myfile.txt
> instead of
> - hdfs://myserver.mydomain.com:8020/tmp/myfile.txt
> However as of recent 1.0.0 pre-release (hash 756c96) I had to specify HDFS 
> paths with the full server even though I have HADOOP_CONF_DIR still set in 
> spark-env.sh.  The HDFS, Spark, and Mesos nodes are all co-located and 
> non-domain HDFS paths work fine when using the standalone mode.






[jira] [Commented] (SPARK-9103) Tracking spark's memory usage

2017-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966736#comment-15966736
 ] 

Apache Spark commented on SPARK-9103:
-

User 'jsoltren' has created a pull request for this issue:
https://github.com/apache/spark/pull/17625

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark provides only a little memory usage information (RDD cache on 
> the web UI) for the executors. Users have no idea of the memory consumption 
> when they are running Spark applications that use a lot of memory in the 
> executors. Especially when they encounter an OOM, it's really hard to know 
> the cause of the problem. So it would be helpful to expose detailed memory 
> consumption information for each part of Spark, so that users can have a 
> clear picture of where the memory is actually used. 
> The memory usage info to expose should include, but not be limited to, 
> shuffle, cache, network, serializer, etc.
> Users can optionally enable this functionality since it is mainly for 
> debugging and tuning.






[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-04-12 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966385#comment-15966385
 ] 

Jacques Nadeau commented on SPARK-13534:


Great, thanks [~holdenk]!

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.






[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-04-12 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966339#comment-15966339
 ] 

holdenk commented on SPARK-13534:
-

So I'm following along with the progress on this, I'll try and take a more 
thorough look this Thursday.

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.






[jira] [Resolved] (SPARK-20301) Flakiness in StreamingAggregationSuite

2017-04-12 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-20301.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

> Flakiness in StreamingAggregationSuite
> --
>
> Key: SPARK-20301
> URL: https://issues.apache.org/jira/browse/SPARK-20301
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>  Labels: flaky-test
> Fix For: 2.2.0
>
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.StreamingAggregationSuite






[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966300#comment-15966300
 ] 

Tejas Patil commented on SPARK-20184:
-

Out of curiosity, I tried out a query with ~20 columns (similar to your repro) 
and ran with and without codegen. Input table was 20 TB with 320 billion rows. 
Codegen was 23% faster. In my case, the query ran for more than an hour. This 
was with Spark 2.0 release and not trunk.

When I tried the exact same repro from the jira on my box (this time with 
trunk), running with and without codegen showed no difference. 


> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with whole 
> stage codegen on than with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> Whole stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> Whole stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis I think this is related to the huge Java method (a method 
> thousands of lines long) generated by codegen.
> And if I configure -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).
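
For reference, a hedged sketch of toggling the two knobs mentioned above when 
experimenting with this (property names as used in the report; applying the JVM 
flag to both driver and executors is an assumption about where the generated 
code runs):

{code}
import org.apache.spark.SparkConf

// Option explored in the report: disable whole stage codegen entirely.
val confNoCodegen = new SparkConf()
  .setAppName("codegen-experiment")
  .set("spark.sql.codegen.wholeStage", "false")

// Alternative from the report: keep codegen on but let the JVM JIT-compile
// the huge generated methods.
val confWithJitFlag = new SparkConf()
  .setAppName("codegen-experiment")
  .set("spark.driver.extraJavaOptions", "-XX:-DontCompileHugeMethods")
  .set("spark.executor.extraJavaOptions", "-XX:-DontCompileHugeMethods")
{code}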






[jira] [Resolved] (SPARK-19570) Allow to disable hive in pyspark shell

2017-04-12 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-19570.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Allow to disable hive in pyspark shell
> --
>
> Key: SPARK-19570
> URL: https://issues.apache.org/jira/browse/SPARK-19570
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 2.2.0
>
>
> SPARK-15236 does this for the Scala shell; this ticket is for the PySpark 
> shell. This is not only for PySpark itself, but can also benefit downstream 
> projects like Livy which use shell.py for their interactive sessions. For 
> now, Livy has no control over whether Hive is enabled or not. 






[jira] [Assigned] (SPARK-19570) Allow to disable hive in pyspark shell

2017-04-12 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-19570:
---

Assignee: Jeff Zhang

> Allow to disable hive in pyspark shell
> --
>
> Key: SPARK-19570
> URL: https://issues.apache.org/jira/browse/SPARK-19570
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 2.2.0
>
>
> SPARK-15236 does this for the Scala shell; this ticket is for the PySpark 
> shell. This is not only for PySpark itself, but can also benefit downstream 
> projects like Livy which use shell.py for their interactive sessions. For 
> now, Livy has no control over whether Hive is enabled or not. 






[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-04-12 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966231#comment-15966231
 ] 

Jacques Nadeau commented on SPARK-13534:


Anybody know some committers we can get to look at this?

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.






[jira] [Commented] (SPARK-19976) DirectStream API throws OffsetOutOfRange Exception

2017-04-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966208#comment-15966208
 ] 

Cody Koeninger commented on SPARK-19976:


What would your expected behavior be when you delete data out of kafka before 
the stream has read it?

My expectation would be that it fails.

> DirectStream API throws OffsetOutOfRange Exception
> --
>
> Key: SPARK-19976
> URL: https://issues.apache.org/jira/browse/SPARK-19976
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Taukir
>
> I am using the following code. When data on the kafka topic gets deleted or 
> the retention period is over, it throws an Exception and the application crashes.
> def functionToCreateContext(sc:SparkContext):StreamingContext = {
> val kafkaParams = new mutable.HashMap[String, Object]()
> kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBrokers)
> kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, KafkaConsumerGrp)
> kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, 
> classOf[StringDeserializer])
> kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, 
> classOf[StringDeserializer])
> kafkaParams.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG,"true")
> kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
>val consumerStrategy = ConsumerStrategies.Subscribe[String, 
> String](topic.split(",").map(_.trim).filter(!_.isEmpty).toSet, kafkaParams)
> val kafkaStream  = 
> KafkaUtils.createDirectStream(ssc,LocationStrategies.PreferConsistent,consumerStrategy)
> }
> Spark throws an error and crashes once OffsetOutOfRangeException is thrown:
> WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 : 
> org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
> range with no configured reset policy for partitions: {test-2=127287}
> at 
> org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
> at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:98)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)






[jira] [Commented] (SPARK-15354) Topology aware block replication strategies

2017-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966203#comment-15966203
 ] 

Apache Spark commented on SPARK-15354:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17624

> Topology aware block replication strategies
> ---
>
> Key: SPARK-15354
> URL: https://issues.apache.org/jira/browse/SPARK-15354
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>Assignee: Shubham Chopra
> Fix For: 2.2.0
>
>
> Implementations of strategies for resilient block replication for different 
> resource managers that replicate the 3-replica strategy used by HDFS, where 
> the first replica is on an executor, the second replica within the same rack 
> as the executor and a third replica on a different rack. 
> The implementation involves providing two pluggable classes, one running in 
> the driver that provides topology information for every host at cluster start 
> and the second prioritizing a list of peer BlockManagerIds. 






[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2017-04-12 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966201#comment-15966201
 ] 

holdenk commented on SPARK-20202:
-

Oh right, sorry I was misreading the intent of Affects Version/s.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.






[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0

2017-04-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966199#comment-15966199
 ] 

Cody Koeninger commented on SPARK-20037:


I'd be inclined to say this is a duplicate of the issue in SPARK-20036, and 
would start investigating the same way, i.e. remove the explicit dependencies on 
org.apache.kafka artifacts.

> impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
> 
>
> Key: SPARK-20037
> URL: https://issues.apache.org/jira/browse/SPARK-20037
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: Main.java, offsets.png
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read a topic starting from given offsets. 
> The topic has 4 partitions that start somewhere before 585000 and end after 
> 674000, so I wanted to read all partitions starting at 585000:
> fromOffsets.put(new TopicPartition(topic, 0), 585000L);
> fromOffsets.put(new TopicPartition(topic, 1), 585000L);
> fromOffsets.put(new TopicPartition(topic, 2), 585000L);
> fromOffsets.put(new TopicPartition(topic, 3), 585000L);
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> The code immediately throws:
> Beginning offset 585000 is after the ending offset 584464 for topic 
> commerce_item_expectation partition 1
> This does not make sense, because this topic/partition starts at 584464; it 
> does not end there.
> I use this as a base: 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> But I use direct stream:
> KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(
> topics, kafkaParams, fromOffsets
> )
> )






[jira] [Commented] (SPARK-20036) impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0

2017-04-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966193#comment-15966193
 ] 

Cody Koeninger commented on SPARK-20036:


fixKafkaParams is related to executor consumers, not the driver consumer.  In a 
nutshell, the executor should read the offsets the driver tells it to, not auto 
reset.

Remove the explicit dependencies in your pom on org.apache.kafka artifacts.  
spark-streaming-kafka-0-10 has appropriate transitive dependencies.

> impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0 
> 
>
> Key: SPARK-20036
> URL: https://issues.apache.org/jira/browse/SPARK-20036
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: Main.java, pom.xml
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> The code tries to read the whole topic using:
> kafkaParams.put("auto.offset.reset", "earliest");
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> Each batch returns empty.
> When I debugged the code, I noticed that KafkaUtils.fixKafkaParams is called, 
> which overrides "earliest" with "none".
> Whether this is related or not, when I used kafka 0.8 on the client with 
> kafka 0.10.1 on the server, I could read the whole topic.






[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-12 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966180#comment-15966180
 ] 

Cody Koeninger commented on SPARK-20287:


The issue here is that the underlying new Kafka consumer api doesn't have a way 
for a single consumer to subscribe to multiple partitions, but only read a 
particular range of messages from one of them.

The max capacity is just a simple way of dealing with what is basically an LRU 
cache - if someone creates topics dynamically and then stops sending messages 
to them, you don't want to keep leaking resources.

I'm not claiming there's anything great or elegant about those solutions, but 
they were pretty much the most straightforward way to make the direct stream 
model work with the new kafka consumer api.
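
A hedged sketch of raising the cache capacity referenced in the issue below, for 
anyone actually hitting eviction (the value 128 is arbitrary; the property name 
is taken from the report):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-direct-stream")
  // Capacity of the cache of per-topic-partition Kafka consumers on the executors;
  // raise it if one executor handles more partitions than the default allows.
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")
{code}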

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly 
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that it 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure about how that translates to the spark underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss to understand the rationale






[jira] [Updated] (SPARK-20291) NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, DoubleType)

2017-04-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20291:

Fix Version/s: 2.0.3

> NaNvl(FloatType, NullType) should not be cast to NaNvl(DoubleType, 
> DoubleType) 
> ---
>
> Key: SPARK-20291
> URL: https://issues.apache.org/jira/browse/SPARK-20291
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> `NaNvl(float value, null)` will be converted into `NaNvl(float value, 
> Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), 
> Cast(null, DoubleType))`.
> This causes a mismatch in the output type when the input type is float.
> Adding an extra rule in TypeCoercion resolves this issue.
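
A hedged illustration of the expression pattern described above (a sketch only; 
assumes an active SparkSession with its implicits imported):

{code}
import spark.implicits._
import org.apache.spark.sql.functions.{lit, nanvl}

val df = Seq(1.0f, Float.NaN).toDF("f")

// Per the description, nanvl on a float column with a null second argument ends
// up as NaNvl(Cast(f, DoubleType), Cast(null, DoubleType)), so the result column
// is reported as double rather than float.
df.select(nanvl($"f", lit(null))).printSchema()
{code}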






[jira] [Updated] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition

2017-04-12 Thread Albert Meltzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Meltzer updated SPARK-20313:
---
Description: 
Given two tables T1 and T2, partitioned on column part1, the following have 
vastly different execution performance:

// initial, slow
{noformat}
val df1 = // load data from T1
  .filter(functions.col("part1").between("val1", "val2"))
val df2 = // load data from T2
  .filter(functions.col("part1").between("val1", "val2"))
val df3 = df1.join(df2, Seq("part1", "col1"))
{noformat}

// manually optimized, considerably faster
{noformat}
val df1 = // load data from T1
val df2 = // load data from T2
val part1values = Seq(...) // a collection of values between val1 and val2
val df3 = part1values
  .map(part1value => {
val df1filtered = df1.filter(functions.col("part1") === part1value)
val df2filtered = df2.filter(functions.col("part1") === part1value)
df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join
  })
  .reduce(_ union _)
{noformat}

  was:
Given two tables T1 and T2, partitioned on column part1, the following have 
vastly different execution performance:

// initial, slow
{noformat}
val df1 = // load data from T1
  .filter(functions.col("part1").between("val1", "val2")
val df2 = // load data from T2
  .filter(functions.col("part1").between("val1", "val2")
val df3 = df1.join(df2, Seq("part1", "col1"))
{noformat}

// manually optimized, considerably faster
{noformat}
val df1 = // load data from T1
val df2 = // load data from T2
val part1values = Seq(...) // a collection of values between val1 and val2
val df3 = part1values
  .map(part1value => {
val df3 = df1.join(df2, Seq("part1", "col1"))
val df1filtered = df1.filter(functions.col("part1") === part1value)
val df2filtered = df2.filter(functions.col("part1") === part1value)
df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join
  })
  .reduce(_ union _)
{noformat}


> Possible lack of join optimization when partitions are in the join condition
> 
>
> Key: SPARK-20313
> URL: https://issues.apache.org/jira/browse/SPARK-20313
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
>
> Given two tables T1 and T2, partitioned on column part1, the following have 
> vastly different execution performance:
> // initial, slow
> {noformat}
> val df1 = // load data from T1
>   .filter(functions.col("part1").between("val1", "val2"))
> val df2 = // load data from T2
>   .filter(functions.col("part1").between("val1", "val2"))
> val df3 = df1.join(df2, Seq("part1", "col1"))
> {noformat}
> // manually optimized, considerably faster
> {noformat}
> val df1 = // load data from T1
> val df2 = // load data from T2
> val part1values = Seq(...) // a collection of values between val1 and val2
> val df3 = part1values
>   .map(part1value => {
> val df1filtered = df1.filter(functions.col("part1") === part1value)
> val df2filtered = df2.filter(functions.col("part1") === part1value)
> df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join
>   })
>   .reduce(_ union _)
> {noformat}






[jira] [Created] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition

2017-04-12 Thread Albert Meltzer (JIRA)
Albert Meltzer created SPARK-20313:
--

 Summary: Possible lack of join optimization when partitions are in 
the join condition
 Key: SPARK-20313
 URL: https://issues.apache.org/jira/browse/SPARK-20313
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 2.1.0
Reporter: Albert Meltzer


Given two tables T1 and T2, partitioned on column part1, the following have 
vastly different execution performance:

// initial
{noformat}
val df1 = // load data from T1
  .filter(functions.col("part1").between("val1", "val2"))
val df2 = // load data from T2
  .filter(functions.col("part1").between("val1", "val2"))
val df3 = df1.join(df2, Seq("part1", "col1"))
{noformat}

// manually optimized
{noformat}
val df1 = // load data from T1
val df2 = // load data from T2
val part1values = Seq(...) // a collection of values between val1 and val2
val df3 = part1values
  .map(part1value => {
val df3 = df1.join(df2, Seq("part1", "col1"))
val df1filtered = df1.filter(functions.col("part1") === part1value)
val df2filtered = df2.filter(functions.col("part1") === part1value)
df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join
  })
  .reduce(_ union _)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition

2017-04-12 Thread Albert Meltzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Meltzer updated SPARK-20313:
---
Description: 
Given two tables T1 and T2, partitioned on column part1, the following have 
vastly different execution performance:

// initial, slow
{noformat}
val df1 = // load data from T1
  .filter(functions.col("part1").between("val1", "val2"))
val df2 = // load data from T2
  .filter(functions.col("part1").between("val1", "val2"))
val df3 = df1.join(df2, Seq("part1", "col1"))
{noformat}

// manually optimized, considerably faster
{noformat}
val df1 = // load data from T1
val df2 = // load data from T2
val part1values = Seq(...) // a collection of values between val1 and val2
val df3 = part1values
  .map(part1value => {
val df3 = df1.join(df2, Seq("part1", "col1"))
val df1filtered = df1.filter(functions.col("part1") === part1value)
val df2filtered = df2.filter(functions.col("part1") === part1value)
df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join
  })
  .reduce(_ union _)
{noformat}

  was:
Given two tables T1 and T2, partitioned on column part1, the following have 
vastly different execution performance:

// initial
{noformat}
val df1 = // load data from T1
  .filter(functions.col("part1").between("val1", "val2")
val df2 = // load data from T2
  .filter(functions.col("part1").between("val1", "val2")
val df3 = df1.join(df2, Seq("part1", "col1"))
{noformat}

// manually optimized
{noformat}
val df1 = // load data from T1
val df2 = // load data from T2
val part1values = Seq(...) // a collection of values between val1 and val2
val df3 = part1values
  .map(part1value => {
val df3 = df1.join(df2, Seq("part1", "col1"))
val df1filtered = df1.filter(functions.col("part1") === part1value)
val df2filtered = df2.filter(functions.col("part1") === part1value)
df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join
  })
  .reduce(_ union _)
{noformat}


> Possible lack of join optimization when partitions are in the join condition
> 
>
> Key: SPARK-20313
> URL: https://issues.apache.org/jira/browse/SPARK-20313
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
>
> Given two tables T1 and T2, partitioned on column part1, the following have 
> vastly different execution performance:
> // initial, slow
> {noformat}
> val df1 = // load data from T1
>   .filter(functions.col("part1").between("val1", "val2"))
> val df2 = // load data from T2
>   .filter(functions.col("part1").between("val1", "val2"))
> val df3 = df1.join(df2, Seq("part1", "col1"))
> {noformat}
> // manually optimized, considerably faster
> {noformat}
> val df1 = // load data from T1
> val df2 = // load data from T2
> val part1values = Seq(...) // a collection of values between val1 and val2
> val df3 = part1values
>   .map(part1value => {
> val df3 = df1.join(df2, Seq("part1", "col1"))
> val df1filtered = df1.filter(functions.col("part1") === part1value)
> val df2filtered = df2.filter(functions.col("part1") === part1value)
> df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join
>   })
>   .reduce(_ union _)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20304) AssertNotNull should not include path in string representation

2017-04-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20304.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.2

> AssertNotNull should not include path in string representation
> --
>
> Key: SPARK-20304
> URL: https://issues.apache.org/jira/browse/SPARK-20304
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.2, 2.2.0
>
>
> AssertNotNull's toString/simpleString dumps the entire walkedTypePath. 
> walkedTypePath is used for error message reporting and shouldn't be part of 
> the output.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20303) Rename createTempFunction to registerFunction

2017-04-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20303.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Rename createTempFunction to registerFunction
> -
>
> Key: SPARK-20303
> URL: https://issues.apache.org/jira/browse/SPARK-20303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Session catalog API `createTempFunction` is being used by Hive build-in 
> functions, persistent functions, and temporary functions. Thus, the name is 
> confusing. This PR is to replace it by `registerFunction`. Also we can move 
> construction of `FunctionBuilder` and `ExpressionInfo` into the new 
> `registerFunction`, instead of duplicating the logics everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them

2017-04-12 Thread Albert Meltzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Meltzer updated SPARK-20312:
---
Description: 
When optimizing an outer join, Spark passes an empty row to both sides to see 
if nulls would be ignored (side comment: for half-outer joins it subsequently 
ignores the assessment on the dominant side).

For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT NULL}} 
might result in the right side being checked first, and in an exception if the udf 
doesn't expect a null input (even though the left-hand check comes first).

An example is SIMILAR to the following (see actual query plans separately):

{noformat}
def func(value: Any): Int = ... // returns an AnyVal, which probably causes an IS NOT NULL filter to be added on the result

val df1 = sparkSession
  .table(...)
  .select("col1", "col2") // LongType both
val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", functions.udf(func)(df1("col1")))
  .repartition(df1("col2"))
  .sortWithinPartitions(df1("col2"))

val df2 = ... // load other data containing col2, similarly repartition and sort

val df3 =
  df1.join(df2, Seq("col2"), "left_outer")
{noformat}


  was:
When optimizing an outer join, spark passes an empty row to both sides to see 
if nulls would be ignored (side comment: for half-outer joins it subsequently 
ignores the assessment on the dominant side).

For some reason, a condition such as "x IS NOT NULL && udf(x) IS NOT NULL" 
might result in checking the right side first, and an exception if the udf 
doesn't expect a null input (given the left side first).

A example is SIMILAR to the following (see actual query plans separately):

def func(value: Any): Int = ... // return AnyVal which probably causes a IS NOT 
NULL added filter on the result

val df1 = sparkSession
  .table(...)
  .select("col1", "col2") // LongType both
val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", functions.udf(func)(df1("col1"))
  .repartition(df1("col2"))
  .sortWithinPartitions(df1("col2"))

val df2 = ... // load other data containing col2, similarly repartition and sort

val df3 =
  df1.join(df2, Seq("col2"), "left_outer")


> query optimizer calls udf with null values when it doesn't expect them
> --
>
> Key: SPARK-20312
> URL: https://issues.apache.org/jira/browse/SPARK-20312
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
>
> When optimizing an outer join, spark passes an empty row to both sides to see 
> if nulls would be ignored (side comment: for half-outer joins it subsequently 
> ignores the assessment on the dominant side).
> For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT 
> NULL}} might result in checking the right side first, and an exception if the 
> udf doesn't expect a null input (given the left side first).
> An example is SIMILAR to the following (see actual query plans separately):
> {noformat}
> def func(value: Any): Int = ... // returns an AnyVal, which probably causes an IS NOT NULL filter to be added on the result
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // LongType both
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1")))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
> val df2 = ... // load other data containing col2, similarly repartition and 
> sort
> val df3 =
>   df1.join(df2, Seq("col2"), "left_outer")
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them

2017-04-12 Thread Albert Meltzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966094#comment-15966094
 ] 

Albert Meltzer commented on SPARK-20312:


During query optimization, one of the subtrees becomes:

{noformat}
Project [col4#0L, col3#5L, col2#10L, col1#11L, 2016-11-11 AS version#59, 
col5#45]
+- Repartition 8, true
   +- Project [col4#0L, col3#5L, col2#10L, col1#11L, UDF(col1#11L) AS col5#45]
  +- Filter isnotnull(UDF(col1#11L))
 +- Join LeftOuter, (col2#10L = col2#6L)
:- Sort [col2#10L ASC NULLS FIRST], false
:  +- RepartitionByExpression [col2#10L]
: +- Filter (isnotnull(col2#10L) && isnotnull(col1#11L))
:+- LocalRelation [col2#10L, col1#11L]
+- Project [col4#0L, col3#5L, col2#6L]
   +- Join LeftOuter, (col3#5L = col3#1L)
  :- Sort [col3#5L ASC NULLS FIRST], false
  :  +- RepartitionByExpression [col3#5L]
  : +- Filter (isnotnull(col3#5L) && isnotnull(col2#6L))
  :+- LocalRelation [col3#5L, col2#6L]
  +- Sort [col3#1L ASC NULLS FIRST], false
 +- RepartitionByExpression [col3#1L]
+- Filter (isnotnull(col4#0L) && isnotnull(col3#1L))
   +- LocalRelation [col4#0L, col3#1L]
{noformat}

And that causes the UDF to be evaluated inside the 
{{org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin}} rule, before 
the filter on its parameter value has been applied.
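
For completeness, a sketch of a defensive workaround, assuming the df1 from the snippet in the description; the function body and the name safeFunc are illustrative, not the original udf:

{noformat}
import org.apache.spark.sql.functions.udf

// Take the boxed type and short-circuit on null, so that speculative evaluation
// of UDF(col1) on an all-null row (as EliminateOuterJoin does) cannot throw
// before the isnotnull filter is applied.
val safeFunc: java.lang.Long => java.lang.Integer =
  v => if (v == null) null else Int.box(v.toInt) // v.toInt stands in for the real computation

val df11safe = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", udf(safeFunc)(df1("col1")))
{noformat}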

> query optimizer calls udf with null values when it doesn't expect them
> --
>
> Key: SPARK-20312
> URL: https://issues.apache.org/jira/browse/SPARK-20312
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
>
> When optimizing an outer join, spark passes an empty row to both sides to see 
> if nulls would be ignored (side comment: for half-outer joins it subsequently 
> ignores the assessment on the dominant side).
> For some reason, a condition such as "x IS NOT NULL && udf(x) IS NOT NULL" 
> might result in checking the right side first, and an exception if the udf 
> doesn't expect a null input (given the left side first).
> A example is SIMILAR to the following (see actual query plans separately):
> def func(value: Any): Int = ... // return AnyVal which probably causes a IS 
> NOT NULL added filter on the result
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // LongType both
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1"))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
> val df2 = ... // load other data containing col2, similarly repartition and 
> sort
> val df3 =
>   df1.join(df2, Seq("col2"), "left_outer")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them

2017-04-12 Thread Albert Meltzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966091#comment-15966091
 ] 

Albert Meltzer commented on SPARK-20312:


Query plans are as follows:

{noformat}
== Parsed Logical Plan ==
'InsertIntoTable 'UnresolvedRelation `database`.`table`, 
OverwriteOptions(true,Map()), false
+- Filter (((isnotnull(col2#10L) && isnotnull(col1#11L)) && 
isnotnull(version#59)) && isnotnull(col5#45))
   +- Project [col4#0L, col3#5L, col2#10L, col1#11L, version#59, col5#45]
  +- Project [col4#0L, col3#5L, col2#10L, col1#11L, col5#45, 2016-11-11 AS 
version#59]
 +- Repartition 8, true
+- Project [col4#0L, col3#5L, col2#10L, col1#11L, UDF(col1#11L) AS 
col5#45]
   +- Project [col4#0L, col3#5L, col2#10L, col1#11L]
  +- Project [col2#10L, col1#11L, col4#0L, col3#5L]
 +- Join LeftOuter, (col2#10L = col2#6L)
:- Sort [col2#10L ASC NULLS FIRST], false
:  +- RepartitionByExpression [col2#10L]
: +- Filter isnotnull(col1#11L)
:+- Filter isnotnull(col2#10L)
:   +- LocalRelation [col2#10L, col1#11L]
+- Project [col4#0L, col3#5L, col2#6L]
   +- Project [col3#5L, col2#6L, col4#0L]
  +- Join LeftOuter, (col3#5L = col3#1L)
 :- Sort [col3#5L ASC NULLS FIRST], false
 :  +- RepartitionByExpression [col3#5L]
 : +- Filter isnotnull(col2#6L)
 :+- Filter isnotnull(col3#5L)
 :   +- LocalRelation [col3#5L, col2#6L]
 +- Sort [col3#1L ASC NULLS FIRST], false
+- RepartitionByExpression [col3#1L]
   +- Filter isnotnull(col3#1L)
  +- Filter isnotnull(col4#0L)
 +- LocalRelation [col4#0L, col3#1L]

== Analyzed Logical Plan ==
InsertIntoTable MetastoreRelation database, table, Map(version -> None, col5 -> 
None), OverwriteOptions(true,Map()), false
+- Filter (((isnotnull(col2#10L) && isnotnull(col1#11L)) && 
isnotnull(version#59)) && isnotnull(col5#45))
   +- Project [col4#0L, col3#5L, col2#10L, col1#11L, version#59, col5#45]
  +- Project [col4#0L, col3#5L, col2#10L, col1#11L, col5#45, 2016-11-11 AS 
version#59]
 +- Repartition 8, true
+- Project [col4#0L, col3#5L, col2#10L, col1#11L, UDF(col1#11L) AS 
col5#45]
   +- Project [col4#0L, col3#5L, col2#10L, col1#11L]
  +- Project [col2#10L, col1#11L, col4#0L, col3#5L]
 +- Join LeftOuter, (col2#10L = col2#6L)
:- Sort [col2#10L ASC NULLS FIRST], false
:  +- RepartitionByExpression [col2#10L]
: +- Filter isnotnull(col1#11L)
:+- Filter isnotnull(col2#10L)
:   +- LocalRelation [col2#10L, col1#11L]
+- Project [col4#0L, col3#5L, col2#6L]
   +- Project [col3#5L, col2#6L, col4#0L]
  +- Join LeftOuter, (col3#5L = col3#1L)
 :- Sort [col3#5L ASC NULLS FIRST], false
 :  +- RepartitionByExpression [col3#5L]
 : +- Filter isnotnull(col2#6L)
 :+- Filter isnotnull(col3#5L)
 :   +- LocalRelation [col3#5L, col2#6L]
 +- Sort [col3#1L ASC NULLS FIRST], false
+- RepartitionByExpression [col3#1L]
   +- Filter isnotnull(col3#1L)
  +- Filter isnotnull(col4#0L)
 +- LocalRelation [col4#0L, col3#1L]

== Optimized Logical Plan ==
InsertIntoTable MetastoreRelation database, table, Map(version -> None, col5 -> 
None), OverwriteOptions(true,Map()), false
+- Project [col4#0L, col3#5L, col2#10L, col1#11L, 2016-11-11 AS version#59, 
col5#45]
   +- Repartition 8, true
  +- Project [col4#0L, col3#5L, col2#10L, col1#11L, UDF(col1#11L) AS 
col5#45]
 +- Join LeftOuter, (col2#10L = col2#6L)
:- Sort [col2#10L ASC NULLS FIRST], false
:  +- RepartitionByExpression [col2#10L]
: +- Filter ((isnotnull(col2#10L) && isnotnull(col1#11L)) && 
isnotnull(UDF(col1#11L)))
:+- LocalRelation [col2#10L, col1#11L]
+- Project [col4#0L, col3#5L, col2#6L]
   +- Join 

[jira] [Created] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them

2017-04-12 Thread Albert Meltzer (JIRA)
Albert Meltzer created SPARK-20312:
--

 Summary: query optimizer calls udf with null values when it 
doesn't expect them
 Key: SPARK-20312
 URL: https://issues.apache.org/jira/browse/SPARK-20312
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Albert Meltzer


When optimizing an outer join, spark passes an empty row to both sides to see 
if nulls would be ignored (side comment: for half-outer joins it subsequently 
ignores the assessment on the dominant side).

For some reason, a condition such as "x IS NOT NULL && udf(x) IS NOT NULL" 
might result in checking the right side first, and an exception if the udf 
doesn't expect a null input (given the left side first).

An example is SIMILAR to the following (see actual query plans separately):

def func(value: Any): Int = ... // returns an AnyVal, which probably causes an IS NOT NULL filter to be added on the result

val df1 = sparkSession
  .table(...)
  .select("col1", "col2") // LongType both
val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", functions.udf(func)(df1("col1")))
  .repartition(df1("col2"))
  .sortWithinPartitions(df1("col2"))

val df2 = ... // load other data containing col2, similarly repartition and sort

val df3 =
  df1.join(df2, Seq("col2"), "left_outer")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20292) string representation of TreeNode is messy

2017-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966083#comment-15966083
 ] 

Apache Spark commented on SPARK-20292:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/17623

> string representation of TreeNode is messy
> --
>
> Key: SPARK-20292
> URL: https://issues.apache.org/jira/browse/SPARK-20292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> Currently we have a lot of string representations for QueryPlan/Expression: 
> toString, simpleString, verboseString, treeString, sql, etc.
> The logic between them is not very clear and as a result, 
> {{Expression.treeString}} is mostly unreadable as it contains a lot 
> duplicated information.
> We should clean it up



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20292) string representation of TreeNode is messy

2017-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20292:


Assignee: Apache Spark

> string representation of TreeNode is messy
> --
>
> Key: SPARK-20292
> URL: https://issues.apache.org/jira/browse/SPARK-20292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> Currently we have a lot of string representations for QueryPlan/Expression: 
> toString, simpleString, verboseString, treeString, sql, etc.
> The logic between them is not very clear and as a result, 
> {{Expression.treeString}} is mostly unreadable as it contains a lot 
> duplicated information.
> We should clean it up



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20292) string representation of TreeNode is messy

2017-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20292:


Assignee: (was: Apache Spark)

> string representation of TreeNode is messy
> --
>
> Key: SPARK-20292
> URL: https://issues.apache.org/jira/browse/SPARK-20292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> Currently we have a lot of string representations for QueryPlan/Expression: 
> toString, simpleString, verboseString, treeString, sql, etc.
> The logic between them is not very clear and as a result, 
> {{Expression.treeString}} is mostly unreadable as it contains a lot 
> duplicated information.
> We should clean it up



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items

2017-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966079#comment-15966079
 ] 

Apache Spark commented on SPARK-20300:
--

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/17622

> Python API for ALSModel.recommendForAllUsers,Items
> --
>
> Key: SPARK-20300
> URL: https://issues.apache.org/jira/browse/SPARK-20300
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> Python API for ALSModel methods recommendForAllUsers, recommendForAllItems



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items

2017-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20300:


Assignee: (was: Apache Spark)

> Python API for ALSModel.recommendForAllUsers,Items
> --
>
> Key: SPARK-20300
> URL: https://issues.apache.org/jira/browse/SPARK-20300
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> Python API for ALSModel methods recommendForAllUsers, recommendForAllItems



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20300) Python API for ALSModel.recommendForAllUsers,Items

2017-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20300:


Assignee: Apache Spark

> Python API for ALSModel.recommendForAllUsers,Items
> --
>
> Key: SPARK-20300
> URL: https://issues.apache.org/jira/browse/SPARK-20300
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Python API for ALSModel methods recommendForAllUsers, recommendForAllItems



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-04-12 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-20311:
-

 Summary: SQL "range(N) as alias" or "range(N) alias" doesn't work
 Key: SPARK-20311
 URL: https://issues.apache.org/jira/browse/SPARK-20311
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Juliusz Sompolski
Priority: Minor


`select * from range(10) as A;` or `select * from range(10) A;`
does not work.
As a workaround, a subquery has to be used:
`select * from (select * from range(10)) as A;`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception

2017-04-12 Thread Omaiyma Popat (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965922#comment-15965922
 ] 

Omaiyma Popat commented on SPARK-19547:
---

Hi Team, Can you please advise a resolution on the above issue.
I am facing the same error for: spark-streaming-kafka-0-10_2.11:2.0.1:jar

{code}
val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> prop.getProperty("kafka.broker.list"),
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
  ConsumerConfig.GROUP_ID_CONFIG -> prop.getProperty("kafka.group.id"),
  ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> prop.getProperty("kafka.auto.offset.reset"),
  ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> (true: java.lang.Boolean)
)

val topics = Array(prop.getProperty("kafka.topic"))

val stream: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](topics, kafkaParams))
{code}

> KafkaUtil throw 'No current assignment for partition' Exception
> ---
>
> Key: SPARK-19547
> URL: https://issues.apache.org/jira/browse/SPARK-19547
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.6.1
>Reporter: wuchang
>
> Below is my scala code to create spark kafka stream:
> val kafkaParams = Map[String, Object](
>   "bootstrap.servers" -> "server110:2181,server110:9092",
>   "zookeeper" -> "server110:2181",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> classOf[StringDeserializer],
>   "group.id" -> "example",
>   "auto.offset.reset" -> "latest",
>   "enable.auto.commit" -> (false: java.lang.Boolean)
> )
> val topics = Array("ABTest")
> val stream = KafkaUtils.createDirectStream[String, String](
>   ssc,
>   PreferConsistent,
>   Subscribe[String, String](topics, kafkaParams)
> )
> But after run for 10 hours, it throws exceptions:
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:20,011 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:40,057 INFO  [JobGenerator] internals.AbstractCoordinator: 
> Successfully joined group example with generation 5
> 2017-02-10 10:56:40,058 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Setting newly assigned partitions [ABTest-1] for group example
> 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error 
> generating jobs for time 148669538 ms
> java.lang.IllegalStateException: No current assignment for partition ABTest-0
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:289)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at 
> 

[jira] [Commented] (SPARK-19451) Long values in Window function

2017-04-12 Thread Julien Champ (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965915#comment-15965915
 ] 

Julien Champ commented on SPARK-19451:
--

Any news on this bug / feature request ?

> Long values in Window function
> --
>
> Key: SPARK-19451
> URL: https://issues.apache.org/jira/browse/SPARK-19451
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: Julien Champ
>
> Hi there,
> there seems to be a major limitation in spark window functions and 
> rangeBetween method.
> If I have the following code :
> {code:title=Exemple |borderStyle=solid}
> val tw =  Window.orderBy("date")
>   .partitionBy("id")
>   .rangeBetween( from , 0)
> {code}
> Everything seems OK while the *from* value is not too large, even though the 
> rangeBetween() method supports Long parameters.
> But if I set *from* to *-216000L*, it does not work!
> It is probably related to this part of code in the between() method, of the 
> WindowSpec class, called by rangeBetween()
> {code:title=between() method|borderStyle=solid}
> val boundaryStart = start match {
>   case 0 => CurrentRow
>   case Long.MinValue => UnboundedPreceding
>   case x if x < 0 => ValuePreceding(-start.toInt)
>   case x if x > 0 => ValueFollowing(start.toInt)
> }
> {code}
> ( look at this *.toInt* )
> Does anybody know if there's a way to solve / patch this behavior?
> Any help will be appreciated
> Thx
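
For reference, a minimal sketch (not from the original report) of how the quoted {{.toInt}} cast wraps a boundary whose magnitude exceeds Int.MaxValue; the value below is illustrative:

{code}
// Long.MinValue is special-cased as UnboundedPreceding, but any other boundary
// larger in magnitude than Int.MaxValue is silently wrapped by the .toInt cast.
val from = -2160000000L   // magnitude > Int.MaxValue (2147483647)
println(from.toInt)       // prints 2134967296, not -2160000000
{code}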



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-04-12 Thread Paul Zaczkieiwcz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965822#comment-15965822
 ] 

Paul Zaczkieiwcz commented on SPARK-18055:
--

Actually, it seems it was this issue. I forgot to mention that the Aggregator 
returns an Array of a case class, which I would then like to flatMap. I get the 
stack trace on the {{stitcher.toColumn}} line rather than the {{flatMap}} line. 
I've since patched Spark with this PR to work around the issue.
{code:java}
case class StitchedVisitor(cookie_id: java.math.BigDecimal, visit_num: Int, ...)
case class CookieId(cookie_id: java.math.BigDecimal)
val stitcher: Aggregator[StitchedVisitor, Array[Option[StitchedVisitor]], Array[StitchedVisitor]] = ...

df.groupByKey(s => CookieId(s.cookie_id))
  .agg(stitcher.toColumn)
  .flatMap(agg => agg._2)
{code}

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
>Assignee: Michael Armbrust
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on a Dataset column which is of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesn't work on the Dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20310) Dependency convergence error for scala-xml

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20310:
--
Issue Type: Improvement  (was: Bug)

You may have a slightly earlier convergence problem in your build ... it looks 
like you're on Scala 2.11.0 instead of 2.11.8 like Spark. Can you manage that 
up? Maybe that ends up agreeing on scala-xml 1.0.2, though, I doubt it actually 
causes a problem in any event.


> Dependency convergence error for scala-xml
> --
>
> Key: SPARK-20310
> URL: https://issues.apache.org/jira/browse/SPARK-20310
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Samik R
>Priority: Minor
>
> Hi,
> I am trying to compile a package (apache tinkerpop) which has spark-core as 
> one of the dependencies. I am trying to compile with v2.1.0. But when I run 
> maven build through a dependency checker, it is showing a dependency error 
> within the spark-core itself for scala-xml package, as below:
> Dependency convergence error for org.scala-lang.modules:scala-xml_2.11:1.0.1 
> paths to dependency are:
> +-org.apache.tinkerpop:spark-gremlin:3.2.3
>   +-org.apache.spark:spark-core_2.11:2.1.0
> +-org.json4s:json4s-jackson_2.11:3.2.11
>   +-org.json4s:json4s-core_2.11:3.2.11
> +-org.scala-lang:scalap:2.11.0
>   +-org.scala-lang:scala-compiler:2.11.0
> +-org.scala-lang.modules:scala-xml_2.11:1.0.1
> and
> +-org.apache.tinkerpop:spark-gremlin:3.2.3
>   +-org.apache.spark:spark-core_2.11:2.1.0
> +-org.apache.spark:spark-tags_2.11:2.1.0
>   +-org.scalatest:scalatest_2.11:2.2.6
> +-org.scala-lang.modules:scala-xml_2.11:1.0.2
> Can this be fixed?
> Thanks.
> -Samik



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20310) Dependency convergence error for scala-xml

2017-04-12 Thread Samik R (JIRA)
Samik R created SPARK-20310:
---

 Summary: Dependency convergence error for scala-xml
 Key: SPARK-20310
 URL: https://issues.apache.org/jira/browse/SPARK-20310
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.1.0
Reporter: Samik R
Priority: Minor


Hi,

I am trying to compile a package (apache tinkerpop) which has spark-core as one 
of the dependencies. I am trying to compile with v2.1.0. But when I run maven 
build through a dependency checker, it is showing a dependency error within the 
spark-core itself for scala-xml package, as below:

Dependency convergence error for org.scala-lang.modules:scala-xml_2.11:1.0.1 
paths to dependency are:
+-org.apache.tinkerpop:spark-gremlin:3.2.3
  +-org.apache.spark:spark-core_2.11:2.1.0
+-org.json4s:json4s-jackson_2.11:3.2.11
  +-org.json4s:json4s-core_2.11:3.2.11
+-org.scala-lang:scalap:2.11.0
  +-org.scala-lang:scala-compiler:2.11.0
+-org.scala-lang.modules:scala-xml_2.11:1.0.1
and
+-org.apache.tinkerpop:spark-gremlin:3.2.3
  +-org.apache.spark:spark-core_2.11:2.1.0
+-org.apache.spark:spark-tags_2.11:2.1.0
  +-org.scalatest:scalatest_2.11:2.2.6
+-org.scala-lang.modules:scala-xml_2.11:1.0.2

Can this be fixed?
Thanks.
-Samik




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20306) beeline connect spark thrift server failure

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-20306:
---

> beeline connect spark thrift server  failure
> 
>
> Key: SPARK-20306
> URL: https://issues.apache.org/jira/browse/SPARK-20306
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.2
>Reporter: sydt
>
> Beeline fails to connect to the Spark Thrift Server of spark-1.6.2 with Kerberos, and the 
> error is:
> Error occurred during processing of message.
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
> Unsupported mechanism type GSSAPI
> The connect command is :
> !connect 
> jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
> . However, it worked well before. 
> When my hadoop cluster is equipped with federation, it does not work well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20306) beeline connect spark thrift server failure

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20306.
---
Resolution: Not A Problem

> beeline connect spark thrift server  failure
> 
>
> Key: SPARK-20306
> URL: https://issues.apache.org/jira/browse/SPARK-20306
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.2
>Reporter: sydt
>
> Beeline fails to connect to the Spark Thrift Server of spark-1.6.2 with Kerberos, and the 
> error is:
> Error occurred during processing of message.
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
> Unsupported mechanism type GSSAPI
> The connect command is :
> !connect 
> jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
> . However, it worked well before. 
> When my hadoop cluster is equipped with federation, it does not work well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20309) Repartitioning - more than the default number of partitions

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20309.
---
Resolution: Not A Problem

> Repartitioning - more than the default number of partitions
> ---
>
> Key: SPARK-20309
> URL: https://issues.apache.org/jira/browse/SPARK-20309
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.6.1
> Environment: Spark 1.6.1, Hadoop 2.6.2, Redhat Linux
>Reporter: balaji krishnan
>
> We have a requirement to process roughly 45MM rows. Each row has a KEY 
> column. The unique number of KEYS are roughly 25000. I do the following Spark 
> SQL to get all 25000 partitions
> hiveContext.sql("select * from BigTbl distribute by KEY")
> This nicely creates all required partitions. Using MapPartitions we process 
> all of these partitions in parallel. I have a variable which should be 
> initiated once per partition. Lets call that variable name as designName. 
> This variable is initialized with the KEY value (partition value) for future 
> identification purposes.
>  FlatMapFunction flatMapSetup = new 
> FlatMapFunction() {
> String designName= null; 
> @Override
> public java.lang.Iterable call(java.util.Iterator it) 
> throws Exception {
> while (it.hasNext()) {
> Row row = it.next(); //row is of type org.apache.spark.sql.Row
> if (designName == null) { 
> designName = row.getString(1); // row is of type org.apache.spark.sql.Row
> }
> .
> }
> List Dsg =  partitionedRecs.mapPartitions(flatMapSetup).collect();
> The problem starts here for me.
> Since i have more than the default number of partitions (200) i do a 
> re-partition to the unique number of partitions in my BigTbl. However if i 
> re-partition the designName gets completely confused. Instead of processing a 
> total of 25000 unique values, i get only 15000 values processed. For some 
> reason the repartition completely messes up the uniqueness of the previous 
> step. The statement that i use to 'distribute by' and subsequently repartition
> long distinctValues = hiveContext.sql("select KEY from 
> BigTbl").distinct().count();
> JavaRDD partitionedRecs  = hiveContext.sql("select * from BigTbl 
> DISTRIBUTE by SKU ").repartition((int)distinctValues).toJavaRDD().cache();
> I tried another solution, that is by changing the 
> spark.sql.shuffle.partitions during SparkContext creation. Even that did not 
> help. I get the same issue.
> new JavaSparkContext(new 
> SparkConf().set("spark.driver.maxResultSize","0").set("spark.sql.shuffle.partitions","26000"))
> Is there a way to solve this issue please or is this a DISTRIBUTE BY bug ? We 
> are using Spark 1.6.1
> Regards
> Bala



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965735#comment-15965735
 ] 

Fei Wang commented on SPARK-20184:
--

I also used the master branch to test my test case:
1. Java version
192:spark wangfei$ java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

2. Spark starting cmd
192:spark wangfei$ bin/spark-sql --master local[4] --driver-memory 16g

3. Test result:
SQL:
{code}
select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100;
  {code}
codegen on: about 1.4s
codegen off: about 0.6s
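
For reference, a sketch of how the two timings above can be reproduced; the SparkSession name {{spark}}, the {{timed}} helper, and the abbreviated {{query}} string are assumptions, not taken from the report:

{code}
// Toggle whole-stage codegen around the same aggregation and time both runs.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

val query = "select dim_1, dim_2, sum(c1), sum(c2) from sum_table_50w_3 group by dim_1, dim_2 limit 100" // abbreviated

spark.conf.set("spark.sql.codegen.wholeStage", "true")
timed("codegen on")(spark.sql(query).collect())

spark.conf.set("spark.sql.codegen.wholeStage", "false")
timed("codegen off")(spark.sql(query).collect())
{code}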


> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of the following SQL gets much worse in Spark 2.x with whole-stage codegen on, in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> whole stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> whole stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis I think this is related to the huge Java method (a method of 
> thousands of lines) generated by codegen.
> If I configure -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20306) beeline connect spark thrift server failure

2017-04-12 Thread sydt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sydt closed SPARK-20306.

Resolution: Fixed

> beeline connect spark thrift server  failure
> 
>
> Key: SPARK-20306
> URL: https://issues.apache.org/jira/browse/SPARK-20306
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.2
>Reporter: sydt
>
> Beeline fails to connect to the Spark Thrift Server of spark-1.6.2 with Kerberos, and the 
> error is:
> Error occurred during processing of message.
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
> Unsupported mechanism type GSSAPI
> The connect command is :
> !connect 
> jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
> . However, it worked well before. 
> When my hadoop cluster is equipped with federation, it does not work well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20309) Repartitioning - more than the default number of partitions

2017-04-12 Thread balaji krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965715#comment-15965715
 ] 

balaji krishnan commented on SPARK-20309:
-

thanks Sean. I will send that as an email to u...@spark.apache.org. 

I need to re partition to process all partitions as the number of partitions 
that i have in my case is higher than the default partition (200).

Will try and articulate even better while sending this to that email address..

Thanks for quick response.

Regards

Bala

> Repartitioning - more than the default number of partitions
> ---
>
> Key: SPARK-20309
> URL: https://issues.apache.org/jira/browse/SPARK-20309
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.6.1
> Environment: Spark 1.6.1, Hadoop 2.6.2, Redhat Linux
>Reporter: balaji krishnan
>
> We have a requirement to process roughly 45MM rows. Each row has a KEY 
> column. The unique number of KEYS are roughly 25000. I do the following Spark 
> SQL to get all 25000 partitions
> hiveContext.sql("select * from BigTbl distribute by KEY")
> This nicely creates all required partitions. Using MapPartitions we process 
> all of these partitions in parallel. I have a variable which should be 
> initiated once per partition. Lets call that variable name as designName. 
> This variable is initialized with the KEY value (partition value) for future 
> identification purposes.
>  FlatMapFunction flatMapSetup = new 
> FlatMapFunction() {
> String designName= null; 
> @Override
> public java.lang.Iterable call(java.util.Iterator it) 
> throws Exception {
> while (it.hasNext()) {
> Row row = it.next(); //row is of type org.apache.spark.sql.Row
> if (designName == null) { 
> designName = row.getString(1); // row is of type org.apache.spark.sql.Row
> }
> .
> }
> List Dsg =  partitionedRecs.mapPartitions(flatMapSetup).collect();
> The problem starts here for me.
> Since i have more than the default number of partitions (200) i do a 
> re-partition to the unique number of partitions in my BigTbl. However if i 
> re-partition the designName gets completely confused. Instead of processing a 
> total of 25000 unique values, i get only 15000 values processed. For some 
> reason the repartition completely messes up the uniqueness of the previous 
> step. The statement that i use to 'distribute by' and subsequently repartition
> long distinctValues = hiveContext.sql("select KEY from 
> BigTbl").distinct().count();
> JavaRDD partitionedRecs  = hiveContext.sql("select * from BigTbl 
> DISTRIBUTE by SKU ").repartition((int)distinctValues).toJavaRDD().cache();
> I tried another solution, that is by changing the 
> spark.sql.shuffle.partitions during SparkContext creation. Even that did not 
> help. I get the same issue.
> new JavaSparkContext(new 
> SparkConf().set("spark.driver.maxResultSize","0").set("spark.sql.shuffle.partitions","26000"))
> Is there a way to solve this issue please or is this a DISTRIBUTE BY bug ? We 
> are using Spark 1.6.1
> Regards
> Bala



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20309) Repartitioning - more than the default number of partitions

2017-04-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965714#comment-15965714
 ] 

Sean Owen commented on SPARK-20309:
---

It's not clear what you're asking, but this should go to u...@spark.apache.org, 
not JIRA.
Repartitioning isn't going to preserve some partition invariant from the 
original partitioning, in general. It doesn't sound like a problem.
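
For illustration only (a sketch, not from the original discussion): repartitioning by the key expression rather than by a bare partition count keeps all rows of a given KEY in the same partition, which is what the per-partition designName logic relies on; the partition count below is illustrative, and several keys may still share one partition.

{code}
import org.apache.spark.sql.functions.col

val bigTbl = hiveContext.sql("select * from BigTbl")
// Hash-partition by KEY instead of a plain repartition(n), then process as before.
val partitioned = bigTbl.repartition(26000, col("KEY"))
val partitionedRecs = partitioned.toJavaRDD().cache() // followed by mapPartitions(flatMapSetup)
{code}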

> Repartitioning - more than the default number of partitions
> ---
>
> Key: SPARK-20309
> URL: https://issues.apache.org/jira/browse/SPARK-20309
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 1.6.1
> Environment: Spark 1.6.1, Hadoop 2.6.2, Redhat Linux
>Reporter: balaji krishnan
>
> We have a requirement to process roughly 45MM rows. Each row has a KEY 
> column. The unique number of KEYS are roughly 25000. I do the following Spark 
> SQL to get all 25000 partitions
> hiveContext.sql("select * from BigTbl distribute by KEY")
> This nicely creates all required partitions. Using MapPartitions we process 
> all of these partitions in parallel. I have a variable which should be 
> initiated once per partition. Lets call that variable name as designName. 
> This variable is initialized with the KEY value (partition value) for future 
> identification purposes.
>  FlatMapFunction flatMapSetup = new 
> FlatMapFunction() {
> String designName= null; 
> @Override
> public java.lang.Iterable call(java.util.Iterator it) 
> throws Exception {
> while (it.hasNext()) {
> Row row = it.next(); //row is of type org.apache.spark.sql.Row
> if (designName == null) { 
> designName = row.getString(1); // row is of type org.apache.spark.sql.Row
> }
> .
> }
> List Dsg =  partitionedRecs.mapPartitions(flatMapSetup).collect();
> The problem starts here for me.
> Since i have more than the default number of partitions (200) i do a 
> re-partition to the unique number of partitions in my BigTbl. However if i 
> re-partition the designName gets completely confused. Instead of processing a 
> total of 25000 unique values, i get only 15000 values processed. For some 
> reason the repartition completely messes up the uniqueness of the previous 
> step. The statement that i use to 'distribute by' and subsequently repartition
> long distinctValues = hiveContext.sql("select KEY from 
> BigTbl").distinct().count();
> JavaRDD partitionedRecs  = hiveContext.sql("select * from BigTbl 
> DISTRIBUTE by SKU ").repartition((int)distinctValues).toJavaRDD().cache();
> I tried another solution, that is by changing the 
> spark.sql.shuffle.partitions during SparkContext creation. Even that did not 
> help. I get the same issue.
> new JavaSparkContext(new 
> SparkConf().set("spark.driver.maxResultSize","0").set("spark.sql.shuffle.partitions","26000"))
> Is there a way to solve this issue please or is this a DISTRIBUTE BY bug ? We 
> are using Spark 1.6.1
> Regards
> Bala



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20309) Repartitioning - more than the default number of partitions

2017-04-12 Thread balaji krishnan (JIRA)
balaji krishnan created SPARK-20309:
---

 Summary: Repartitioning - more than the default number of 
partitions
 Key: SPARK-20309
 URL: https://issues.apache.org/jira/browse/SPARK-20309
 Project: Spark
  Issue Type: Bug
  Components: Java API, SQL
Affects Versions: 1.6.1
 Environment: Spark 1.6.1, Hadoop 2.6.2, Redhat Linux
Reporter: balaji krishnan


We have a requirement to process roughly 45MM rows. Each row has a KEY column. 
The unique number of KEYS are roughly 25000. I do the following Spark SQL to 
get all 25000 partitions

hiveContext.sql("select * from BigTbl distribute by KEY")
This nicely creates all required partitions. Using mapPartitions we process all 
of these partitions in parallel. I have a variable which should be initialized 
once per partition. Let's call that variable designName. This variable 
is initialized with the KEY value (partition value) for future identification 
purposes.

 FlatMapFunction flatMapSetup = new 
FlatMapFunction() {
String designName= null; 
@Override
public java.lang.Iterable call(java.util.Iterator it) 
throws Exception {
while (it.hasNext()) {
Row row = it.next(); //row is of type org.apache.spark.sql.Row
if (designName == null) { 
designName = row.getString(1); // row is of type org.apache.spark.sql.Row
}
.
}

List Dsg =  partitionedRecs.mapPartitions(flatMapSetup).collect();
The problem starts here for me.

Since I have more than the default number of partitions (200), I repartition to 
the unique number of partitions in my BigTbl. However, if I repartition, the 
designName gets completely confused. Instead of processing a total of 25000 
unique values, I get only 15000 values processed. For some reason the 
repartition completely messes up the uniqueness of the previous step. The 
statements that I use to 'distribute by' and subsequently repartition:

long distinctValues = hiveContext.sql("select KEY from 
BigTbl").distinct().count();
JavaRDD partitionedRecs  = hiveContext.sql("select * from BigTbl 
DISTRIBUTE by SKU ").repartition((int)distinctValues).toJavaRDD().cache();
I tried another solution, that is by changing the spark.sql.shuffle.partitions 
during SparkContext creation. Even that did not help. I get the same issue.

new JavaSparkContext(new 
SparkConf().set("spark.driver.maxResultSize","0").set("spark.sql.shuffle.partitions","26000"))
Is there a way to solve this issue, or is this a DISTRIBUTE BY bug? We are 
using Spark 1.6.1.

Regards

Bala
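
For illustration, a minimal Scala sketch (written against the Spark 2.x 
spark-shell API, reusing the hypothetical BigTbl/KEY names above) of the 
difference between the two repartition overloads; whether this is the cause 
here would need confirmation:

{code}
// repartition(n) with no column argument performs a round-robin shuffle, so rows
// that DISTRIBUTE BY grouped under one KEY can be spread across partitions again.
import org.apache.spark.sql.functions.col

val roundRobin = spark.sql("SELECT * FROM BigTbl DISTRIBUTE BY KEY").repartition(25000)

// repartition(n, col("KEY")) hash-partitions by the column, so all rows of a KEY
// stay together and a per-partition designName remains unambiguous.
val byKey = spark.sql("SELECT * FROM BigTbl").repartition(25000, col("KEY"))
{code}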



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20308) org.apache.spark.shuffle.FetchFailedException: Too large frame

2017-04-12 Thread Artur Sukhenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artur Sukhenko resolved SPARK-20308.

Resolution: Duplicate

> org.apache.spark.shuffle.FetchFailedException: Too large frame
> --
>
> Key: SPARK-20308
> URL: https://issues.apache.org/jira/browse/SPARK-20308
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Stanislav Chernichkin
>
> Spark uses custom frame decoder (TransportFrameDecoder) which does not 
> support frames larger than 2G. This lead to fails when shuffling using large 
> partitions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18692) Test Java 8 unidoc build on Jenkins master builder

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-18692:
-

Assignee: Hyukjin Kwon

> Test Java 8 unidoc build on Jenkins master builder
> --
>
> Key: SPARK-18692
> URL: https://issues.apache.org/jira/browse/SPARK-18692
> Project: Spark
>  Issue Type: Test
>  Components: Build, Documentation
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>  Labels: jenkins
> Fix For: 2.2.0
>
>
> [SPARK-3359] fixed the unidoc build for Java 8, but it is easy to break.  It 
> would be great to add this build to the Spark master builder on Jenkins to 
> make it easier to identify PRs which break doc builds.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18692) Test Java 8 unidoc build on Jenkins master builder

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18692.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17477
[https://github.com/apache/spark/pull/17477]

> Test Java 8 unidoc build on Jenkins master builder
> --
>
> Key: SPARK-18692
> URL: https://issues.apache.org/jira/browse/SPARK-18692
> Project: Spark
>  Issue Type: Test
>  Components: Build, Documentation
>Reporter: Joseph K. Bradley
>  Labels: jenkins
> Fix For: 2.2.0
>
>
> [SPARK-3359] fixed the unidoc build for Java 8, but it is easy to break.  It 
> would be great to add this build to the Spark master builder on Jenkins to 
> make it easier to identify PRs which break doc builds.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965670#comment-15965670
 ] 

Herman van Hovell commented on SPARK-20184:
---

I just tried your example using the master branch; both take about 0.9 seconds 
on my local machine, so I am not sure what the problem is. Which version of 
Spark are you using? What JVM are you running on? How are you starting Spark?

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis i think this is related to the huge java method(a java 
> method of thousand lines) which generated by codegen.
> And If i config -XX:-DontCompileHugeMethods the performance get much 
> better(about 7s).
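
As a hedged illustration of how the two switches mentioned in the description 
are usually applied (assuming a Spark 2.x session; the spark-submit flags are 
shown as comments and are illustrative):

{code}
// Disable whole-stage codegen for the current session only.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// Alternatively, keep codegen on but allow the JIT to compile the huge generated
// method by passing the JVM flag at launch time, e.g.:
//   spark-submit \
//     --conf "spark.driver.extraJavaOptions=-XX:-DontCompileHugeMethods" \
//     --conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods" \
//     ...
{code}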



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18791) Stream-Stream Joins

2017-04-12 Thread xianyao jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965657#comment-15965657
 ] 

xianyao jiang commented on SPARK-18791:
---

When will this feature be provided? Is there any design or plan for it yet?
We want to use Structured Streaming, but we need the stream-stream join 
functionality.

> Stream-Stream Joins
> ---
>
> Key: SPARK-18791
> URL: https://issues.apache.org/jira/browse/SPARK-18791
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>
> Just a placeholder for now.  Please comment with your requirements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20306) beeline connect spark thrift server failure

2017-04-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965654#comment-15965654
 ] 

Sean Owen commented on SPARK-20306:
---

This is some env or config issue, not Spark.

> beeline connect spark thrift server  failure
> 
>
> Key: SPARK-20306
> URL: https://issues.apache.org/jira/browse/SPARK-20306
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.2
>Reporter: sydt
>
> Beeline connect spark thrift server of spark-1.6.2 with kerberos failure and 
> error is :
> Error occurred during processing of message.
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
> Unsupported mechanism type GSSAPI
> The connect command is :
> !connect 
> jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
> . However, it works well before. 
> When my hadoop cluster equipped with federation ,it can not work well



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20296:
-

Assignee: Jason Tokayer

> UnsupportedOperationChecker text on distinct aggregations differs from docs
> ---
>
> Key: SPARK-20296
> URL: https://issues.apache.org/jira/browse/SPARK-20296
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Jason Tokayer
>Assignee: Jason Tokayer
>Priority: Trivial
> Fix For: 2.1.2, 2.2.0
>
>
> In the unsupported operations section in the docs 
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>  it states that "Distinct operations on streaming Datasets are not 
> supported.". However, in 
> ```org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker.scala```,
>  the error message is ```Distinct aggregations are not supported on streaming 
> DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete 
> output mode. Consider using approximate distinct aggregation```.
> It seems that the error message is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20296) UnsupportedOperationChecker text on distinct aggregations differs from docs

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20296.
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 17609
[https://github.com/apache/spark/pull/17609]

> UnsupportedOperationChecker text on distinct aggregations differs from docs
> ---
>
> Key: SPARK-20296
> URL: https://issues.apache.org/jira/browse/SPARK-20296
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Jason Tokayer
>Priority: Trivial
> Fix For: 2.2.0, 2.1.2
>
>
> In the unsupported operations section in the docs 
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>  it states that "Distinct operations on streaming Datasets are not 
> supported.". However, in 
> ```org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker.scala```,
>  the error message is ```Distinct aggregations are not supported on streaming 
> DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete 
> output mode. Consider using approximate distinct aggregation```.
> It seems that the error message is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20308) org.apache.spark.shuffle.FetchFailedException: Too large frame

2017-04-12 Thread Stanislav Chernichkin (JIRA)
Stanislav Chernichkin created SPARK-20308:
-

 Summary: org.apache.spark.shuffle.FetchFailedException: Too large 
frame
 Key: SPARK-20308
 URL: https://issues.apache.org/jira/browse/SPARK-20308
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.1.0
Reporter: Stanislav Chernichkin


Spark uses a custom frame decoder (TransportFrameDecoder) which does not 
support frames larger than 2 GB. This leads to failures when shuffling large 
partitions.
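
Until the decoder supports larger frames, a common mitigation is to raise the 
shuffle parallelism so that no single shuffle block approaches 2 GB; a minimal 
spark-shell sketch, with hypothetical table and column names:

{code}
// More shuffle partitions => smaller per-partition shuffle blocks, keeping each
// frame fetched through TransportFrameDecoder well below the 2 GB limit.
spark.conf.set("spark.sql.shuffle.partitions", "2000")  // default is 200

val counts = spark.table("big_table")     // hypothetical table
  .groupBy("customer_id")                 // hypothetical column; triggers a shuffle
  .count()

counts.write.mode("overwrite").parquet("/tmp/big_table_counts")
{code}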



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter

2017-04-12 Thread pralabhkumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965647#comment-15965647
 ] 

pralabhkumar commented on SPARK-20199:
--

GBM uses Random Forest internally, and to add randomness to the trees there 
should be a featureSubsetStrategy option. featureSubsetStrategy should not be 
hardcoded for GBM.
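
For comparison, a minimal Scala sketch of the knob as it already exists on the 
random-forest side of spark.ml; GBTClassifier in 2.1 exposes no equivalent 
setter, which is what this issue asks for:

{code}
// featureSubsetStrategy on RandomForestClassifier: "sqrt" considers sqrt(numFeatures)
// candidate features at each split, i.e. the kind of column sampling requested here.
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setNumTrees(100)
  .setFeatureSubsetStrategy("sqrt")
{code}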

> GradientBoostedTreesModel doesn't have  Column Sampling Rate Paramenter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have Column  sampling rate parameter 
> . This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai 
> gbmParams._col_sample_rate
> Please provide the parameter . 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-04-12 Thread Anne Rutten (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anne Rutten updated SPARK-20307:

Description: 
when training a model in SparkR with string variables (tested with 
spark.randomForest, but I assume this is valid for all spark.xx functions that 
apply a StringIndexer under the hood), testing on a new dataset with factor 
levels that are not in the training set will throw an "Unseen label" error. 
I think this can be solved if there's a method to pass setHandleInvalid on to 
the StringIndexers when calling spark.randomForest.
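
On the Scala side the knob already exists on StringIndexer; a minimal sketch of 
what SparkR would need to forward (the output column name is hypothetical):

{code}
// handleInvalid = "skip" drops rows whose label was not seen during fitting;
// the default "error" throws the "Unseen label" exception shown in the stack trace below.
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("someString")
  .setOutputCol("someString_idx")
  .setHandleInvalid("skip")
{code}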

code snippet:
{code}
# (i've run this in Zeppelin which already has SparkR and the context loaded)
#library(SparkR)
#sparkR.session(master = "local[*]") 
data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
  someString = base::sample(c("this", "that"), 100, 
replace=TRUE), stringsAsFactors=FALSE)
trainidxs = base::sample(nrow(data), nrow(data)*0.7)
traindf = as.DataFrame(data[trainidxs,])
testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))

rf = spark.randomForest(traindf, clicked~., type="classification", 
maxDepth=10, 
maxBins=41,
numTrees = 100)

predictions = predict(rf, testdf)
SparkR::collect(predictions)
{code}
stack trace:
{quote}
Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most recent 
failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor driver): 
org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$4: (string) => double)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unseen label: the other.
at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
... 16 more

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at 

[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-04-12 Thread Anne Rutten (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anne Rutten updated SPARK-20307:

Description: 
when training a model in SparkR with string variables (tested with 
spark.randomForest, but I assume this is valid for all spark.xx functions that 
apply a StringIndexer under the hood), testing on a new dataset with factor 
levels that are not in the training set will throw an "Unseen label" error. 
I think this can be solved if there's a method to pass setHandleInvalid on to 
the StringIndexers when calling spark.randomForest.

code snippet:
{code}
# (i've run this in Zeppelin which already has SparkR and the context loaded)
#library(SparkR)
#sparkR.session(master = "local[*]") 

data = data.frame(clicked = sample(c(0,1),100,replace=TRUE),
  someString = sample(c("this", "that"), 100, 
replace=TRUE), stringsAsFactors=FALSE)
trainidxs = base::sample(nrow(data), nrow(data)*0.7)
traindf = as.DataFrame(data[trainidxs,])
testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))

rf = spark.randomForest(traindf, clicked~., type="classification", 
maxDepth=10, 
maxBins=41,
numTrees = 100)

predictions = predict(rf, testdf, allow.new.labels=TRUE)
SparkR::collect(predictions)   
{code}
stack trace:
{quote}
Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most recent 
failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor driver): 
org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$4: (string) => double)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unseen label: the other.
at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
... 16 more

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at 

[jira] [Updated] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter

2017-04-12 Thread pralabhkumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pralabhkumar updated SPARK-20199:
-
Priority: Major  (was: Minor)

> GradientBoostedTreesModel doesn't have  Column Sampling Rate Paramenter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have Column  sampling rate parameter 
> . This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai 
> gbmParams._col_sample_rate
> Please provide the parameter . 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-04-12 Thread Anne Rutten (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anne Rutten updated SPARK-20307:

Priority: Minor  (was: Major)

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Priority: Minor
>
> when training a model in SparkR with string variables (tested with 
> spark.randomForest, but i assume is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with
> may throw an "Unseen label" error when not all factor levels are present in 
> the training set. I think this can be solved if there's a method to pass 
> setHandleInvalid on to the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = sample(c(0,1),100,replace=TRUE),
>   someString = sample(c("this", "that"), 100, 
> replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf, allow.new.labels=TRUE)
> SparkR::collect(predictions)   
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at scala.Option.foreach(Option.scala:257)
> at 
> 

[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-04-12 Thread Anne Rutten (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anne Rutten updated SPARK-20307:

Component/s: (was: MLlib)

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>
> when training a model in SparkR with string variables (tested with 
> spark.randomForest, but i assume is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with
> may throw an "Unseen label" error when not all factor levels are present in 
> the training set. I think this can be solved if there's a method to pass 
> setHandleInvalid on to the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = sample(c(0,1),100,replace=TRUE),
>   someString = sample(c("this", "that"), 100, 
> replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf, allow.new.labels=TRUE)
> SparkR::collect(predictions)   
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at scala.Option.foreach(Option.scala:257)
> at 
> 

[jira] [Updated] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-04-12 Thread Anne Rutten (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anne Rutten updated SPARK-20307:

Component/s: SparkR

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>
> when training a model in SparkR with string variables (tested with 
> spark.randomForest, but i assume is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with
> may throw an "Unseen label" error when not all factor levels are present in 
> the training set. I think this can be solved if there's a method to pass 
> setHandleInvalid on to the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = sample(c(0,1),100,replace=TRUE),
>   someString = sample(c("this", "that"), 100, 
> replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf, allow.new.labels=TRUE)
> SparkR::collect(predictions)   
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)

[jira] [Created] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-04-12 Thread Anne Rutten (JIRA)
Anne Rutten created SPARK-20307:
---

 Summary: SparkR: pass on setHandleInvalid to spark.mllib functions 
that use StringIndexer
 Key: SPARK-20307
 URL: https://issues.apache.org/jira/browse/SPARK-20307
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.1.0
Reporter: Anne Rutten


when training a model in SparkR with string variables (tested with 
spark.randomForest, but I assume this is valid for all spark.xx functions that 
apply a StringIndexer under the hood), testing on a new dataset 
may throw an "Unseen label" error when not all factor levels are present in the 
training set. I think this can be solved if there's a method to pass 
setHandleInvalid on to the StringIndexers when calling spark.randomForest.

code snippet:
{code}
# (i've run this in Zeppelin which already has SparkR and the context loaded)
#library(SparkR)
#sparkR.session(master = "local[*]") 

data = data.frame(clicked = sample(c(0,1),100,replace=TRUE),
  someString = sample(c("this", "that"), 100, 
replace=TRUE), stringsAsFactors=FALSE)
trainidxs = base::sample(nrow(data), nrow(data)*0.7)
traindf = as.DataFrame(data[trainidxs,])
testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))

rf = spark.randomForest(traindf, clicked~., type="classification", 
maxDepth=10, 
maxBins=41,
numTrees = 100)

predictions = predict(rf, testdf, allow.new.labels=TRUE)
SparkR::collect(predictions)   
{code}
stack trace:
{quote}
Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most recent 
failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor driver): 
org.apache.spark.SparkException: Failed to execute user defined 
function($anonfun$4: (string) => double)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unseen label: the other.
at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
at 
org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
... 16 more

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 

[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark

2017-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965632#comment-15965632
 ] 

Apache Spark commented on SPARK-6227:
-

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/17621

> PCA and SVD for PySpark
> ---
>
> Key: SPARK-6227
> URL: https://issues.apache.org/jira/browse/SPARK-6227
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.1
>Reporter: Julien Amelot
>Assignee: Manoj Kumar
>
> The Dimensionality Reduction techniques are not available via Python (Scala + 
> Java only).
> * Principal component analysis (PCA)
> * Singular value decomposition (SVD)
> Doc:
> http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965601#comment-15965601
 ] 

Hyukjin Kwon commented on SPARK-20294:
--

{quote}
If we know that the rdd is not empty (the function tests for that before line 
354), then we should at least use the first row as a fallback if the sampling 
fails.
{quote}

I think this can just be handled on the application side if we both agree that 
it is a narrow case.

{code}
small_rdd = sc.parallelize([(1, '2'), (2, 'foo')])
try:
df = small_rdd.toDF(sampleRatio=0.01)
except ValueError:
df = small_rdd.toDF()
{code}



> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: João Pedro Jericó
>Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This is possible for example if one has a small RDD but that needs 
> the schema to be inferred by more than one Row. For example:
> {code}
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> {code}
> This will fail with high probability because when sampling the small_rdd with 
> the .sample method it will return an empty RDD most of the time. However, 
> this is not the desired result because we are able to sample at least 1% of 
> the RDD.
> This is probably a problem with the other Spark APIs however I don't have the 
> knowledge to look at the source code for other languages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965597#comment-15965597
 ] 

Fei Wang edited comment on SPARK-20184 at 4/12/17 9:21 AM:
---

try this :
1. create table
{code}

val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")

{code}

2. query the table
{code}
select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100
{code}


was (Author: scwf):
try this :
1. create table
{code}

val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")

{code}

2. query the table

select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis i think this is related to the huge java method(a java 
> method of thousand lines) which generated by codegen.
> And If i config -XX:-DontCompileHugeMethods the performance get much 
> better(about 7s).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965597#comment-15965597
 ] 

Fei Wang edited comment on SPARK-20184 at 4/12/17 9:21 AM:
---

try this :
1. create table
{code}

val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")

{code}

2. query the table

select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100


was (Author: scwf):
try this :
1. create table
[code]
val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")

df.write.format("csv").saveAsTable("sum_table_50w_1")

[code]

2. query the table

select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis i think this is related to the huge java method(a java 
> method of thousand lines) which generated by codegen.
> And If i config -XX:-DontCompileHugeMethods the performance get much 
> better(about 7s).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2017-04-12 Thread Fei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965597#comment-15965597
 ] 

Fei Wang commented on SPARK-20184:
--

try this :
1. create table
[code]
val df = (1 to 50).map(x => (x.toString, x.toString, x, x, x, x, x, x, x, 
x, x, x, x, x, x, x, x, x, x, x, x, x)).toDF("dim_1", "dim_2", "c1", "c2", 
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10","c11", "c12", "c13", "c14", 
"c15", "c16", "c17", "c18", "c19", "c20")
df.write.saveAsTable("sum_table_50w_3")

df.write.format("csv").saveAsTable("sum_table_50w_1")

[code]

2. query the table

select dim_1, dim_2, sum(c1), sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), 
sum(c7), sum(c8), sum(c9), sum(c10), sum(c11), sum(c12), sum(c13), sum(c14), 
sum(c15), sum(c16), sum(c17), sum(c18), sum(c19), sum(c20) from sum_table_50w_3 
group by dim_1, dim_2 limit 100

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>
> The performance of following SQL get much worse in spark 2.x  in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Num of rows of aggtable is about 3500.
> whole stage codegen on(spark.sql.codegen.wholeStage = true):40s
> whole stage codegen  off(spark.sql.codegen.wholeStage = false):6s
> After some analysis i think this is related to the huge java method(a java 
> method of thousand lines) which generated by codegen.
> And If i config -XX:-DontCompileHugeMethods the performance get much 
> better(about 7s).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-12 Thread João Pedro Jericó (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965590#comment-15965590
 ] 

João Pedro Jericó commented on SPARK-20294:
---

Yes, I think that would be a nice solution, i.e., given that we *know* 
it's not empty, we should always try to do something. Right now Spark skips 
directly to sampling and raises an error if it fails.



> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: João Pedro Jericó
>Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This is possible for example if one has a small RDD but that needs 
> the schema to be inferred by more than one Row. For example:
> {code}
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> {code}
> This will fail with high probability because when sampling the small_rdd with 
> the .sample method it will return an empty RDD most of the time. However, 
> this is not the desired result because we are able to sample at least 1% of 
> the RDD.
> This is probably a problem with the other Spark APIs however I don't have the 
> knowledge to look at the source code for other languages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20306) beeline connect spark thrift server failure

2017-04-12 Thread sydt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sydt updated SPARK-20306:
-
Description: 
Beeline connect spark thrift server of spark-1.6.2 with kerberos failure and 
error is :
Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
Unsupported mechanism type GSSAPI
The connect command is :
!connect 
jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
. However, it works well before. 
When my hadoop cluster equipped with federation ,it can not work well


  was:
Beeline fails to connect to the Spark Thrift Server of Spark 1.6.2 with 
Kerberos, and the error is:
ERROR server.TThreadPoolServer: Error occurred during processing of message.
Peer indicated failure: Unsupported mechanism type GSSAPI.
The connect command is :
!connect 
jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
However, this worked well before. 
When my Hadoop cluster is equipped with federation, it does not work well.



> beeline connect spark thrift server  failure
> 
>
> Key: SPARK-20306
> URL: https://issues.apache.org/jira/browse/SPARK-20306
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.2
>Reporter: sydt
>
> Beeline fails to connect to the Spark Thrift Server of Spark 1.6.2 with 
> Kerberos, and the error is:
> Error occurred during processing of message.
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: 
> Unsupported mechanism type GSSAPI
> The connect command is :
> !connect 
> jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
> However, this worked well before. 
> When my Hadoop cluster is equipped with federation, it does not work well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20306) beeline connect spark thrift server failure

2017-04-12 Thread sydt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sydt updated SPARK-20306:
-
Description: 
Beeline fails to connect to the Spark Thrift Server of Spark 1.6.2 with 
Kerberos, and the error is:
ERROR server.TThreadPoolServer: Error occurred during processing of message.
Peer indicated failure: Unsupported mechanism type GSSAPI.
The connect command is :
!connect 
jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
However, this worked well before. 
When my Hadoop cluster is equipped with federation, it does not work well.


  was:
Beeline fails to connect to the Spark Thrift Server of Spark 1.6.2 with 
Kerberos, and the error is:
Peer indicated failure: Unsupported mechanism type GSSAPI.
The connect command is :
!connect 
jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
However, this worked well before. 
When my Hadoop cluster is equipped with federation, it does not work well.



> beeline connect spark thrift server  failure
> 
>
> Key: SPARK-20306
> URL: https://issues.apache.org/jira/browse/SPARK-20306
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.2
>Reporter: sydt
>
> Beeline fails to connect to the Spark Thrift Server of Spark 1.6.2 with 
> Kerberos, and the error is:
> ERROR server.TThreadPoolServer: Error occurred during processing of message.
> Peer indicated failure: Unsupported mechanism type GSSAPI.
> The connect command is :
> !connect 
> jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
> However, this worked well before. 
> When my Hadoop cluster is equipped with federation, it does not work well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965579#comment-15965579
 ] 

Sean Owen commented on SPARK-20294:
---

CC [~andrewor14] who might be able to comment better on this.

If there's no sampling, it tries the first element, then first 100, then fails 
and asks you to try sampling.
If there's sampling, it just tries to operate on the sample. 

Would it be more sensible to:

- Try the first element
- Try the first hundred
- Try a sample
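
A minimal Python sketch of that proposed fallback order (not the actual 
implementation in session.py; the helper below is hypothetical and is assumed 
to raise ValueError when the rows it is given do not determine a schema):

{code:python}
def infer_schema_with_fallback(rdd, sampling_ratio, infer_from_rows):
    """Try progressively larger subsets of the RDD before sampling.

    infer_from_rows is a hypothetical callable that raises ValueError when
    the given rows are not enough to infer a complete schema.
    """
    # 1. Try the first element.
    try:
        return infer_from_rows(rdd.take(1))
    except ValueError:
        pass
    # 2. Try the first hundred elements.
    try:
        return infer_from_rows(rdd.take(100))
    except ValueError:
        pass
    # 3. Fall back to a sample of the whole RDD.
    sampled = rdd.sample(withReplacement=False, fraction=sampling_ratio).collect()
    return infer_from_rows(sampled)
{code}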



> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: João Pedro Jericó
>Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This can happen with a small RDD whose schema needs to be 
> inferred from more than one Row. For example:
> {code}
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> {code}
> This will fail with high probability because sampling small_rdd with the 
> .sample method will return an empty RDD most of the time. However, this is 
> not the desired result, because we are able to sample at least 1% of the RDD.
> This is probably a problem with the other Spark APIs as well; however, I 
> don't have the knowledge to look at the source code for other languages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20306) beeline connect spark thrift server failure

2017-04-12 Thread sydt (JIRA)
sydt created SPARK-20306:


 Summary: beeline connect spark thrift server  failure
 Key: SPARK-20306
 URL: https://issues.apache.org/jira/browse/SPARK-20306
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.6.2
Reporter: sydt


Beeline fails to connect to the Spark Thrift Server of Spark 1.6.2 with 
Kerberos, and the error is:
Peer indicated failure: Unsupported mechanism type GSSAPI.
The connect command is :
!connect 
jdbc:hive2://10.142.78.249:10010/default;principal=sparksql/nm-304-sa5212m4-bigdata-...@hadoop.sandbox.chinatelecom.cn
However, this worked well before. 
When my Hadoop cluster is equipped with federation, it does not work well.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965568#comment-15965568
 ] 

João Pedro Jericó edited comment on SPARK-20294 at 4/12/17 8:56 AM:


Yes, if the sampling ratio is not given it infers from the first row... still, maybe 
I'm nitpicking but my use case here is that I'm working with RDDs pulled from 
data that can vary from being ~50 rows long to 100k+, and I need to do a 
flatMap operation on them and then send them back to being a DF. However, the 
schema for the first row is not necessarily the schema for the whole thing 
because some Rows are missing some entries. The solution I used was to set the 
samplingRatio at 0.01, which works very well except for the RDDs below 100 
entries, where the sampling ratio is so small it has a chance of failing.

The solution I came up with was to set the sampling ratio as {code:python} 
min(100., N) / N {code}, which is either 1% of the DF or everything if the 
dataframe is smaller than 100, but I think this is not ideal. If we know that 
the RDD is not empty (the function tests for that before line 354), then we 
should at least use the first row as a fallback if the sampling fails.


was (Author: jpjandrade):
Yes, if the sampling ratio is not given it infers from the first row... maybe I'm 
nitpicking but my use case here is that I'm working with RDDs pulled from data 
that can vary from being ~50 rows long to 100k+, and I need to do a flatMap 
operation on them and then send them back to being a DF. However, the schema 
for the first row is not necessarily the schema for the whole thing because 
some Rows are missing some entries. The solution I used was to set the 
samplingRatio at 0.01, which works very well except for the RDDs below 100 
entries, where the sampling ratio is so small it has a chance of failing.

The solution I came up with was to set the sampling ratio as {code:python} 
min(100., N) / N {code}, which is either 1% of the DF or everything if the 
dataframe is smaller than 100, but I think this is not ideal. If we know that 
the RDD is not empty (the function tests for that before line 354), then we 
should at least use the first row as a fallback if the sampling fails.

> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: João Pedro Jericó
>Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This can happen with a small RDD whose schema needs to be 
> inferred from more than one Row. For example:
> ```python
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> ```
> This will fail with high probability because sampling small_rdd with the 
> .sample method will return an empty RDD most of the time. However, this is 
> not the desired result, because we are able to sample at least 1% of the RDD.
> This is probably a problem with the other Spark APIs as well; however, I 
> don't have the knowledge to look at the source code for other languages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-12 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

João Pedro Jericó updated SPARK-20294:
--
Description: 
Currently the _inferSchema function on 
[session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
 line 354 fails if applied to an RDD for which the sample call returns an empty 
RDD. This can happen with a small RDD whose schema needs to be inferred from 
more than one Row. For example:

{code}
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
{code}

This will fail with high probability because sampling small_rdd with the 
.sample method will return an empty RDD most of the time. However, this is not 
the desired result, because we are able to sample at least 1% of the RDD.

This is probably a problem with the other Spark APIs as well; however, I don't 
have the knowledge to look at the source code for other languages.

  was:
Currently the _inferSchema function on 
[session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
 line 354 fails if applied to an RDD for which the sample call returns an empty 
RDD. This can happen with a small RDD whose schema needs to be inferred from 
more than one Row. For example:

```python
small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
small_rdd.toDF(samplingRatio=0.01).show()
```

This will fail with high probability because sampling small_rdd with the 
.sample method will return an empty RDD most of the time. However, this is not 
the desired result, because we are able to sample at least 1% of the RDD.

This is probably a problem with the other Spark APIs as well; however, I don't 
have the knowledge to look at the source code for other languages.


> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: João Pedro Jericó
>Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This can happen with a small RDD whose schema needs to be 
> inferred from more than one Row. For example:
> {code}
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> {code}
> This will fail with high probability because sampling small_rdd with the 
> .sample method will return an empty RDD most of the time. However, this is 
> not the desired result, because we are able to sample at least 1% of the RDD.
> This is probably a problem with the other Spark APIs as well; however, I 
> don't have the knowledge to look at the source code for other languages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965568#comment-15965568
 ] 

João Pedro Jericó edited comment on SPARK-20294 at 4/12/17 8:57 AM:


Yes, if the sampling ratio is not given it infers from the first row... still, maybe 
I'm nitpicking but my use case here is that I'm working with RDDs pulled from 
data that can vary from being ~50 rows long to 100k+, and I need to do a 
flatMap operation on them and then send them back to being a DF. However, the 
schema for the first row is not necessarily the schema for the whole thing 
because some Rows are missing some entries. The solution I used was to set the 
samplingRatio at 0.01, which works very well except for the RDDs below 100 
entries, where the sampling ratio is so small it has a chance of failing.

The solution I came up with was to set the sampling ratio as {code} min(100., N) / 
N {code}, which is either 1% of the DF or everything if the dataframe is 
smaller than 100, but I think this is not ideal. If we know that the RDD is not 
empty (the function tests for that before line 354), then we should at least 
use the first row as a fallback if the sampling fails.


was (Author: jpjandrade):
Yes, if the sampling ratio is not given it infers from the first row... still, maybe 
I'm nitpicking but my use case here is that I'm working with RDDs pulled from 
data that can vary from being ~50 rows long to 100k+, and I need to do a 
flatMap operation on them and then send them back to being a DF. However, the 
schema for the first row is not necessarily the schema for the whole thing 
because some Rows are missing some entries. The solution I used was to set the 
samplingRatio at 0.01, which works very well except for the RDDs below 100 
entries, where the sampling ratio is so small it has a chance of failing.

The solution I came up with was to set the sampling ratio as {code:python} 
min(100., N) / N {code}, which is either 1% of the DF or everything if the 
dataframe is smaller than 100, but I think this is not ideal. If we know that 
the RDD is not empty (the function tests for that before line 354), then we 
should at least use the first row as a fallback if the sampling fails.

> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: João Pedro Jericó
>Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This can happen with a small RDD whose schema needs to be 
> inferred from more than one Row. For example:
> ```python
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> ```
> This will fail with high probability because sampling small_rdd with the 
> .sample method will return an empty RDD most of the time. However, this is 
> not the desired result, because we are able to sample at least 1% of the RDD.
> This is probably a problem with the other Spark APIs as well; however, I 
> don't have the knowledge to look at the source code for other languages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20294) _inferSchema for RDDs fails if sample returns empty RDD

2017-04-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-20294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965568#comment-15965568
 ] 

João Pedro Jericó commented on SPARK-20294:
---

Yes, if the sampling ratio is not given it infers from the first row... maybe I'm 
nitpicking but my use case here is that I'm working with RDDs pulled from data 
that can vary from being ~50 rows long to 100k+, and I need to do a flatMap 
operation on them and then send them back to being a DF. However, the schema 
for the first row is not necessarily the schema for the whole thing because 
some Rows are missing some entries. The solution I used was to set the 
samplingRatio at 0.01, which works very well except for the RDDs below 100 
entries, where the sampling ratio is so small it has a chance of failing.

The solution I came up with was to set the sampling ratio as {code:python} 
min(100., N) / N {code}, which is either 1% of the DF or everything if the 
dataframe is smaller than 100, but I think this is not ideal. If we know that 
the RDD is not empty (the function tests for that before line 354), then we 
should at least use the first row as a fallback if the sampling fails.
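
For reference, a minimal PySpark sketch of this user-side workaround (the 
example rows are illustrative; min(100., N) / N samples roughly 100 rows of a 
large RDD, or the whole RDD when it has fewer than 100 rows):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sampling-ratio-workaround").getOrCreate()
sc = spark.sparkContext

# Illustrative data: the first row alone does not determine the type of the
# second field (it is None there), so more than one row must be inspected.
small_rdd = sc.parallelize([(1, None), (2, 'foo')])

# Workaround: never let the effective sample drop below ~100 rows.
N = small_rdd.count()
ratio = min(100.0, N) / N

df = spark.createDataFrame(small_rdd, samplingRatio=ratio)
df.show()
{code}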

> _inferSchema for RDDs fails if sample returns empty RDD
> ---
>
> Key: SPARK-20294
> URL: https://issues.apache.org/jira/browse/SPARK-20294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: João Pedro Jericó
>Priority: Minor
>
> Currently the _inferSchema function on 
> [session.py](https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L354)
>  line 354 fails if applied to an RDD for which the sample call returns an 
> empty RDD. This can happen with a small RDD whose schema needs to be 
> inferred from more than one Row. For example:
> ```python
> small_rdd = sc.parallelize([(1, 2), (2, 'foo')])
> small_rdd.toDF(samplingRatio=0.01).show()
> ```
> This will fail with high probability because sampling small_rdd with the 
> .sample method will return an empty RDD most of the time. However, this is 
> not the desired result, because we are able to sample at least 1% of the RDD.
> This is probably a problem with the other Spark APIs as well; however, I 
> don't have the knowledge to look at the source code for other languages.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20237) Spark-1.6 current and later versions of memory management issues

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20237:
--
Description: 
In Spark 1.6 and later versions, there is a problem with the memory management 
in UnifiedMemoryManager.
The spark.memory.storageFraction configuration is supposed to guarantee a 
minimum amount of storage memory.
In UnifiedMemoryManager, the amount of memory that execution can borrow from 
storage is calculated as {{val memoryReclaimableFromStorage = 
math.max(storageMemoryPool.memoryFree, storageMemoryPool.poolSize - 
storageRegionSize)}}.
When {{storageMemoryPool.memoryFree > storageMemoryPool.poolSize - 
storageRegionSize}}, the first term is chosen, that is, the storage pool is 
shrunk by the full {{storageMemoryPool.memoryFree}}.
Because {{storageMemoryPool.memoryFree > storageMemoryPool.poolSize - 
storageRegionSize}}, it follows that {{storageMemoryPool.poolSize - 
storageMemoryPool.memoryFree < storageRegionSize}}.
So the new {{storageMemoryPool.poolSize < storageRegionSize}}, while 
{{storageRegionSize}} is the minimum storage region the framework defines, so 
there is a problem.
To solve this problem, we define it as {{val 
memoryReclaimableFromStorage = storageMemoryPool.poolSize - storageRegionSize}}.


Experimental proof:
I added some log information to the UnifiedMemoryManager file as follows:
{code}
logInfo("storageMemoryPool.memoryFree 
%f".format(storageMemoryPool.memoryFree/1024.0/1024.0))   
logInfo("onHeapExecutionMemoryPool.memoryFree 
%f".format(onHeapExecutionMemoryPool.memoryFree/1024.0/1024.0)) 
logInfo("storageMemoryPool.memoryUsed %f".format( 
storageMemoryPool.memoryUsed/1024.0/1024.0)) 
logInfo("onHeapExecutionMemoryPool.memoryUsed 
%f".format(onHeapExecutionMemoryPool.memoryUsed/1024.0/1024.0)) 
logInfo("storageMemoryPool.poolSize %f".format( 
storageMemoryPool.poolSize/1024.0/1024.0))
logInfo("onHeapExecutionMemoryPool.poolSize 
%f".format(onHeapExecutionMemoryPool.poolSize/1024.0/1024.0))
{code}

  When I run the PageRank program, the input file is generated by BigDataBench 
(from the Chinese Academy of Sciences), which is used to evaluate big data 
analysis tools; the file size is 676 MB. The submit command is as follows:

{code}
./bin/spark-submit --class org.apache.spark.examples.SparkPageRank \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--driver-memory 4g \
--executor-memory 7g \
--executor-cores 6 \
--queue thequeue \
./examples/target/scala-2.10/spark-examples-1.6.2-hadoop2.2.0.jar \
 /test/Google_genGraph_23.txt 
{code}

The configuration is as follows:

{code}
spark.memory.useLegacyMode=false
spark.memory.fraction=0.75
spark.memory.storageFraction=0.2
{code}

Log information is as follows:

{code}
17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: storageMemoryPool.memoryFree 0.00
17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: onHeapExecutionMemoryPool.memoryFree 5663.325877
17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: storageMemoryPool.memoryUsed 0.299123 M
17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: onHeapExecutionMemoryPool.memoryUsed 0.00
17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: storageMemoryPool.poolSize 0.299123
17/02/28 11:07:34 INFO memory.UnifiedMemoryManager: onHeapExecutionMemoryPool.poolSize 5663.325877
{code}

According to the configuration, storageMemoryPool.poolSize should be at least 
1 GB, but the log shows only 0.299123 MB, so there is an error.
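
A small numeric sketch in Python (illustrative values in MB, not the actual 
UnifiedMemoryManager code) of how the current formula can shrink the storage 
pool below storageRegionSize while the proposed formula cannot:

{code:python}
pool_size           = 1200.0   # storageMemoryPool.poolSize
memory_free         = 1100.0   # storageMemoryPool.memoryFree
storage_region_size = 1000.0   # storageRegionSize (the configured minimum)

# Current formula in UnifiedMemoryManager:
reclaimable_current = max(memory_free, pool_size - storage_region_size)
# Proposed formula from this report:
reclaimable_proposed = pool_size - storage_region_size

print(pool_size - reclaimable_current)   # 100.0  -> below storageRegionSize
print(pool_size - reclaimable_proposed)  # 1000.0 -> never below storageRegionSize
{code}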

  was:
In Spark 1.6 and later versions, there is a problem with the memory management 
in UnifiedMemoryManager.
The spark.memory.storageFraction configuration is supposed to guarantee a 
minimum amount of storage memory.
In UnifiedMemoryManager, the amount of memory that execution can borrow from 
storage is calculated as val memoryReclaimableFromStorage = 
math.max(storageMemoryPool.memoryFree, storageMemoryPool.poolSize - 
storageRegionSize).
When storageMemoryPool.memoryFree > storageMemoryPool.poolSize - 
storageRegionSize, the first term is chosen, that is, the storage pool is 
shrunk by the full storageMemoryPool.memoryFree.
Because storageMemoryPool.memoryFree > storageMemoryPool.poolSize - 
storageRegionSize, it follows that storageMemoryPool.poolSize - 
storageMemoryPool.memoryFree < storageRegionSize.
So the new storageMemoryPool.poolSize < storageRegionSize, while 
storageRegionSize is the minimum storage region the framework defines, so 
there is a problem.
To solve this problem, we define it as val 
memoryReclaimableFromStorage = storageMemoryPool.poolSize - storageRegionSize.


Experimental proof:
I added some log information to the UnifiedMemoryManager file as follows:
logInfo("storageMemoryPool.memoryFree 
%f".format(storageMemoryPool.memoryFree/1024.0/1024.0))   
logInfo("onHeapExecutionMemoryPool.memoryFree 

[jira] [Commented] (SPARK-20286) dynamicAllocation.executorIdleTimeout is ignored after unpersist

2017-04-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-20286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965556#comment-15965556
 ] 

Miguel Pérez commented on SPARK-20286:
--

I'm not familiar with the code at all, but I'll try to send a merge request 
when I have time to look into it. I'll be grateful for any kind of help. Thank 
you.

> dynamicAllocation.executorIdleTimeout is ignored after unpersist
> 
>
> Key: SPARK-20286
> URL: https://issues.apache.org/jira/browse/SPARK-20286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Miguel Pérez
>
> With dynamic allocation enabled, it seems that executors whose cached data 
> has been unpersisted are still governed by the 
> {{dynamicAllocation.cachedExecutorIdleTimeout}} configuration, instead of 
> {{dynamicAllocation.executorIdleTimeout}}. Assuming the default configuration 
> ({{dynamicAllocation.cachedExecutorIdleTimeout = Infinity}}), an executor 
> with unpersisted data won't be released until the job ends.
> *How to reproduce*
> - Set different values for {{dynamicAllocation.executorIdleTimeout}} and 
> {{dynamicAllocation.cachedExecutorIdleTimeout}}
> - Load a file into an RDD and persist it
> - Execute an action on the RDD (like a count) so some executors are activated.
> - When the action has finished, unpersist the RDD
> - The application UI correctly removes the persisted data from the *Storage* 
> tab, but if you look in the *Executors* tab, you will find that the executors 
> remain *active* until {{dynamicAllocation.cachedExecutorIdleTimeout}} is 
> reached.
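
A minimal PySpark sketch of those reproduce steps (the timeout values and the 
input path are illustrative; dynamic allocation also assumes the external 
shuffle service is enabled on the cluster):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("idle-timeout-after-unpersist")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         # different values so the two timeouts can be told apart in the UI
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")
         .getOrCreate())
sc = spark.sparkContext

rdd = sc.textFile("hdfs:///tmp/some_large_file.txt")  # illustrative path
rdd.persist()
rdd.count()      # action: executors are allocated and the RDD gets cached
rdd.unpersist()  # expected: executors idle out after 60s; per this report
                 # they stay active until the 600s cached timeout is reached
{code}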



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20302) Short circuit cast when from and to types are structurally the same

2017-04-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20302.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Short circuit cast when from and to types are structurally the same
> ---
>
> Key: SPARK-20302
> URL: https://issues.apache.org/jira/browse/SPARK-20302
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> When we perform a cast expression and the from and to types are structurally 
> the same (having the same structure but different field names), we should be 
> able to skip the actual cast.
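
A small PySpark illustration of the kind of cast this change short-circuits 
(hypothetical column names; the source and target struct types have the same 
field types and only differ in field names):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct

spark = SparkSession.builder.appName("structural-cast").getOrCreate()

# Builds a column of type struct<a:bigint,b:bigint>.
df = spark.range(3).select(struct(col("id").alias("a"),
                                  (col("id") * 2).alias("b")).alias("s"))

# Casting to struct<x:bigint,y:bigint> only renames fields; with SPARK-20302
# the physical cast of such structurally identical types can be skipped.
df.select(col("s").cast("struct<x:bigint,y:bigint>").alias("renamed")).printSchema()
{code}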



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20305) Master may keep in the state of "COMPELETING_RECOVERY",then all the application registered cannot get resources, when the leader master change.

2017-04-12 Thread LvDongrong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LvDongrong updated SPARK-20305:
---
Attachment: failfetchresources.PNG

> Master may keep in the state of "COMPELETING_RECOVERY",then all the 
> application registered cannot get resources, when the leader master change.
> ---
>
> Key: SPARK-20305
> URL: https://issues.apache.org/jira/browse/SPARK-20305
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: LvDongrong
>Priority: Critical
> Attachments: failfetchresources.PNG
>
>
> The Master may stay in the "COMPLETING_RECOVERY" state, and then none of the 
> registered applications can get resources, when the leader Master changes.
> This happened when an exception was thrown while the Master was trying to 
> recover (the completeRecovery method in Master.scala). Then the leader will 
> always stay in the COMPLETING_RECOVERY state, because the leader can only 
> change to ALIVE from the RecoveryState.RECOVERING state.
> !failfetchresources.PNG!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20305) Master may keep in the state of "COMPELETING_RECOVERY",then all the application registered cannot get resources, when the leader master change.

2017-04-12 Thread LvDongrong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LvDongrong updated SPARK-20305:
---
Attachment: (was: failfetchresources.PNG)

> Master may keep in the state of "COMPELETING_RECOVERY",then all the 
> application registered cannot get resources, when the leader master change.
> ---
>
> Key: SPARK-20305
> URL: https://issues.apache.org/jira/browse/SPARK-20305
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: LvDongrong
>Priority: Critical
>
> The Master may stay in the "COMPLETING_RECOVERY" state, and then none of the 
> registered applications can get resources, when the leader Master changes.
> This happened when an exception was thrown while the Master was trying to 
> recover (the completeRecovery method in Master.scala). Then the leader will 
> always stay in the COMPLETING_RECOVERY state, because the leader can only 
> change to ALIVE from the RecoveryState.RECOVERING state.
> !failfetchresources.PNG!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20298) Spelling mistake: charactor

2017-04-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20298.
---
   Resolution: Fixed
 Assignee: Brendan Dwyer
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/17611

> Spelling mistake: charactor
> ---
>
> Key: SPARK-20298
> URL: https://issues.apache.org/jira/browse/SPARK-20298
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Brendan Dwyer
>Assignee: Brendan Dwyer
>Priority: Trivial
> Fix For: 2.2.0
>
>
> "charactor" should be "character"
> {code}
> R/pkg/R/DataFrame.R:2821:  stop("path should be charactor, NULL 
> or omitted.")
> R/pkg/R/DataFrame.R:2828:  stop("mode should be charactor or 
> omitted. It is 'error' by default.")
> R/pkg/R/DataFrame.R:3043:  stop("value should be an integer, 
> numeric, charactor or named list.")
> R/pkg/R/DataFrame.R:3055:  stop("Each item in value should be 
> an integer, numeric or charactor.")
> R/pkg/R/DataFrame.R:3601:  stop("outputMode should be charactor 
> or omitted.")
> R/pkg/R/SQLContext.R:609:stop("path should be charactor, NULL or 
> omitted.")
> R/pkg/inst/tests/testthat/test_sparkSQL.R:2929:   "path should be 
> charactor, NULL or omitted.")
> R/pkg/inst/tests/testthat/test_sparkSQL.R:2931:   "mode should be 
> charactor or omitted. It is 'error' by default.")
> R/pkg/inst/tests/testthat/test_sparkSQL.R:2950:   "path should be 
> charactor, NULL or omitted.")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20305) Master may keep in the state of "COMPELETING_RECOVERY",then all the application registered cannot get resources, when the leader master change.

2017-04-12 Thread LvDongrong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LvDongrong updated SPARK-20305:
---
Attachment: failfetchresources.PNG

> Master may keep in the state of "COMPELETING_RECOVERY",then all the 
> application registered cannot get resources, when the leader master change.
> ---
>
> Key: SPARK-20305
> URL: https://issues.apache.org/jira/browse/SPARK-20305
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: LvDongrong
>Priority: Critical
> Attachments: failfetchresources.PNG
>
>
> The Master may stay in the "COMPLETING_RECOVERY" state, and then none of the 
> registered applications can get resources, when the leader Master changes.
> This happened when an exception was thrown while the Master was trying to 
> recover (the completeRecovery method in Master.scala). Then the leader will 
> always stay in the COMPLETING_RECOVERY state, because the leader can only 
> change to ALIVE from the RecoveryState.RECOVERING state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20305) Master may keep in the state of "COMPELETING_RECOVERY",then all the application registered cannot get resources, when the leader master change.

2017-04-12 Thread LvDongrong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LvDongrong updated SPARK-20305:
---
Description: 
The Master may stay in the "COMPLETING_RECOVERY" state, and then none of the 
registered applications can get resources, when the leader Master changes.
This happened when an exception was thrown while the Master was trying to 
recover (the completeRecovery method in Master.scala). Then the leader will 
always stay in the COMPLETING_RECOVERY state, because the leader can only 
change to ALIVE from the RecoveryState.RECOVERING state.
!failfetchresources.PNG!

  was:
The Master may stay in the "COMPLETING_RECOVERY" state, and then none of the 
registered applications can get resources, when the leader Master changes.
This happened when an exception was thrown while the Master was trying to 
recover (the completeRecovery method in Master.scala). Then the leader will 
always stay in the COMPLETING_RECOVERY state, because the leader can only 
change to ALIVE from the RecoveryState.RECOVERING state.


> Master may keep in the state of "COMPELETING_RECOVERY",then all the 
> application registered cannot get resources, when the leader master change.
> ---
>
> Key: SPARK-20305
> URL: https://issues.apache.org/jira/browse/SPARK-20305
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: LvDongrong
>Priority: Critical
> Attachments: failfetchresources.PNG
>
>
> The Master may stay in the "COMPLETING_RECOVERY" state, and then none of the 
> registered applications can get resources, when the leader Master changes.
> This happened when an exception was thrown while the Master was trying to 
> recover (the completeRecovery method in Master.scala). Then the leader will 
> always stay in the COMPLETING_RECOVERY state, because the leader can only 
> change to ALIVE from the RecoveryState.RECOVERING state.
> !failfetchresources.PNG!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


