[jira] [Resolved] (SPARK-13616) Let SQLBuilder convert logical plan without a Project on top of it

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13616.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.0.0

> Let SQLBuilder convert logical plan without a Project on top of it
> --
>
> Key: SPARK-13616
> URL: https://issues.apache.org/jira/browse/SPARK-13616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> It is possible that a logical plan has had its top-level Project removed, or 
> that the plan never had a top Project to begin with. Currently the SQLBuilder 
> can't convert such plans back to SQL. This issue is opened to add that 
> capability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-02 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177315#comment-15177315
 ] 

Mark Grover commented on SPARK-12177:
-

One more thing as a potential con for Proposal 1:
There are places that have to use the kafka artifact; the 'examples' subproject 
is a good example. It pulls in the kafka artifact as a dependency and contains 
examples of Kafka usage. However, it can't depend on the new implementation's 
artifact at the same time, because the two artifacts depend on different 
versions of Kafka. Therefore, unless I am missing something, the new 
implementation's example can't go there.

That's fine in itself; we can put the example within the new subproject, 
instead of examples, but that won't necessarily work with tooling like 
run-example, etc.
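
For illustration only, a hypothetical sbt fragment of what a single module would 
have to declare to host examples for both connectors. The new artifact's name was 
not settled at the time, so "spark-streaming-kafka-v09" is made up here; the 
conflicting transitive Kafka versions are the reason this cannot live in 'examples':

{code}
// Hypothetical sbt fragment -- artifact names are illustrative, not final.
// Both dependencies cannot coexist in one module because they pull in
// different transitive versions of org.apache.kafka:kafka.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kafka"     % "2.0.0-SNAPSHOT",  // built against Kafka 0.8.x
  "org.apache.spark" %% "spark-streaming-kafka-v09" % "2.0.0-SNAPSHOT"   // built against Kafka 0.9.x
)
{code}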

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I did not remove 
> the old classes, for backward compatibility: users will not need to change 
> their existing Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13621) TestExecutor.scala needs to be moved to test package

2016-03-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13621.
-
   Resolution: Fixed
 Assignee: Devaraj K
Fix Version/s: 2.0.0

> TestExecutor.scala needs to be moved to test package
> 
>
> Key: SPARK-13621
> URL: https://issues.apache.org/jira/browse/SPARK-13621
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Trivial
> Fix For: 2.0.0
>
>
> TestExecutor.scala lives in the package 
> core\src\main\scala\org\apache\spark\deploy\client\ but is used only by test 
> classes. It should be moved to the test package, i.e. 
> core\src\test\scala\org\apache\spark\deploy\client\, since it exists purely 
> for testing.
> Also, core\src\main\scala\org\apache\spark\deploy\client\TestClient.scala is 
> not used anywhere yet is present in the main sources; I think it can be 
> removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13642) Inconsistent finishing state between driver and AM

2016-03-02 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-13642:
---

 Summary: Inconsistent finishing state between driver and AM 
 Key: SPARK-13642
 URL: https://issues.apache.org/jira/browse/SPARK-13642
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.6.0
Reporter: Saisai Shao


Currently, when running Spark on YARN in yarn-cluster mode, the default final 
application state is "SUCCEEDED"; if any exception occurs, the final state is 
changed to "FAILED" and a reattempt is triggered if possible.

This is fine in the normal case, but there is a race condition when the AM 
receives a signal (SIGTERM) and no exception is thrown. In that situation the 
shutdown hook is invoked and marks the application as finished with success, 
and there is no further attempt.

In such a situation, from Spark's point of view the application has actually 
failed and needs another attempt, but from YARN's point of view it finished 
successfully.

This can happen when an NM fails: the NM failure sends SIGTERM to the AM, and 
the AM should mark the attempt as failed and rerun, not invoke unregister.

To increase the chance of hitting this race condition, here is reproduction code:

{code}
val sc = ...
Thread.sleep(3L)
sc.parallelize(1 to 100).collect()
{code}

If the AM fails while sleeping, no exception is thrown, so from YARN's point of 
view the application finished successfully, but from Spark's point of view it 
should be reattempted.

So basically, I think we should mark the application as "SUCCESS" only after 
the user class finishes; otherwise, especially in the signal-stop scenario, it 
would be better to mark it as failed and try again (except for an explicit KILL 
command from YARN).
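
Below is a minimal, self-contained sketch of that idea; it is illustrative only 
and uses made-up names, not Spark's actual ApplicationMaster code. The point is 
that a SIGTERM-triggered shutdown hook can distinguish "user class completed" 
from "interrupted mid-run" with a simple flag:

{code}
// Illustrative sketch (hypothetical names, not Spark's ApplicationMaster):
// report success only if the user class actually ran to completion; a shutdown
// hook fired by SIGTERM before that point reports failure instead, which would
// let YARN schedule another attempt.
object FinalStatusSketch {
  @volatile private var userClassFinished = false

  def main(args: Array[String]): Unit = {
    sys.addShutdownHook {
      val status = if (userClassFinished) "SUCCEEDED" else "FAILED"
      // In the real AM this would drive unregister()/reattempt; here we just log it.
      System.err.println(s"final status: $status")
    }

    Thread.sleep(30000L)      // stands in for the user class; a SIGTERM (e.g. from a dying NM) may arrive here
    println("user class done")
    userClassFinished = true  // only now is SUCCEEDED justified
  }
}
{code}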




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13642) Inconsistent finishing state between driver and AM

2016-03-02 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13642:

Description: 
Currently, when running Spark on YARN in yarn-cluster mode, the default final 
application state is "SUCCEEDED"; if any exception occurs, the final state is 
changed to "FAILED" and a reattempt is triggered if possible.

This is fine in the normal case, but there is a race condition if the AM 
receives a signal (SIGTERM) and no exception is thrown. In that situation the 
shutdown hook is invoked and marks the application as finished with success, 
and there is no further attempt.

In such a situation, from Spark's point of view the application has actually 
failed and needs another attempt, but from YARN's point of view it finished 
successfully.

This can happen when an NM fails: the NM failure sends SIGTERM to the AM, and 
the AM should mark the attempt as failed and rerun, not invoke unregister.

To increase the chance of hitting this race condition, here is reproduction code:

{code}
val sc = ...
Thread.sleep(3L)
sc.parallelize(1 to 100).collect()
{code}

If the AM fails while sleeping, no exception is thrown, so from YARN's point of 
view the application finished successfully, but from Spark's point of view it 
should be reattempted.

So basically, I think we should mark the application as "SUCCESS" only after 
the user class finishes; otherwise, especially in the signal-stop scenario, it 
would be better to mark it as failed and try again (except for an explicit KILL 
command from YARN).


  was:
Currently, when running Spark on YARN in yarn-cluster mode, the default final 
application state is "SUCCEEDED"; if any exception occurs, the final state is 
changed to "FAILED" and a reattempt is triggered if possible.

This is fine in the normal case, but there is a race condition when the AM 
receives a signal (SIGTERM) and no exception is thrown. In that situation the 
shutdown hook is invoked and marks the application as finished with success, 
and there is no further attempt.

In such a situation, from Spark's point of view the application has actually 
failed and needs another attempt, but from YARN's point of view it finished 
successfully.

This can happen when an NM fails: the NM failure sends SIGTERM to the AM, and 
the AM should mark the attempt as failed and rerun, not invoke unregister.

To increase the chance of hitting this race condition, here is reproduction code:

{code}
val sc = ...
Thread.sleep(3L)
sc.parallelize(1 to 100).collect()
{code}

If the AM fails while sleeping, no exception is thrown, so from YARN's point of 
view the application finished successfully, but from Spark's point of view it 
should be reattempted.

So basically, I think we should mark the application as "SUCCESS" only after 
the user class finishes; otherwise, especially in the signal-stop scenario, it 
would be better to mark it as failed and try again (except for an explicit KILL 
command from YARN).



> Inconsistent finishing state between driver and AM 
> ---
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default final 
> application state is "SUCCEEDED"; if any exception occurs, the final state is 
> changed to "FAILED" and a reattempt is triggered if possible.
> This is fine in the normal case, but there is a race condition if the AM 
> receives a signal (SIGTERM) and no exception is thrown. In that situation the 
> shutdown hook is invoked and marks the application as finished with success, 
> and there is no further attempt.
> In such a situation, from Spark's point of view the application has actually 
> failed and needs another attempt, but from YARN's point of view it finished 
> successfully.
> This can happen when an NM fails: the NM failure sends SIGTERM to the AM, and 
> the AM should mark the attempt as failed and rerun, not invoke unregister.
> To increase the chance of hitting this race condition, here is reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM fails while sleeping, no exception is thrown, so from YARN's point of 
> view the application finished successfully, but from Spark's point of view it 
> should be reattempted.
> So basically, I think we should mark the application as "SUCCESS" only after 
> the user class finishes; otherwise, especially in the signal-stop scenario, it 
> would be better to mark it as failed and try again (except for an explicit KILL 
> command from YARN).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SPARK-13642) Inconsistent finishing state between driver and AM

2016-03-02 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177347#comment-15177347
 ] 

Saisai Shao commented on SPARK-13642:
-

[~tgraves] [~vanzin], would you please comment on this: why is the default 
application final state "SUCCESS"? Would it be better to mark the application as 
"SUCCESS" only after the user class has exited? Thanks a lot.

> Inconsistent finishing state between driver and AM 
> ---
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default final 
> application state is "SUCCEEDED"; if any exception occurs, the final state is 
> changed to "FAILED" and a reattempt is triggered if possible.
> This is fine in the normal case, but there is a race condition if the AM 
> receives a signal (SIGTERM) and no exception is thrown. In that situation the 
> shutdown hook is invoked and marks the application as finished with success, 
> and there is no further attempt.
> In such a situation, from Spark's point of view the application has actually 
> failed and needs another attempt, but from YARN's point of view it finished 
> successfully.
> This can happen when an NM fails: the NM failure sends SIGTERM to the AM, and 
> the AM should mark the attempt as failed and rerun, not invoke unregister.
> To increase the chance of hitting this race condition, here is reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM fails while sleeping, no exception is thrown, so from YARN's point of 
> view the application finished successfully, but from Spark's point of view it 
> should be reattempted.
> So basically, I think we should mark the application as "SUCCESS" only after 
> the user class finishes; otherwise, especially in the signal-stop scenario, it 
> would be better to mark it as failed and try again (except for an explicit KILL 
> command from YARN).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2016-03-02 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13446:
--
Issue Type: Improvement  (was: Bug)

Can't you build against the newer version of Hive? That much is needed, of 
course; I don't know whether it's all that's needed.
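
For reference, and only as a hedged sketch: Spark already exposes configuration 
for choosing the metastore client version and jars, though Hive 2.0.0 was not an 
accepted value at the time, so this shows the mechanism rather than a working 
2.0.0 setup (the jar path is a placeholder):

{code}
// Sketch: selecting the Hive metastore client version via existing configuration.
// "2.0.0" would be rejected until support is actually added; 1.2.1 is shown instead.
val conf = new org.apache.spark.SparkConf()
  .set("spark.sql.hive.metastore.version", "1.2.1")
  .set("spark.sql.hive.metastore.jars", "/opt/hive/lib/*")
{code}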

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it would be good to add support for it.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13311) prettyString of IN is not good

2016-03-02 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177357#comment-15177357
 ] 

Xiao Li commented on SPARK-13311:
-

After the merge of https://github.com/apache/spark/pull/10757, I think the 
problem is resolved. 

> prettyString of IN is not good
> --
>
> Key: SPARK-13311
> URL: https://issues.apache.org/jira/browse/SPARK-13311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> In(i_class,[Ljava.lang.Object;@1a575883))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2016-03-02 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177356#comment-15177356
 ] 

Adrian Wang commented on SPARK-13446:
-

That's not enough. We still need some code change.

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it would be good to add support for it.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13568) Create feature transformer to impute missing values

2016-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177368#comment-15177368
 ] 

Nick Pentreath commented on SPARK-13568:


OK - the Imputer will need to compute column stats ignoring NaNs, so 
SPARK-13639 should add that (whether as the default behaviour or as an optional 
argument).
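
As a rough illustration of what "ignoring NaNs" means in practice, here is a 
hedged spark-shell sketch of mean imputation using only existing DataFrame 
operations; the column name and the sqlContext are assumptions, and this is not 
the eventual Imputer API:

{code}
import org.apache.spark.sql.functions.{avg, col}

// Toy column with a missing value encoded as NaN.
val df = sqlContext.createDataFrame(Seq(Tuple1(1.0), Tuple1(Double.NaN), Tuple1(3.0))).toDF("x")

// The statistic must skip NaNs, otherwise the mean itself becomes NaN.
val meanX = df.filter(!col("x").isNaN).agg(avg("x")).first().getDouble(0)  // 2.0

// Replace NaN with the computed mean.
df.na.fill(Map("x" -> meanX)).show()
{code}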

> Create feature transformer to impute missing values
> ---
>
> Key: SPARK-13568
> URL: https://issues.apache.org/jira/browse/SPARK-13568
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> It is quite common to encounter missing values in data sets. It would be 
> useful to implement a {{Transformer}} that can impute missing data points, 
> similar to e.g. {{Imputer}} in 
> [scikit-learn|http://scikit-learn.org/dev/modules/preprocessing.html#imputation-of-missing-values].
> Initially, options for imputation could include {{mean}}, {{median}} and 
> {{most frequent}}, but we could add various other approaches. Where possible 
> existing DataFrame code can be used (e.g. for approximate quantiles etc).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13600) Incorrect number of buckets in QuantileDiscretizer

2016-03-02 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177392#comment-15177392
 ] 

Xusen Yin commented on SPARK-13600:
---

Vote for the new method.

> Incorrect number of buckets in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> Under certain circumstances, QuantileDiscretizer fails to calculate the 
> correct splits resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
> The problem appears to be in the QuantileDiscretizer.findSplitsCandidates 
> method.
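
For reference, the arithmetic behind the report: a splits array of length n 
defines n - 1 buckets, so the 7 split points above yield 6 buckets rather than 
the requested 5. A trivial check:

{code}
val splits = Array(Double.NegativeInfinity, 2.0, 4.0, 6.0, 8.0, 10.0, Double.PositiveInfinity)
println(splits.length - 1)  // 6 buckets, one more than numBuckets = 5
{code}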



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13643) Create SparkSession interface

2016-03-02 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13643:
---

 Summary: Create SparkSession interface
 Key: SPARK-13643
 URL: https://issues.apache.org/jira/browse/SPARK-13643
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13612) Multiplication of BigDecimal columns not working as expected

2016-03-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177428#comment-15177428
 ] 

Liang-Chi Hsieh commented on SPARK-13612:
-

Because the internal type for BigDecimal is Decimal(38, 18) by default (you can 
print the schema of x and y), the result scale of x("a") * y("b") will be 
18 + 18 = 36. That is detected as an overflow, so you get a null value back.

You can cast the decimal columns to a suitable precision and scale, e.g.:

{{code}}

val newX = x.withColumn("a", x("a").cast(DecimalType(10, 1)))
val newY = y.withColumn("b", y("b").cast(DecimalType(10, 1)))

newX.join(newY, newX("id") === newY("id")).withColumn("z", newX("a") * 
newY("b")).show

+---+----+---+----+------+
| id|   a| id|   b|     z|
+---+----+---+----+------+
|  1|10.0|  1|10.0|100.00|
+---+----+---+----+------+

{{code}}
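
To see the default mapping mentioned above, printing the schema in the 
spark-shell shows the BigDecimal column arriving as decimal(38,18); the output 
below is sketched from that default rather than captured from a run:

{code}
x.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- a: decimal(38,18) (nullable = true)
{code}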


> Multiplication of BigDecimal columns not working as expected
> 
>
> Key: SPARK-13612
> URL: https://issues.apache.org/jira/browse/SPARK-13612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Varadharajan
>
> Please consider the below snippet:
> {code}
> case class AM(id: Int, a: BigDecimal)
> case class AX(id: Int, b: BigDecimal)
> val x = sc.parallelize(List(AM(1, 10))).toDF
> val y = sc.parallelize(List(AX(1, 10))).toDF
> x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show
> {code}
> output:
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|null|
> {code}
> Here the multiplication of the columns ("z") returns null instead of 100.
> For now we are using the workaround below, but this definitely looks like a 
> serious issue.
> {code}
> x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / 
> y("b"))).show
> {code}
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|100.0...|
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13612) Multiplication of BigDecimal columns not working as expected

2016-03-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177428#comment-15177428
 ] 

Liang-Chi Hsieh edited comment on SPARK-13612 at 3/3/16 7:35 AM:
-

Because the internal type for BigDecimal is Decimal(38, 18) by default (you can 
print the schema of x and y), the result scale of x("a") * y("b") will be 
18 + 18 = 36. That is detected as an overflow, so you get a null value back.

You can cast the decimal columns to a suitable precision and scale, e.g.:

{code}

val newX = x.withColumn("a", x("a").cast(DecimalType(10, 1)))
val newY = y.withColumn("b", y("b").cast(DecimalType(10, 1)))

newX.join(newY, newX("id") === newY("id")).withColumn("z", newX("a") * 
newY("b")).show

+---+----+---+----+------+
| id|   a| id|   b|     z|
+---+----+---+----+------+
|  1|10.0|  1|10.0|100.00|
+---+----+---+----+------+

{code}



was (Author: viirya):
Because the internal type for BigDecimal is Decimal(38, 18) by default (you can 
print the schema of x and y), the result scale of x("a") * y("b") will be 
18 + 18 = 36. That is detected as an overflow, so you get a null value back.

You can cast the decimal columns to a suitable precision and scale, e.g.:

{{code}}

val newX = x.withColumn("a", x("a").cast(DecimalType(10, 1)))
val newY = y.withColumn("b", y("b").cast(DecimalType(10, 1)))

newX.join(newY, newX("id") === newY("id")).withColumn("z", newX("a") * 
newY("b")).show

+---+----+---+----+------+
| id|   a| id|   b|     z|
+---+----+---+----+------+
|  1|10.0|  1|10.0|100.00|
+---+----+---+----+------+

{{code}}


> Multiplication of BigDecimal columns not working as expected
> 
>
> Key: SPARK-13612
> URL: https://issues.apache.org/jira/browse/SPARK-13612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Varadharajan
>
> Please consider the below snippet:
> {code}
> case class AM(id: Int, a: BigDecimal)
> case class AX(id: Int, b: BigDecimal)
> val x = sc.parallelize(List(AM(1, 10))).toDF
> val y = sc.parallelize(List(AX(1, 10))).toDF
> x.join(y, x("id") === y("id")).withColumn("z", x("a") * y("b")).show
> {code}
> output:
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|null|
> {code}
> Here the multiplication of the columns ("z") returns null instead of 100.
> For now we are using the workaround below, but this definitely looks like a 
> serious issue.
> {code}
> x.join(y, x("id") === y("id")).withColumn("z", x("a") / (expr("1") / 
> y("b"))).show
> {code}
> {code}
> | id|   a| id|   b|   z|
> |  1|10.00...|  1|10.00...|100.0...|
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype

2016-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177436#comment-15177436
 ] 

Apache Spark commented on SPARK-12941:
--

User 'thomastechs' has created a pull request for this issue:
https://github.com/apache/spark/pull/11489

> Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR 
> datatype
> --
>
> Key: SPARK-12941
> URL: https://issues.apache.org/jira/browse/SPARK-12941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: Apache Spark 1.4.2.2
>Reporter: Jose Martinez Poblete
>Assignee: Thomas Sebastian
> Fix For: 1.4.2, 1.5.3, 1.6.2, 2.0.0
>
>
> When exporting data from Spark to Oracle, string datatypes are translated to 
> TEXT for Oracle, which leads to the following error:
> {noformat}
> java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype
> {noformat}
> As per the following code:
> https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144
> See also:
> http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
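
Separately from the fix in the pull request above, a hedged user-side workaround 
sketch: register a custom JdbcDialect so StringType maps to an Oracle-friendly 
type when writing (the VARCHAR2 length here is an arbitrary choice):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

object OracleVarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR2(255)", Types.VARCHAR))
    case _          => None  // fall back to the default mappings
  }
}

JdbcDialects.registerDialect(OracleVarcharDialect)
{code}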



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13589) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType

2016-03-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177437#comment-15177437
 ] 

Liang-Chi Hsieh commented on SPARK-13589:
-

[~lian cheng] I think this is already solved in SPARK-13537.

> Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
> ---
>
> Key: SPARK-13589
> URL: https://issues.apache.org/jira/browse/SPARK-13589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>  Labels: flaky-test
>
> Here are a few sample build failures caused by this test case:
> # 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52164/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
> # 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52154/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
> # 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52153/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
> (I've pinned these builds on Jenkins so that they won't be cleaned up.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit

2016-03-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13635.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11483
[https://github.com/apache/spark/pull/11483]

> Enable LimitPushdown optimizer rule because we have whole-stage codegen for 
> Limit
> -
>
> Key: SPARK-13635
> URL: https://issues.apache.org/jira/browse/SPARK-13635
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> The LimitPushdown optimizer rule was disabled because there was no whole-stage 
> codegen for Limit. Now that Limit has whole-stage codegen, we should enable the rule.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects

2016-03-02 Thread Zuo Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177441#comment-15177441
 ] 

Zuo Wang commented on SPARK-13531:
--

Caused by the commit in https://issues.apache.org/jira/browse/SPARK-13329

> Some DataFrame joins stopped working with UnsupportedOperationException: No 
> size estimation available for objects
> -
>
> Key: SPARK-13531
> URL: https://issues.apache.org/jira/browse/SPARK-13531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: koert kuipers
>Priority: Minor
>
> this is using spark 2.0.0-SNAPSHOT
> dataframe df1:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , obj#135: object, [if (input[0, object].isNullAt) 
> null else input[0, object].get AS x#128]
> +- MapPartitions , createexternalrow(if (isnull(x#9)) null else 
> x#9), [input[0, object] AS obj#135]
>+- WholeStageCodegen
>   :  +- Project [_1#8 AS x#9]
>   : +- Scan ExistingRDD[_1#8]{noformat}
> show:
> {noformat}+---+
> |  x|
> +---+
> |  2|
> |  3|
> +---+{noformat}
> dataframe df2:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true), 
> StructField(y,StringType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else 
> y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get 
> AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
> object].get, true) AS y#131]
> +- WholeStageCodegen
>:  +- Project [_1#0 AS x#2,_2#1 AS y#3]
>: +- Scan ExistingRDD[_1#0,_2#1]{noformat}
> show:
> {noformat}+---+---+
> |  x|  y|
> +---+---+
> |  1|  1|
> |  2|  2|
> |  3|  3|
> +---+---+{noformat}
> I run:
> df1.join(df2, Seq("x")).show
> I get:
> {noformat}java.lang.UnsupportedOperationException: No size estimation 
> available for objects.
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat}
> Not sure what changed; this ran about a week ago without issues (in our 
> internal unit tests). It is fully reproducible, but when I tried to minimize 
> the issue I could not reproduce it by just creating data frames in the REPL 
> with the same contents, so it probably has something to do with the way these 
> are created (from Row objects and StructTypes).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit

2016-03-02 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177445#comment-15177445
 ] 

Liang-Chi Hsieh commented on SPARK-13635:
-

[~davies] Can you help update the Assignee field? Thanks!

> Enable LimitPushdown optimizer rule because we have whole-stage codegen for 
> Limit
> -
>
> Key: SPARK-13635
> URL: https://issues.apache.org/jira/browse/SPARK-13635
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> The LimitPushdown optimizer rule was disabled because there was no whole-stage 
> codegen for Limit. Now that Limit has whole-stage codegen, we should enable the rule.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


