[jira] [Created] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM

2018-01-30 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-23275:


 Summary: hive/tests have been failing when run locally on the 
laptop (Mac) with OOM 
 Key: SPARK-23275
 URL: https://issues.apache.org/jira/browse/SPARK-23275
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Dilip Biswal


Hive tests have been failing when run locally (macOS) after a recent change on trunk. After running for some time, the tests fail with an OOM: "Error: unable to create new native thread".

I noticed the thread count climbs to 2000+, after which we start getting these OOM errors. Most of the threads appear to belong to the connection pool in the Hive metastore (BoneCP-x- ). This behaviour change started after we made the following change to HiveClientImpl.reset():

{code}
def reset(): Unit = withHiveState {
  try {
    // code
  } finally {
    runSqlHive("USE default")  // <=== this is causing the issue
  }
}
{code}


I am proposing to temporarily back out part of the fix made to address SPARK-23000 to resolve this issue, while we work out the exact reason for this sudden increase in thread count.
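For anyone trying to confirm the symptom locally, a small sketch like the following (not from the JIRA, just an illustration) can be pasted into a test or the REPL to watch the BoneCP thread count grow between suites:

{code}
// Illustration only: count live threads belonging to the metastore connection
// pool (their names start with "BoneCP") versus the total thread count.
import java.lang.management.ManagementFactory

val tmx = ManagementFactory.getThreadMXBean
val boneCpThreads = tmx.dumpAllThreads(false, false)
  .count(_.getThreadName.startsWith("BoneCP"))
println(s"BoneCP threads: $boneCpThreads, total threads: ${tmx.getThreadCount}")
{code}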




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23095) Decorrelation of scalar subquery fails with java.util.NoSuchElementException.

2018-01-16 Thread Dilip Biswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-23095:
-
Description: 
The following SQL, which involves a correlated scalar subquery, fails with a map lookup exception.
{code:java}
SELECT t1a
FROM   t1
WHERE  t1a = (SELECT   count(*)
              FROM     t2
              WHERE    t2c = t1c
              HAVING   count(*) >= 1)

{code}
{code:java}
 
key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426)
{code}

In this case, after statically evaluating the HAVING clause "count(*) >= 1" against the binding of the aggregation result on empty input, we determine that this query does not have the count bug. evalSubqueryOnZeroTups should simply return an empty value.
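For completeness, a Scala sketch of the same repro (table contents are immaterial, since the failure happens while optimizing the plan):

{code}
// Sketch: the same query issued through spark.sql; explain() alone is enough to
// hit the failure because RewriteCorrelatedScalarSubquery runs at optimization
// time, before any data is read.
val df = spark.sql("""
  SELECT t1a
  FROM   t1
  WHERE  t1a = (SELECT count(*)
                FROM   t2
                WHERE  t2c = t1c
                HAVING count(*) >= 1)""")
df.explain(true)  // currently throws java.util.NoSuchElementException: key not found: ExprId(...)
{code}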

  was:
The following SQL, which involves a correlated scalar subquery, fails with a map lookup exception.
{code:java}
SELECT t1a
FROM   t1
WHERE  t1a = (SELECT   count(*)
              FROM     t2
              WHERE    t2c = t1c
              HAVING   count(*) >= 1)

{code}
{code:java}
 
key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426)
{code}

In this case, after statically evaluating the HAVING clause "count(*) >= 1" against the binding of the aggregation result on empty input, we determine that this query does not have the count bug. evalSubqueryOnZeroTups should simply return an empty value.


> Decorrelation of scalar subquery fails with java.util.NoSuchElementException.
> -
>
> Key: SPARK-23095
> URL: https://issues.apache.org/jira/browse/SPARK-23095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dilip Biswal
>Priority: Major
>
> The following SQL, which involves a correlated scalar subquery, fails with a map lookup exception.
> {code:java}
> SELECT t1a
> FROM   t1
> WHERE  t1a = (SELECT   count(*)
>               FROM     t2
>               WHERE    t2c = t1c
>               HAVING   count(*) >= 1)
> {code}
> {code:java}
>  
> key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) 
> java.util.NoSuchElementException: key not found: 
> ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)         at 
> scala.collection.MapLike$class.default(MapLike.scala:228)         at 
> scala.collection.AbstractMap.default(Map.scala:59)         at 
> scala.collection.MapLike$class.apply(MapLike.scala:141)         at 
> scala.collection.AbstractMap.apply(Map.scala:59)         at 
> org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)
>          at 
> org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun

[jira] [Updated] (SPARK-23095) Decorrelation of scalar subquery fails with java.util.NoSuchElementException.

2018-01-16 Thread Dilip Biswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-23095:
-
Description: 
The following SQL, which involves a correlated scalar subquery, fails with a map lookup exception.
{code:java}
SELECT t1a
FROM   t1
WHERE  t1a = (SELECT   count(*)
              FROM     t2
              WHERE    t2c = t1c
              HAVING   count(*) >= 1)

{code}
{code:java}
 
key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426)
{code}

In this case, after statically evaluating the HAVING clause "count(*) >= 1" against the binding of the aggregation result on empty input, we determine that this query does not have the count bug. evalSubqueryOnZeroTups should simply return an empty value.

  was:
The following SQL, which involves a correlated scalar subquery, fails with a map lookup exception.

 

SELECT t1a

FROM   t1

WHERE  t1a = (SELECT   count(*)

              FROM     t2

              WHERE    t2c = t1c

              HAVING   count(*) >= 1)

 

key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)

java.util.NoSuchElementException: key not found: 
ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)

        at scala.collection.MapLike$class.default(MapLike.scala:228)

        at scala.collection.AbstractMap.default(Map.scala:59)

        at scala.collection.MapLike$class.apply(MapLike.scala:141)

        at scala.collection.AbstractMap.apply(Map.scala:59)

        at 
org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)

        at 
org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430)

        at 
org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426)

 

In this case, after statically evaluating the HAVING clause "count(*) >= 1" against the binding of the aggregation result on empty input, we determine that this query does not have the count bug. evalSubqueryOnZeroTups should simply return an empty value.


> Decorrelation of scalar subquery fails with java.util.NoSuchElementException.
> -
>
> Key: SPARK-23095
> URL: https://issues.apache.org/jira/browse/SPARK-23095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dilip Biswal
>Priority: Major
>
> The following SQL, which involves a correlated scalar subquery, fails with a map lookup exception.
> {code:java}
> SELECT t1a
> FROM   t1
> WHERE  t1a = (SELECT   count(*)
>               FROM     t2
>               WHERE    t2c = t1c
>               HAVING   count(*) >= 1)
> {code}
> {code:java}
>  
> key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e) 
> java.util.NoSuchElementException: key not found: 
> ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)         at 
> scala.collection.MapLike$class.default(MapLike.scala:228)         at 
> scala.collection.AbstractMap.default(Map.scala:59)         at 
> scala.collection.MapLike$class.apply(MapLike.scala:141)         at 
> scala.collection.AbstractMap.apply(Map.scala:59)         at 
> org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)
>          at 
> org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun

[jira] [Created] (SPARK-23095) Decorrelation of scalar subquery fails with java.util.NoSuchElementException.

2018-01-16 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-23095:


 Summary: Decorrelation of scalar subquery fails with 
java.util.NoSuchElementException.
 Key: SPARK-23095
 URL: https://issues.apache.org/jira/browse/SPARK-23095
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Dilip Biswal


The following SQL, which involves a correlated scalar subquery, fails with a map lookup exception.

 

SELECT t1a
FROM   t1
WHERE  t1a = (SELECT   count(*)
              FROM     t2
              WHERE    t2c = t1c
              HAVING   count(*) >= 1)

key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
java.util.NoSuchElementException: key not found: ExprId(278,786682bb-41f9-4bd5-a397-928272cc8e4e)
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$.org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$evalSubqueryOnZeroTups(subquery.scala:378)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:430)
        at org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery$$anonfun$org$apache$spark$sql$catalyst$optimizer$RewriteCorrelatedScalarSubquery$$constructLeftJoins$1.apply(subquery.scala:426)

 

In this case, after statically evaluating the HAVING clause "count(*) >= 1" against the binding of the aggregation result on empty input, we determine that this query does not have the count bug. evalSubqueryOnZeroTups should simply return an empty value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Welcoming Tejas Patil as a Spark committer

2017-10-03 Thread Dilip Biswal
Congratulations, Tejas!
 
-- Dilip
 
 
- Original message -
From: Suresh Thalamati
To: "dev@spark.apache.org"
Cc:
Subject: Re: Welcoming Tejas Patil as a Spark committer
Date: Tue, Oct 3, 2017 12:01 PM

Congratulations, Tejas!
-suresh

> On Sep 29, 2017, at 12:58 PM, Matei Zaharia wrote:
>
> Hi all,
>
> The Spark PMC recently added Tejas Patil as a committer on the project. Tejas has been contributing across several areas of Spark for a while, focusing especially on scalability issues and SQL. Please join me in welcoming Tejas!
>
> Matei
 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Dilip Biswal
Congratulations, Jerry!
 
Regards,
Dilip Biswal
 
 
- Original message -
From: Takuya UESHIN
To: Saisai Shao
Cc: dev
Subject: Re: Welcoming Saisai (Jerry) Shao as a committer
Date: Mon, Aug 28, 2017 10:22 PM
Congratulations, Jerry!
 
 
On Tue, Aug 29, 2017 at 2:14 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:

Congratulations, Jerry
> On Aug 28, 2017, at 6:28 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
> Hi everyone,
>
> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai has been contributing to many areas of the project for a long time, so it's great to see him join. Join me in thanking and congratulating him!
>
> Matei

--
Takuya UESHIN
Tokyo, Japan
http://twitter.com/ueshin
 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-08 Thread Dilip Biswal
Congratulations, Hyukjin and Sameer!!
Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com
 
 
- Original message -
From: Liang-Chi Hsieh
To: dev@spark.apache.org
Cc:
Subject: Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers
Date: Tue, Aug 8, 2017 12:29 AM
Congrats to Hyukjin and Sameer!

Xiao Li wrote:
> Congrats!
> On Mon, 7 Aug 2017 at 10:21 PM Takuya UESHIN <ueshin@...> wrote:
>> Congrats!
>> On Tue, Aug 8, 2017 at 11:38 AM, Felix Cheung <felixcheung_m@...> wrote:
>>> Congrats!!
>>> From: Kevin Kim (Sangwoo) <kevin@...>
>>> Sent: Monday, August 7, 2017 7:30:01 PM
>>> To: Hyukjin Kwon; dev
>>> Cc: Bryan Cutler; Mridul Muralidharan; Matei Zaharia; Holden Karau
>>> Subject: Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers
>>> Thanks for all of your hard work, Hyukjin and Sameer. Congratulations!!
>>> On Tue, Aug 8, 2017 at 9:44 AM, Hyukjin Kwon <gurwls223@...> wrote:
>>>> Thank you all. Will do my best!
>>>> 2017-08-08 8:53 GMT+09:00 Holden Karau <holden@...>:
>>>>> Congrats!
>>>>> On Mon, Aug 7, 2017 at 3:54 PM Bryan Cutler <cutlerb@...> wrote:
>>>>>> Great work Hyukjin and Sameer!
>>>>>> On Mon, Aug 7, 2017 at 10:22 AM, Mridul Muralidharan <mridul@...> wrote:
>>>>>>> Congratulations Hyukjin, Sameer!
>>>>>>> Regards, Mridul
>>>>>>> On Mon, Aug 7, 2017 at 8:53 AM, Matei Zaharia <matei.zaharia@...> wrote:
>>>>>>>> Hi everyone,
>>>>>>>> The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as committers. Join me in congratulating both of them and thanking them for their contributions to the project!
>>>>>>>> Matei
>
> --
> Cell: 425-233-8271
> Twitter: https://twitter.com/holdenkarau
> --
> Takuya UESHIN
> Tokyo, Japan
> http://twitter.com/ueshin

--
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Welcoming-Hyukjin-Kwon-and-Sameer-Agarwal-as-committers-tp22092p22109.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Created] (SPARK-21599) Collecting column statistics for datasource tables may fail with java.util.NoSuchElementException

2017-08-01 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-21599:


 Summary: Collecting column statistics for datasource tables may 
fail with java.util.NoSuchElementException
 Key: SPARK-21599
 URL: https://issues.apache.org/jira/browse/SPARK-21599
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Dilip Biswal


Collecting column-level statistics for Hive-incompatible datasource tables using

{code}
ANALYZE TABLE  FOR COLUMNS 
{code}

may fail with the following exception.

{code}

key not found: a
java.util.NoSuchElementException: key not found: a
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:657)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1$$anonfun$apply$mcV$sp$3.apply(HiveExternalCatalog.scala:656)
at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply$mcV$sp(HiveExternalCatalog.scala:656)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$alterTableStats$1.apply(HiveExternalCatalog.scala:634)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.alterTableStats(HiveExternalCatalog.scala:634)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.alterTableStats(SessionCatalog.scala:375)
at 
org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:57)
{code}
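
A hypothetical repro sketch (table and column names are made up; whether a given table hits this depends on how its schema is stored in the metastore):

{code}
// Sketch only: a data source table whose schema Hive cannot represent directly,
// followed by column-level statistics collection, which then fails inside
// HiveExternalCatalog.alterTableStats with "key not found: a".
spark.sql("CREATE TABLE ds_tab (a INT, b STRUCT<c: INT>) USING json")
spark.sql("INSERT INTO ds_tab SELECT 1, named_struct('c', 2)")
spark.sql("ANALYZE TABLE ds_tab COMPUTE STATISTICS FOR COLUMNS a")
{code}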



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20417) Move error reporting for subquery from Analyzer to CheckAnalysis

2017-04-20 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977286#comment-15977286
 ] 

Dilip Biswal commented on SPARK-20417:
--

Currently waiting on [PR 17636|https://github.com/apache/spark/pull/17636] to be merged. After that, I will rebase and open a PR for this.

> Move error reporting for subquery from Analyzer to CheckAnalysis
> 
>
> Key: SPARK-20417
> URL: https://issues.apache.org/jira/browse/SPARK-20417
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dilip Biswal
>
> Currently we do a lot of validations for subqueries in the Analyzer. We should
> move them to CheckAnalysis, which is the framework to catch and report
> analysis errors. This was mentioned as a review comment in SPARK-18874.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20417) Move error reporting for subquery from Analyzer to CheckAnalysis

2017-04-20 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-20417:


 Summary: Move error reporting for subquery from Analyzer to 
CheckAnalysis
 Key: SPARK-20417
 URL: https://issues.apache.org/jira/browse/SPARK-20417
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Dilip Biswal


Currently we do a lot of validations for subqueries in the Analyzer. We should move them to CheckAnalysis, which is the framework to catch and report analysis errors. This was mentioned as a review comment in SPARK-18874.
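
For context, a minimal sketch of what a CheckAnalysis-hosted validation looks like (the predicate and message below are hypothetical placeholders; the real subquery checks being moved are considerably more involved):

{code}
// CheckAnalysis-style validation: walk the analyzed plan bottom-up and report
// problems through a failAnalysis callback, instead of raising them while the
// Analyzer is still resolving.
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

def containsSubquery(e: Expression): Boolean =
  e.find(_.isInstanceOf[SubqueryExpression]).isDefined

def checkSubqueries(plan: LogicalPlan, failAnalysis: String => Nothing): Unit =
  plan.foreachUp {
    case f: Filter if containsSubquery(f.condition) =>
      // placeholder rule: the actual checks validate correlation, aggregation,
      // and where in the plan the subquery expression appears
      failAnalysis(s"Unsupported subquery expression in filter condition:\n$f")
    case _ => // nothing to report for other operators
  }
{code}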



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20356) Spark sql group by returns incorrect results after join + distinct transformations

2017-04-18 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973929#comment-15973929
 ] 

Dilip Biswal commented on SPARK-20356:
--

[~viirya] Did you try from spark-shell or from one of our query suites? I could reproduce it from spark-shell fine. From our query suites, I had to force the number of shuffle partitions to reproduce it.
{code}
test("cache defect") {
withSQLConf("spark.sql.shuffle.partitions" -> "200") {
  val df1 = Seq(("a", 1), ("b", 1), ("c", 2)).toDF("item", "group")
  val df2 = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("item", "id")
  val df3 = df1.join(df2, Seq("item")).select($"id", 
$"group".as("item")).distinct()

  df3.explain(true)

  df3.unpersist()
  val agg_without_cache = df3.groupBy($"item").count()
  agg_without_cache.show()

  df3.cache()
  val agg_with_cache = df3.groupBy($"item").count()
  agg_with_cache.explain(true)
  agg_with_cache.show()
}
  }
{code}

> Spark sql group by returns incorrect results after join + distinct 
> transformations
> --
>
> Key: SPARK-20356
> URL: https://issues.apache.org/jira/browse/SPARK-20356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Linux mint 18
> Python 3.5
>Reporter: Chris Kipers
>
> I'm experiencing a bug with the head version of Spark as of 4/17/2017. After
> joining two dataframes, renaming a column and invoking distinct, the results
> of the aggregation are incorrect after caching the dataframe. The following
> code snippet consistently reproduces the error.
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> mapping_sdf = spark.createDataFrame(pd.DataFrame([
> {"ITEM": "a", "GROUP": 1},
> {"ITEM": "b", "GROUP": 1},
> {"ITEM": "c", "GROUP": 2}
> ]))
> items_sdf = spark.createDataFrame(pd.DataFrame([
> {"ITEM": "a", "ID": 1},
> {"ITEM": "b", "ID": 2},
> {"ITEM": "c", "ID": 3}
> ]))
> mapped_sdf = \
> items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
> sf.col("GROUP").alias('ITEM')).distinct()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 3, incorrect
> The next code snippet is almost the same as the first, except I don't call
> distinct on the dataframe. This snippet performs as expected:
> mapped_sdf = \
> items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
> sf.col("GROUP").alias('ITEM'))
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> I don't experience this bug with Spark 2.1 or even earlier versions of 2.2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20356) Spark sql group by returns incorrect results after join + distinct transformations

2017-04-18 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973455#comment-15973455
 ] 

Dilip Biswal edited comment on SPARK-20356 at 4/18/17 8:47 PM:
---

[~viirya] [~hvanhovell] [~cloud_fan] [~smilegator]
I took a quick look, and it seems the issue started happening after this [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information of InMemoryTableScanExec as part of the fix (id, item -> item, item), which causes a missing shuffle in the operators above InMemoryTableScan. Changing it to use the child's output partitioning, as before, fixes the issue.

I am a little new to this code :-) and this is what I have found so far. Hope this helps.
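
A simplified sketch of the observation above (not the actual InMemoryTableScanExec code): deferring to the child's partitioning never over-claims, so the exchange the aggregate needs stays in the plan.

{code}
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
import org.apache.spark.sql.execution.SparkPlan

// Pre-PR behaviour, sketched: if the cached data is hash-partitioned by (id, item),
// an aggregate that needs hashing by (item) will still get its shuffle from
// EnsureRequirements, because the scan does not claim to already satisfy it.
def conservativeOutputPartitioning(child: SparkPlan): Partitioning =
  child.outputPartitioning
{code}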



was (Author: dkbiswal):
[~viirya] [~hvanhovell] [~cloud_fan] [~smilegator]
I took a quick look, and it seems the issue started happening after this [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information as part of the fix (id, item -> item, item), which causes a missing shuffle in the operators above InMemoryTableScan. Changing it to use the child's output partitioning, as before, fixes the issue.

I am a little new to this code :-) and this is what I have found so far. Hope this helps.


> Spark sql group by returns incorrect results after join + distinct 
> transformations
> --
>
> Key: SPARK-20356
> URL: https://issues.apache.org/jira/browse/SPARK-20356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Linux mint 18
> Python 3.5
>Reporter: Chris Kipers
>
> I'm experiencing a bug with the head version of Spark as of 4/17/2017. After
> joining two dataframes, renaming a column and invoking distinct, the results
> of the aggregation are incorrect after caching the dataframe. The following
> code snippet consistently reproduces the error.
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> mapping_sdf = spark.createDataFrame(pd.DataFrame([
> {"ITEM": "a", "GROUP": 1},
> {"ITEM": "b", "GROUP": 1},
> {"ITEM": "c", "GROUP": 2}
> ]))
> items_sdf = spark.createDataFrame(pd.DataFrame([
> {"ITEM": "a", "ID": 1},
> {"ITEM": "b", "ID": 2},
> {"ITEM": "c", "ID": 3}
> ]))
> mapped_sdf = \
> items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
> sf.col("GROUP").alias('ITEM')).distinct()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 3, incorrect
> The next code snippet is almost the same as the first, except I don't call
> distinct on the dataframe. This snippet performs as expected:
> mapped_sdf = \
> items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
> sf.col("GROUP").alias('ITEM'))
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> I don't experience this bug with Spark 2.1 or even earlier versions of 2.2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20356) Spark sql group by returns incorrect results after join + distinct transformations

2017-04-18 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973455#comment-15973455
 ] 

Dilip Biswal edited comment on SPARK-20356 at 4/18/17 8:46 PM:
---

[~viirya] [~hvanhovell] [~cloud_fan] [~smilegator]
I took a quick look, and it seems the issue started happening after this [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information as part of the fix (id, item -> item, item), which causes a missing shuffle in the operators above InMemoryTableScan. Changing it to use the child's output partitioning, as before, fixes the issue.

I am a little new to this code :-) and this is what I have found so far. Hope this helps.



was (Author: dkbiswal):
[~viirya] [~hvanhovell] [~cloud_fan] [~smilegator]
I took a quick look, and it seems the issue started happening after [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information as part of the fix (id, item -> item, item), which causes a missing shuffle in the operators above InMemoryTableScan. Changing it to use the child's output partitioning, as before, fixes the issue.

I am a little new to this code :-) and this is what I have found so far. Hope this helps.


> Spark sql group by returns incorrect results after join + distinct 
> transformations
> --
>
> Key: SPARK-20356
> URL: https://issues.apache.org/jira/browse/SPARK-20356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Linux mint 18
> Python 3.5
>Reporter: Chris Kipers
>
> I'm experiencing a bug with the head version of Spark as of 4/17/2017. After
> joining two dataframes, renaming a column and invoking distinct, the results
> of the aggregation are incorrect after caching the dataframe. The following
> code snippet consistently reproduces the error.
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> mapping_sdf = spark.createDataFrame(pd.DataFrame([
> {"ITEM": "a", "GROUP": 1},
> {"ITEM": "b", "GROUP": 1},
> {"ITEM": "c", "GROUP": 2}
> ]))
> items_sdf = spark.createDataFrame(pd.DataFrame([
> {"ITEM": "a", "ID": 1},
> {"ITEM": "b", "ID": 2},
> {"ITEM": "c", "ID": 3}
> ]))
> mapped_sdf = \
> items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
> sf.col("GROUP").alias('ITEM')).distinct()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 3, incorrect
> The next code snippet is almost the same as the first, except I don't call
> distinct on the dataframe. This snippet performs as expected:
> mapped_sdf = \
> items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
> sf.col("GROUP").alias('ITEM'))
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> I don't experience this bug with Spark 2.1 or even earlier versions of 2.2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20356) Spark sql group by returns incorrect results after join + distinct transformations

2017-04-18 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973455#comment-15973455
 ] 

Dilip Biswal commented on SPARK-20356:
--

[~viirya] [~hvanhovell] [~cloud_fan] [~smilegator]
I took a quick look, and it seems the issue started happening after [pr|https://github.com/apache/spark/pull/17175]. We are changing the output partitioning information as part of the fix (id, item -> item, item), which causes a missing shuffle in the operators above InMemoryTableScan. Changing it to use the child's output partitioning, as before, fixes the issue.

I am a little new to this code :-) and this is what I have found so far. Hope this helps.


> Spark sql group by returns incorrect results after join + distinct 
> transformations
> --
>
> Key: SPARK-20356
> URL: https://issues.apache.org/jira/browse/SPARK-20356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Linux mint 18
> Python 3.5
>Reporter: Chris Kipers
>
> I'm experiencing a bug with the head version of Spark as of 4/17/2017. After
> joining two dataframes, renaming a column and invoking distinct, the results
> of the aggregation are incorrect after caching the dataframe. The following
> code snippet consistently reproduces the error.
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> mapping_sdf = spark.createDataFrame(pd.DataFrame([
> {"ITEM": "a", "GROUP": 1},
> {"ITEM": "b", "GROUP": 1},
> {"ITEM": "c", "GROUP": 2}
> ]))
> items_sdf = spark.createDataFrame(pd.DataFrame([
> {"ITEM": "a", "ID": 1},
> {"ITEM": "b", "ID": 2},
> {"ITEM": "c", "ID": 3}
> ]))
> mapped_sdf = \
> items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
> sf.col("GROUP").alias('ITEM')).distinct()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 3, incorrect
> The next code snippet is almost the same as the first, except I don't call
> distinct on the dataframe. This snippet performs as expected:
> mapped_sdf = \
> items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
> sf.col("GROUP").alias('ITEM'))
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> mapped_sdf.cache()
> print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
> I don't experience this bug with Spark 2.1 or even earlier versions of 2.2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references

2017-04-13 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-20334:


 Summary: Return a better error message when correlated predicates 
contain aggregate expression that has mixture of outer and local references
 Key: SPARK-20334
 URL: https://issues.apache.org/jira/browse/SPARK-20334
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.0
Reporter: Dilip Biswal
Priority: Minor


Currently, subqueries with correlated predicates containing an aggregate expression that has a mixture of outer references and local references generate a codegen error like the following:

{code:java}
Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
at 
org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443)
at 
org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
{code}

We should catch this situation and return a better error message to the user.
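
A hypothetical example of the pattern (table and column names are made up): the aggregate inside the correlated predicate mixes an outer reference (t1.c2) with a local one (t2.c2), which today surfaces as the codegen failure above rather than a clear analysis-time message.

{code}
spark.sql("""
  SELECT *
  FROM   t1
  WHERE  t1.c1 IN (SELECT t2.c1
                   FROM   t2
                   GROUP  BY t2.c1
                   HAVING min(t1.c2 + t2.c2) > 0)""").show()
{code}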



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19993) Caching logical plans containing subquery expressions does not work.

2017-03-16 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-19993:


 Summary: Caching logical plans containing subquery expressions 
does not work.
 Key: SPARK-19993
 URL: https://issues.apache.org/jira/browse/SPARK-19993
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.0
Reporter: Dilip Biswal


Here is a simple repro that depicts the problem. In this case, the second invocation of the SQL should have been served from the cache; however, the cache lookup currently fails.

{code}
scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from 
s2 where s1.c1 = s2.c1)")
ds: org.apache.spark.sql.DataFrame = [c1: int]

scala> ds.cache
res13: ds.type = [c1: int]

scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where 
s1.c1 = s2.c1)").explain(true)
== Analyzed Logical Plan ==
c1: int
Project [c1#86]
+- Filter c1#86 IN (list#78 [c1#86])
   :  +- Project [c1#87]
   : +- Filter (outer(c1#86) = c1#87)
   :+- SubqueryAlias s2
   :   +- Relation[c1#87] parquet
   +- SubqueryAlias s1
  +- Relation[c1#86] parquet

== Optimized Logical Plan ==
Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87))
:- Relation[c1#86] parquet
+- Relation[c1#87] parquet
{code}
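
A small follow-on sketch (same repro) showing one way to check whether the second query was actually served from the cache:

{code}
// If the cache lookup had succeeded, the optimized plan of the second query
// would contain an InMemoryRelation; with this bug it does not.
import org.apache.spark.sql.execution.columnar.InMemoryRelation

val q = spark.sql(
  "select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
val servedFromCache = q.queryExecution.optimizedPlan
  .collectFirst { case r: InMemoryRelation => r }
  .isDefined
println(s"served from cache: $servedFromCache")  // prints false here; expected true
{code}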



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Dilip Biswal
Congratulations, Takuya!
 
Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com
 
 
- Original message -
From: Takeshi Yamamuro
To: dev
Cc:
Subject: Re: welcoming Takuya Ueshin as a new Apache Spark committer
Date: Mon, Feb 13, 2017 2:14 PM
congrats!
 
 
On Tue, Feb 14, 2017 at 6:05 AM, Sam Elamin  wrote:

Congrats Takuya-san! Clearly well deserved! Well done :) 
 
On Mon, Feb 13, 2017 at 9:02 PM, Maciej Szymkiewicz  wrote:

Congratulations!
On 02/13/2017 08:16 PM, Reynold Xin wrote:
> Hi all,
>
> Takuya-san has recently been elected an Apache Spark committer. He's been active in the SQL area and writes very small, surgical patches that are high quality. Please join me in congratulating Takuya-san!

--
---
Takeshi Yamamuro
 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[jira] [Created] (SPARK-18533) Raise correct error upon specification of schema for datasource tables created through CTAS

2016-11-21 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-18533:


 Summary: Raise correct error upon specification of schema for 
datasource tables created through CTAS
 Key: SPARK-18533
 URL: https://issues.apache.org/jira/browse/SPARK-18533
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Dilip Biswal
Priority: Minor


Currently, Hive serde tables created through CTAS do not allow explicit specification of a schema, since the schema is inferred from the SELECT clause, and a semantic error is raised for this case. However, for data source tables we currently raise a parser error, which is not as informative. We should raise a consistent error for both forms of tables.
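
Hypothetical examples of the two forms (table names made up); both should fail with the same kind of informative analysis error:

{code}
// Hive serde CTAS with an explicit schema: currently a clear semantic error.
spark.sql("CREATE TABLE hive_t (c1 INT) STORED AS PARQUET AS SELECT 1 AS c1")

// Data source CTAS with an explicit schema: currently a less informative parser error.
spark.sql("CREATE TABLE ds_t (c1 INT) USING parquet AS SELECT 1 AS c1")
{code}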




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609295#comment-15609295
 ] 

Dilip Biswal commented on SPARK-18009:
--

[~martha.solarte] Not sure.
[~smilegator] Sean, do we backport to 2.0.0 anymore?

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift server on YARN, I tried to execute the
> following command from beeline.
> > show databases;
> I've got this error message. 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
&g

[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607343#comment-15607343
 ] 

Dilip Biswal commented on SPARK-18009:
--

[~smilegator][~jerryjung] [~martha.solarte] Thanks. I am testing a fix and 
should submit a PR for this soon.

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift server on YARN, I tried to execute the
> following command from beeline.
> > show databases;
> I've got this error message. 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorA

[jira] [Created] (SPARK-17860) SHOW COLUMN's database conflict check should use case sensitive compare.

2016-10-10 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-17860:


 Summary: SHOW COLUMN's database conflict check should use case 
sensitive compare.
 Key: SPARK-17860
 URL: https://issues.apache.org/jira/browse/SPARK-17860
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Dilip Biswal
Priority: Minor


The SHOW COLUMNS command validates the user-supplied database name against the
database name from the qualified table name to make sure the two are consistent.
This comparison should respect the case-sensitivity setting.
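
A hypothetical illustration (database and table names made up):

{code}
// With spark.sql.caseSensitive=false, "testdb" and "TESTDB" should be treated as
// the same database; with it set to true, the mismatch should be rejected.
spark.conf.set("spark.sql.caseSensitive", "false")
spark.sql("SHOW COLUMNS IN testdb.tbl IN TESTDB")
{code}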



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17860) SHOW COLUMN's database conflict check should respect case sensitivity setting

2016-10-10 Thread Dilip Biswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-17860:
-
Summary: SHOW COLUMN's database conflict check should respect case 
sensitivity setting  (was: SHOW COLUMN's database conflict check should use 
case sensitive compare.)

> SHOW COLUMN's database conflict check should respect case sensitivity setting
> -
>
> Key: SPARK-17860
> URL: https://issues.apache.org/jira/browse/SPARK-17860
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> The SHOW COLUMNS command validates the user-supplied database name against the
> database name from the qualified table name to make sure the two are consistent.
> This comparison should respect the case-sensitivity setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: welcoming Xiao Li as a committer

2016-10-03 Thread Dilip Biswal
Hi Xiao,

Congratulations, Xiao!! This is indeed very well deserved!!

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:   Reynold Xin 
To: "dev@spark.apache.org" , Xiao Li 

Date:   10/03/2016 10:47 PM
Subject:welcoming Xiao Li as a committer



Hi all,

Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark 
committer. Xiao has been a super active contributor to Spark SQL. Congrats 
and welcome, Xiao!

- Reynold





[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-03 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543141#comment-15543141
 ] 

Dilip Biswal commented on SPARK-17709:
--

@ashrowty Hi Ashish, in your example the column loyaltycardnumber is not in the output set, and that is why we see the exception. I tried using productid instead and got the correct result.

{code}
scala> df1.join(df2, Seq("companyid","loyaltycardnumber"));
org.apache.spark.sql.AnalysisException: using columns 
['companyid,'loyaltycardnumber] can not be resolved given input columns: 
[productid, companyid, avgprice, avgitemcount, companyid, productid] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2651)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:679)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:652)
  ... 48 elided

scala> df1.join(df2, Seq("companyid","productid"));
res1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 2 
more fields]

scala> df1.join(df2, Seq("companyid","productid")).show
+---------+---------+--------+------------+
|companyid|productid|avgprice|avgitemcount|
+---------+---------+--------+------------+
|      101|        3|    13.0|        12.0|
|      100|        1|    10.0|        10.0|
+---------+---------+--------+------------+
{code}

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-03 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543046#comment-15543046
 ] 

Dilip Biswal commented on SPARK-17709:
--

Hi Ashish, thanks a lot. I will try it and get back to you.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537336#comment-15537336
 ] 

Dilip Biswal commented on SPARK-17709:
--

[~ashrowty] Hmm.. and your join keys are companyid or loyalitycardnumber, or 
both? If so, I have the exact same scenario but am not seeing the error you are 
seeing.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537205#comment-15537205
 ] 

Dilip Biswal edited comment on SPARK-17709 at 9/30/16 10:07 PM:


@ashrowty Hi Ashish, is it possible for you to post the explain output for both 
legs of the join? So if we are joining two dataframes df1 and df2, can we get 
the output of
df1.explain(true)
df2.explain(true)

From the error, it seems like key1 and key2 are not present in the output 
attribute set of one leg of the join.

So if I were to change your test program to the following (note that df2 is 
aggregated without the groupBy keys):

{code}
val df1 = d1.groupBy("key1", "key2")
  .agg(avg("totalprice").as("avgtotalprice"))
df1.explain(true)

val df2 = d1.agg(avg("itemcount").as("avgqty"))
df2.explain(true)

df1.join(df2, Seq("key1", "key2"))
{code}
I am able to see the same error you are seeing.


was (Author: dkbiswal):
@ashrowty Hi Ashish, is it possible for you to post explain output for both the 
legs of the join. 
So if we are joining two dataframes df1 and df2 , can we get the output of
df1.explain(true)
df2.explain(true)

From the error, it seems like key1 and key2 are not present in one leg of join 
output attribute set.

So if i were to change your test program to the following :

 val df1 = d1.groupBy("key1", "key2")
  .agg(avg("totalprice").as("avgtotalprice"))
  df1.explain(true)
  val df2 = d1.agg(avg("itemcount").as("avgqty"))
  df2.explain(true)
df1.join(df2, Seq("key1", "key2"))

I am able to see the same error you are seeing.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537205#comment-15537205
 ] 

Dilip Biswal commented on SPARK-17709:
--

@ashrowty Hi Ashish, is it possible for you to post explain output for both the 
legs of the join. 
So if we are joining two dataframes df1 and df2 , can we get the output of
df1.explain(true)
df2.explain(true)

From the error, it seems like key1 and key2 are not present in one leg of join 
output attribute set.

So if i were to change your test program to the following :

 val df1 = d1.groupBy("key1", "key2")
  .agg(avg("totalprice").as("avgtotalprice"))
  df1.explain(true)
  val df2 = d1.agg(avg("itemcount").as("avgqty"))
  df2.explain(true)
df1.join(df2, Seq("key1", "key2"))

I am able to see the same error you are seeing.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536464#comment-15536464
 ] 

Dilip Biswal commented on SPARK-17709:
--

[~ashrowty] Ashish, you have the same column name as both a regular and a 
partitioning column? I thought Hive didn't allow that?

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-29 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534434#comment-15534434
 ] 

Dilip Biswal commented on SPARK-17709:
--

[~smilegator] Sure.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-29 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534417#comment-15534417
 ] 

Dilip Biswal commented on SPARK-17709:
--

[~smilegator] Hi Sean, I tried it on my master branch and don't see the 
exception.

{code}
test("join issue") {
   withTable("tbl") {
 sql("CREATE TABLE tbl(key1 int, key2 int, totalprice int, itemcount int)")
 sql("insert into tbl values (1, 1, 1, 1)")
 val d1 = sql("select * from tbl")
 val df1 = d1.groupBy("key1","key2")
   .agg(avg("totalprice").as("avgtotalprice"))
 val df2 = d1.groupBy("key1","key2")
   .agg(avg("itemcount").as("avgqty"))
 df1.join(df2, Seq("key1","key2")).show()
   }
 }

Output

+----+----+-------------+------+
|key1|key2|avgtotalprice|avgqty|
+----+----+-------------+------+
|   1|   1|          1.0|   1.0|
+----+----+-------------+------+
{code}



> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17620) hive.default.fileformat=orc does not set OrcSerde

2016-09-21 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511267#comment-15511267
 ] 

Dilip Biswal commented on SPARK-17620:
--

fix it now. Thanks!

> hive.default.fileformat=orc does not set OrcSerde
> -
>
> Key: SPARK-17620
> URL: https://issues.apache.org/jira/browse/SPARK-17620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Brian Cho
>Priority: Minor
>
> Setting {{hive.default.fileformat=orc}} does not set OrcSerde. This behavior 
> is inconsistent with {{STORED AS ORC}}. This means we cannot set a default 
> behavior for creating tables using orc.
> The behavior using stored as:
> {noformat}
> scala> spark.sql("CREATE TABLE tmp_stored_as(id INT) STORED AS ORC")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_stored_as").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}
> Behavior setting default conf (SerDe Library is not set properly):
> {noformat}
> scala> spark.sql("SET hive.default.fileformat=orc")
> res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("CREATE TABLE tmp_default(id INT)")
> res3: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
> ...
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16423) Inconsistent settings on the first day of a week

2016-07-09 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369006#comment-15369006
 ] 

Dilip Biswal commented on SPARK-16423:
--

[~yhuai] [~smilegator]

Just wanted to quickly share the information I have found so far.

- WeekOfYear
  - Checked the MySQL and Hive behaviour. Spark is consistent with MySQL and
    Hive: they both assume the first day of the week to be Monday and require
    more than 3 days in the first week of the year.
    http://www.techonthenet.com/mysql/functions/weekofyear.php
  - Postgres - week
    The number of the ISO 8601 week-numbering week of the year. By definition,
    ISO weeks start on Mondays and the first week of a year contains January 4
    of that year. In other words, the first Thursday of a year is in week 1 of
    that year.
    https://www.postgresql.org/docs/current/static/functions-datetime.html
    (function: week)
  - SQL Server
    There is a session setting to influence the first day of the week:
    SET DATEFIRST { number | @number_var }
    When this is set, the DATEPART function considers it while calculating the
    day of week. When it is not set, SQL Server also follows ISO, which again
    assumes Monday to be the start of the week.
    https://msdn.microsoft.com/en-us/library/ms186724.aspx
  - Oracle
    In Oracle, the day of the week is controlled by the session-specific
    NLS_TERRITORY setting.
    https://community.oracle.com/thread/2207756?tstart=0
  - DB2
    Has two flavors of the WEEK function: one for ISO (Monday start) and
    another for non-ISO (Sunday start).
    http://www.ibm.com/developerworks/data/library/techarticle/0211yip/0211yip3.html

Given this, it seems like most systems follow Monday-as-first-day-of-week
semantics, and I am wondering if we should change this.

Also, is there a correlation between fromUnixTime and WeekOfYear? fromUnixTime
returns the user-supplied time in seconds as a string after applying the date
format. In my understanding it respects the system locale settings.
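
To make the current Spark behaviour concrete, here is a small sketch (assuming a 
Spark 2.x spark-shell where {{spark}} is the SparkSession; the exact output of the 
second query depends on the JVM locale and timezone):

{code}
// weekofyear hard-codes Monday as the first day of the week (ISO style):
// 2016-01-04 is a Monday and falls in ISO week 1, so this returns 1.
spark.sql("SELECT weekofyear('2016-01-04') AS wk").show()

// from_unixtime only formats the epoch seconds with the supplied pattern; the
// rendered day-of-week name comes from the JVM locale, not from a Spark setting.
spark.sql("SELECT from_unixtime(0, 'yyyy-MM-dd EEEE') AS d").show()
{code}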

> Inconsistent settings on the first day of a week
> 
>
> Key: SPARK-16423
> URL: https://issues.apache.org/jira/browse/SPARK-16423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> For the function {{WeekOfYear}}, we explicitly set the first day of the week 
> to {{Calendar.MONDAY}}. However, {{FromUnixTime}} does not explicitly set it. 
> So, we are using the default first day of the week based on the locale 
> setting (see 
> https://docs.oracle.com/javase/8/docs/api/java/util/Calendar.html#setFirstDayOfWeek-int-).
>  
> Let's do a survey on what other databases do and make the setting consistent. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16195) Allow users to specify empty over clause in window expressions through dataset API

2016-06-24 Thread Dilip Biswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-16195:
-
Description: 
In SQL, it is allowed to specify an empty OVER clause in the window expression.

{code}
select area, sum(product) over () as c from windowData
where product > 3 group by area, product
having avg(month) > 0 order by avg(month), product
{code}

In this case the analytic function sum is computed over all the rows of the 
result set.

Currently this is not allowed through the Dataset API.
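
For illustration, a minimal sketch of the Dataset API call this issue proposes to 
allow (the DataFrame {{windowData}} and its columns are assumed from the SQL above; 
this is not the final API):

{code}
import org.apache.spark.sql.functions._

// An empty over() - no partitioning or ordering - mirroring "sum(product) over ()"
// in the SQL above, so the sum is computed over all rows of the result set.
windowData.select(col("area"), sum(col("product")).over().as("c"))
{code}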


  was:
In SQL, its allowed to specify an empty OVER clause in the window expression.
 
select area, sum(product) over () as c from windowData
where product > 3 group by area, product
having avg(month) > 0 order by avg(month), product

In this case the analytic function sum is presented based on all the rows of 
the result set

Currently its not allowed through dataset API.



> Allow users to specify empty over clause in window expressions through 
> dataset API
> --
>
> Key: SPARK-16195
> URL: https://issues.apache.org/jira/browse/SPARK-16195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> In SQL, its allowed to specify an empty OVER clause in the window expression.
> {code}
> select area, sum(product) over () as c from windowData
> where product > 3 group by area, product
> having avg(month) > 0 order by avg(month), product
> {code}
> In this case the analytic function sum is presented based on all the rows of 
> the result set
> Currently its not allowed through dataset API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16195) Allow users to specify empty over clause in window expressions through dataset API

2016-06-24 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-16195:


 Summary: Allow users to specify empty over clause in window 
expressions through dataset API
 Key: SPARK-16195
 URL: https://issues.apache.org/jira/browse/SPARK-16195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Dilip Biswal
Priority: Minor


In SQL, its allowed to specify an empty OVER clause in the window expression.
 
select area, sum(product) over () as c from windowData
where product > 3 group by area, product
having avg(month) > 0 order by avg(month), product

In this case the analytic function sum is presented based on all the rows of 
the result set

Currently its not allowed through dataset API.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15634) SQL repl is bricked if a function is registered with a non-existent jar

2016-05-27 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305143#comment-15305143
 ] 

Dilip Biswal commented on SPARK-15634:
--

I would like to work on this issue.

> SQL repl is bricked if a function is registered with a non-existent jar
> ---
>
> Key: SPARK-15634
> URL: https://issues.apache.org/jira/browse/SPARK-15634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Eric Liang
>
> After attempting to register a function using a non-existent jar, no further 
> SQL commands succeed (and you also cannot un-register the function).
> {code}
> build/sbt -Phive sparkShell
> {code}
> {code}
> scala> sql("""CREATE TEMPORARY FUNCTION x AS "com.example.functions.Function" 
> USING JAR "file:///path/to/example.jar"""")
> 16/05/27 14:53:49 ERROR SessionState: file:///path/to/example.jar does not 
> exist
> java.lang.IllegalArgumentException: file:///path/to/example.jar does not exist
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.validateFiles(SessionState.java:998)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState$ResourceType.preHook(SessionState.java:1102)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState$ResourceType$1.preHook(SessionState.java:1091)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1191)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.add_resources(SessionState.java:1149)
>   at 
> org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:67)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:564)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:533)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:260)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:207)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:206)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:249)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:533)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:523)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.addJar(HiveClientImpl.scala:668)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.addJar(HiveSessionState.scala:109)
>   at 
> org.apache.spark.sql.internal.SessionState$$anon$2.loadResource(SessionState.scala:80)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$loadFunctionResources$1.apply(SessionCatalog.scala:734)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadFunctionResources(SessionCatalog.scala:734)
>   at 
> org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:59)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryEx

[jira] [Comment Edited] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null

2016-05-27 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304782#comment-15304782
 ] 

Dilip Biswal edited comment on SPARK-15557 at 5/27/16 9:09 PM:
---

I am looking into this issue. I am testing a fix currently.


was (Author: dkbiswal):
I am looking into this issue.

> expression ((cast(99 as decimal) + '3') * '2.3' ) return null
> -
>
> Key: SPARK-15557
> URL: https://issues.apache.org/jira/browse/SPARK-15557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: spark1.6.1
>Reporter: cen yuhai
>
> expression "select  (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null
> expression "select  (cast(40 as decimal(19,6))+ '3')*'2.3' "  is OK
> I find that maybe it will be null if the result is more than 100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15557) expression ((cast(99 as decimal) + '3') * '2.3' ) return null

2016-05-27 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304782#comment-15304782
 ] 

Dilip Biswal commented on SPARK-15557:
--

I am looking into this issue.

> expression ((cast(99 as decimal) + '3') * '2.3' ) return null
> -
>
> Key: SPARK-15557
> URL: https://issues.apache.org/jira/browse/SPARK-15557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: spark1.6.1
>Reporter: cen yuhai
>
> expression "select  (cast(99 as decimal(19,6))+ '3')*'2.3' " will return null
> expression "select  (cast(40 as decimal(19,6))+ '3')*'2.3' "  is OK
> I find that maybe it will be null if the result is more than 100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose

2016-05-10 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279229#comment-15279229
 ] 

Dilip Biswal commented on SPARK-15114:
--

Going to submit a PR for this tonight.

> Column name generated by typed aggregate is super verbose
> -
>
> Key: SPARK-15114
> URL: https://issues.apache.org/jira/browse/SPARK-15114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> {code}
> case class Person(name: String, email: String, age: Long)
> val ds = spark.read.json("/tmp/person.json").as[Person]
> import org.apache.spark.sql.expressions.scala.typed._
> ds.groupByKey(_ => 0).agg(sum(_.age))
> // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int, 
> typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L, 
> email#1, name#2), upcast(value)): double]
> ds.groupByKey(_ => 0).agg(sum(_.age)).explain
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[value#84], 
> functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)],
>  output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class 
> $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91])
> : +- INPUT
> +- Exchange hashpartitioning(value#84, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[value#84], 
> functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)],
>  output=[value#84,value#97])
>   : +- INPUT
>   +- AppendColumns , newInstance(class 
> $line15.$read$$iw$$iw$Person), [input[0, int] AS value#84]
>  +- WholeStageCodegen
> :  +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON, 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose

2016-05-04 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271972#comment-15271972
 ] 

Dilip Biswal commented on SPARK-15114:
--

[~yhuai] Sure Yin. I will give it a try.

> Column name generated by typed aggregate is super verbose
> -
>
> Key: SPARK-15114
> URL: https://issues.apache.org/jira/browse/SPARK-15114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> {code}
> case class Person(name: String, email: String, age: Long)
> val ds = spark.read.json("/tmp/person.json").as[Person]
> import org.apache.spark.sql.expressions.scala.typed._
> ds.groupByKey(_ => 0).agg(sum(_.age))
> // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int, 
> typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L, 
> email#1, name#2), upcast(value)): double]
> ds.groupByKey(_ => 0).agg(sum(_.age)).explain
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[value#84], 
> functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)],
>  output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class 
> $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91])
> : +- INPUT
> +- Exchange hashpartitioning(value#84, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[value#84], 
> functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)],
>  output=[value#84,value#97])
>   : +- INPUT
>   +- AppendColumns , newInstance(class 
> $line15.$read$$iw$$iw$Person), [input[0, int] AS value#84]
>  +- WholeStageCodegen
> :  +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON, 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose

2016-05-04 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271351#comment-15271351
 ] 

Dilip Biswal commented on SPARK-15114:
--

[~yhuai] 
Currently we use the SQL representation of the expression as the system 
generated alias name, like the following:

case expr: Expression => Alias(expr, usePrettyExpression(expr).sql)()

Should we add an additional case for AggregateExpression and use toString() 
instead of sql() to get a shorter name? For example:

case aggExpr: AggregateExpression => Alias(aggExpr, usePrettyExpression(aggExpr).toString)()

> Column name generated by typed aggregate is super verbose
> -
>
> Key: SPARK-15114
> URL: https://issues.apache.org/jira/browse/SPARK-15114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> {code}
> case class Person(name: String, email: String, age: Long)
> val ds = spark.read.json("/tmp/person.json").as[Person]
> import org.apache.spark.sql.expressions.scala.typed._
> ds.groupByKey(_ => 0).agg(sum(_.age))
> // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int, 
> typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L, 
> email#1, name#2), upcast(value)): double]
> ds.groupByKey(_ => 0).agg(sum(_.age)).explain
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[value#84], 
> functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)],
>  output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class 
> $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91])
> : +- INPUT
> +- Exchange hashpartitioning(value#84, 200), None
>+- WholeStageCodegen
>   :  +- TungstenAggregate(key=[value#84], 
> functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)],
>  output=[value#84,value#97])
>   : +- INPUT
>   +- AppendColumns , newInstance(class 
> $line15.$read$$iw$$iw$Person), [input[0, int] AS value#84]
>  +- WholeStageCodegen
> :  +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON, 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14947) Showtable Command - Can't List Tables Using JDBC Connector

2016-04-27 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259692#comment-15259692
 ] 

Dilip Biswal commented on SPARK-14947:
--

[~raymond.honderd...@sizmek.com] Hi Raymond, could you please share a little 
more detail on how to reproduce this? Do we see an error, or just empty output? 
Also, what is the SQL that MicroStrategy issues to Spark?

> Showtable Command - Can't List Tables Using JDBC Connector
> --
>
> Key: SPARK-14947
> URL: https://issues.apache.org/jira/browse/SPARK-14947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Raymond Honderdors
>Priority: Minor
> Fix For: 2.0.0
>
>
> Showtable Command does not list tables using external tool like 
> (microstrategy)
> it does work in beeline
> between the master and 1.6 branch there is a difference in the command file, 
> 1 it was relocated 2 the content changed.
> when i compiled the master branch with the "old" version of the code JDBC 
> functionality was restored



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13698) Fix Analysis Exceptions when Using Backticks in Generate

2016-04-18 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246796#comment-15246796
 ] 

Dilip Biswal commented on SPARK-13698:
--

[~cloud_fan] Hi Wenchen, Can you please help to fix the assignee field for this 
JIRA ? Thanks in advance !!

> Fix Analysis Exceptions when Using Backticks in Generate
> 
>
> Key: SPARK-13698
> URL: https://issues.apache.org/jira/browse/SPARK-13698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>    Reporter: Dilip Biswal
>
> Analysis exception occurs while running the following query.
> {code}
> SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
> {code}
> {code}
> Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot 
> resolve '`ints`' given input columns: [a, `ints`]; line 1 pos 7
> 'Project ['ints]
> +- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
>+- SubqueryAlias nestedarray
>   +- LocalRelation [a#0], 1,2,3
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14445) Support native execution of SHOW COLUMNS and SHOW PARTITIONS command

2016-04-06 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-14445:


 Summary: Support native execution of SHOW COLUMNS  and SHOW 
PARTITIONS command
 Key: SPARK-14445
 URL: https://issues.apache.org/jira/browse/SPARK-14445
 Project: Spark
  Issue Type: Improvement
Reporter: Dilip Biswal


1. Support native execution of SHOW COLUMNS
2. Support native execution of SHOW PARTITIONS

The syntax of the SHOW commands is described in the following link.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Show
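
For quick reference, the command shapes involved are (per the Hive LanguageManual 
linked above; the exact options Spark will support are the subject of this issue):

{code}
SHOW COLUMNS (FROM | IN) table_name [(FROM | IN) db_name];
SHOW PARTITIONS table_name [PARTITION(partition_spec)];
{code}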



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14121) Show commands (Native)

2016-04-06 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228932#comment-15228932
 ] 

Dilip Biswal commented on SPARK-14121:
--

I will submit a pull request for SHOW COLUMNS and SHOW PARTITIONS today. Thanks 
!!

> Show commands (Native)
> --
>
> Key: SPARK-14121
> URL: https://issues.apache.org/jira/browse/SPARK-14121
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> For the following tokens, we should have native implementations.
> -TOK_SHOWDATABASES (Native)-
> -TOK_SHOWTABLES (Native)-
> -TOK_SHOW_TBLPROPERTIES (Native)-
> TOK_SHOWCOLUMNS (Native)
> TOK_SHOWPARTITIONS (Native)
> TOK_SHOW_TABLESTATUS (Native)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14348) Support native execution of SHOW TBLPROPERTIES command

2016-04-02 Thread Dilip Biswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-14348:
-
Summary: Support native execution of SHOW TBLPROPERTIES command  (was: 
Support native execution of SHOW DATABASE command)

> Support native execution of SHOW TBLPROPERTIES command
> --
>
> Key: SPARK-14348
> URL: https://issues.apache.org/jira/browse/SPARK-14348
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Dilip Biswal
>
> 1. Support parsing of SHOW TBLPROPERTIES command
> 2. Support the native execution of SHOW TBLPROPERTIES command
> The syntax for SHOW commands are described in following link:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14348) Support native execution of SHOW DATABASE command

2016-04-02 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-14348:


 Summary: Support native execution of SHOW DATABASE command
 Key: SPARK-14348
 URL: https://issues.apache.org/jira/browse/SPARK-14348
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Dilip Biswal


1. Support parsing of SHOW TBLPROPERTIES command
2. Support the native execution of SHOW TBLPROPERTIES command

The syntax for the SHOW commands is described in the following link:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14121) Show commands (Native)

2016-03-27 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213743#comment-15213743
 ] 

Dilip Biswal commented on SPARK-14121:
--


Just submitted a PR for SHOW TABLES and SHOW DATABASES. Will look into the rest 
of the commands.

Regards,
-- Dilip

> Show commands (Native)
> --
>
> Key: SPARK-14121
> URL: https://issues.apache.org/jira/browse/SPARK-14121
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> For the following tokens, we should have native implementations.
> TOK_SHOWDATABASES (Native)
> TOK_SHOWTABLES (Native)
> TOK_SHOW_TBLPROPERTIES (Native)
> TOK_SHOWCOLUMNS (Native)
> TOK_SHOWPARTITIONS (Native)
> TOK_SHOW_TABLESTATUS (Native)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14184) Support native execution of SHOW DATABASE command and fix SHOW TABLE to use table identifier pattern

2016-03-27 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-14184:


 Summary: Support native execution of SHOW DATABASE command and fix 
SHOW TABLE to use table identifier pattern
 Key: SPARK-14184
 URL: https://issues.apache.org/jira/browse/SPARK-14184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Dilip Biswal


Need to address the following two scenarios.

1. Support native execution of SHOW DATABASES 
2. Currently native execution of SHOW TABLES is supported with the exception 
that identifier_with_wildcards is not passed to the plan. So
SHOW TABLES 'pattern' fails.

The syntax for the SHOW commands is described in the following link:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables
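
For example, the wildcard forms described in the Hive LanguageManual linked above 
(shown here as a sketch of what should work once the pattern is passed to the plan):

{code}
SHOW DATABASES [LIKE 'identifier_with_wildcards'];
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];

-- e.g. list tables whose names start with "page" or "view"
SHOW TABLES 'page*|view*';
{code}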



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set

2016-03-19 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199097#comment-15199097
 ] 

Dilip Biswal commented on SPARK-13859:
--

I have looked into this issue. After changing the query to use null safe equal 
operators in the join conditions, I get the expected count of rows.
In this case, the data set has lots of NULL values in the columns which are part 
of the join conditions.

For example: {color:green}
ON (tmp1.c_last_name = tmp2.c_last_name) and (tmp1.c_first_name = 
tmp2.c_first_name) and (tmp1.d_date = tmp2.d_date)
{color}
We need to use the null safe equal operator to match these rows. Below is the 
query output:

{code}
spark-sql> 
 > select  count(*) from (
 > select distinct c_last_name, c_first_name, d_date
 > from store_sales
 >  JOIN date_dim ON store_sales.ss_sold_date_sk <=> 
date_dim.d_date_sk
 >  JOIN customer ON store_sales.ss_customer_sk <=> 
customer.c_customer_sk
 > where d_month_seq between 1200 and 1200 + 11) tmp1
 >   JOIN
 > (select distinct c_last_name, c_first_name, d_date
 > from catalog_sales
 >  JOIN date_dim ON catalog_sales.cs_sold_date_sk <=> 
date_dim.d_date_sk
 >  JOIN customer ON catalog_sales.cs_bill_customer_sk <=> 
customer.c_customer_sk
 > where d_month_seq between 1200 and 1200 + 11) tmp2 ON 
(tmp1.c_last_name <=> tmp2.c_last_name) and (tmp1.c_first_name <=> 
tmp2.c_first_name) and (tmp1.d_date <=> tmp2.d_date)
 >   JOIN
 > (
 > select distinct c_last_name, c_first_name, d_date
 > from web_sales
 >  JOIN date_dim ON web_sales.ws_sold_date_sk <=> 
date_dim.d_date_sk
 >  JOIN customer ON web_sales.ws_bill_customer_sk <=> 
customer.c_customer_sk
 > where d_month_seq between 1200 and 1200 + 11) tmp3 ON 
(tmp1.c_last_name <=> tmp3.c_last_name) and (tmp1.c_first_name <=> 
tmp3.c_first_name) and (tmp1.d_date <=> tmp3.d_date)
 >   limit 100
 >  ;
107
{code}

[~jfc...@us.ibm.com] Jesse, could you please try the modified query in your 
environment? Note: I have tried this on my 2.0 dev environment.

> TPCDS query 38 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13859
> URL: https://issues.apache.org/jira/browse/SPARK-13859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 0, answer set reports 107.
> Actual results:
> {noformat}
> [0]
> {noformat}
> Expected:
> {noformat}
> +-+
> |   1 |
> +-+
> | 107 |
> +-+
> {noformat}
> query used:
> {noformat}
> -- start query 38 in stream 0 using template query38.tpl and seed 
> QUALIFICATION
>  select  count(*) from (
> select distinct c_last_name, c_first_name, d_date
> from store_sales
>  JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp1
>   JOIN
> (select distinct c_last_name, c_first_name, d_date
> from catalog_sales
>  JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = 
> tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and 
> (tmp1.d_date = tmp2.d_date) 
>   JOIN
> (
> select distinct c_last_name, c_first_name, d_date
> from web_sales
>  JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = 
> tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and 
> (tmp1.d_date = tmp3.d_date) 
>   limit 100
>  ;
> -- end query 38 in stream 0 using template query38.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13865) TPCDS query 87 returns wrong results compared to TPC official result set

2016-03-19 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199136#comment-15199136
 ] 

Dilip Biswal commented on SPARK-13865:
--

[~smilegator] Quick update on this..

This also seems related to the null safe equal issue. I just put a comment on 
[spark-13859|https://issues.apache.org/jira/browse/SPARK-13859].

Here is the output of the query with the expected count after a similar 
modification.

{code}
spark-sql> select count(*)
 > from 
 >  (select distinct c_last_name as cln1, c_first_name as cfn1, 
d_date as ddate1, 1 as notnull1
 >from store_sales
 > JOIN date_dim ON store_sales.ss_sold_date_sk <=> 
date_dim.d_date_sk
 > JOIN customer ON store_sales.ss_customer_sk <=> 
customer.c_customer_sk
 >where
 >  d_month_seq between 1200 and 1200+11
 >) tmp1
 >left outer join
 >   (select distinct c_last_name as cln2, c_first_name as cfn2, 
d_date as ddate2, 1 as notnull2
 >from catalog_sales
 > JOIN date_dim ON catalog_sales.cs_sold_date_sk <=> 
date_dim.d_date_sk
 > JOIN customer ON catalog_sales.cs_bill_customer_sk <=> 
customer.c_customer_sk
 >where 
 >  d_month_seq between 1200 and 1200+11
 >) tmp2 
 >   on (tmp1.cln1 <=> tmp2.cln2)
 >   and (tmp1.cfn1 <=> tmp2.cfn2)
 >   and (tmp1.ddate1<=> tmp2.ddate2)
 >left outer join
 >   (select distinct c_last_name as cln3, c_first_name as cfn3 , 
d_date as ddate3, 1 as notnull3
 >from web_sales
 > JOIN date_dim ON web_sales.ws_sold_date_sk <=> 
date_dim.d_date_sk
 > JOIN customer ON web_sales.ws_bill_customer_sk <=> 
customer.c_customer_sk
 >where 
 >  d_month_seq between 1200 and 1200+11
 >) tmp3 
 >   on (tmp1.cln1 <=> tmp3.cln3)
 >   and (tmp1.cfn1 <=> tmp3.cfn3)
 >   and (tmp1.ddate1<=> tmp3.ddate3)
 > where  
 > notnull2 is null and notnull3 is null;
47298   
Time taken: 13.561 seconds, Fetched 1 row(s)

{code}

> TPCDS query 87 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13865
> URL: https://issues.apache.org/jira/browse/SPARK-13865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 87 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 47555, answer set expects 47298.
> Actual results:
> {noformat}
> [47555]
> {noformat}
> {noformat}
> Expected:
> +---+
> | 1 |
> +---+
> | 47298 |
> +---+
> {noformat}
> Query used:
> {noformat}
> -- start query 87 in stream 0 using template query87.tpl and seed 
> QUALIFICATION
> select count(*) 
> from 
>  (select distinct c_last_name as cln1, c_first_name as cfn1, d_date as 
> ddate1, 1 as notnull1
>from store_sales
> JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>where
>  d_month_seq between 1200 and 1200+11
>) tmp1
>left outer join
>   (select distinct c_last_name as cln2, c_first_name as cfn2, d_date as 
> ddate2, 1 as notnull2
>from catalog_sales
> JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
>where 
>  d_month_seq between 1200 and 1200+11
>) tmp2 
>   on (tmp1.cln1 = tmp2.cln2)
>   and (tmp1.cfn1 = tmp2.cfn2)
>   and (tmp1.ddate1= tmp2.ddate2)
>left outer join
>   (select distinct c_last_name as cln3, c_first_name as cfn3 , d_date as 
> ddate3, 1 as notnull3
>from web_sales
> JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
> JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
>where

[jira] [Commented] (SPARK-13859) TPCDS query 38 returns wrong results compared to TPC official result set

2016-03-19 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200465#comment-15200465
 ] 

Dilip Biswal commented on SPARK-13859:
--

Hello,

Just checked the original spec for this query on the TPC-DS website. Here is the 
template for Q38.

{code}
[_LIMITA] select [_LIMITB] count(*) from (
select distinct c_last_name, c_first_name, d_date
from store_sales, date_dim, customer
  where store_sales.ss_sold_date_sk = date_dim.d_date_sk
  and store_sales.ss_customer_sk = customer.c_customer_sk
  and d_month_seq between [DMS] and [DMS] + 11
  intersect
select distinct c_last_name, c_first_name, d_date
from catalog_sales, date_dim, customer
  where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
  and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
  and d_month_seq between [DMS] and [DMS] + 11
  intersect
select distinct c_last_name, c_first_name, d_date
from web_sales, date_dim, customer
  where web_sales.ws_sold_date_sk = date_dim.d_date_sk
  and web_sales.ws_bill_customer_sk = customer.c_customer_sk
  and d_month_seq between [DMS] and [DMS] + 11
) hot_cust
[_LIMITC];
{code}

In this case the query in the spec uses the intersect operator, and the 
implicitly generated join conditions use null safe comparison.
In other words, if we ran the query as-is from the spec then it would have worked.

However, the query in this JIRA has user supplied join conditions and uses "=". 
To my knowledge, the semantics of the equal operator in SQL are well defined, 
so I don't think this is a Spark SQL issue.
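
To make the difference concrete, a minimal sketch in the spark-shell (hypothetical 
two-row tables, assuming a Spark 2.x SparkSession named {{spark}}) contrasting "=" 
with the null-safe "<=>" that the intersect rewrite effectively uses:

{code}
import spark.implicits._

// One row per table has a NULL join key.
Seq((Option("smith"), 1), (None: Option[String], 2)).toDF("c_last_name", "id")
  .createOrReplaceTempView("t1")
Seq((Option("smith"), 3), (None: Option[String], 4)).toDF("c_last_name", "id")
  .createOrReplaceTempView("t2")

// "=" never matches NULL keys: NULL = NULL evaluates to unknown, so the count is 1.
spark.sql("SELECT count(*) FROM t1 JOIN t2 ON t1.c_last_name = t2.c_last_name").show()

// "<=>" treats NULL as equal to NULL, so both rows match and the count is 2.
spark.sql("SELECT count(*) FROM t1 JOIN t2 ON t1.c_last_name <=> t2.c_last_name").show()
{code}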

[~rxin] [~marmbrus] Please let us know your thoughts..



> TPCDS query 38 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13859
> URL: https://issues.apache.org/jira/browse/SPARK-13859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: JESSE CHEN
>  Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> SparkSQL returns count of 0, answer set reports 107.
> Actual results:
> {noformat}
> [0]
> {noformat}
> Expected:
> {noformat}
> +-+
> |   1 |
> +-+
> | 107 |
> +-+
> {noformat}
> query used:
> {noformat}
> -- start query 38 in stream 0 using template query38.tpl and seed 
> QUALIFICATION
>  select  count(*) from (
> select distinct c_last_name, c_first_name, d_date
> from store_sales
>  JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp1
>   JOIN
> (select distinct c_last_name, c_first_name, d_date
> from catalog_sales
>  JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON catalog_sales.cs_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp2 ON (tmp1.c_last_name = 
> tmp2.c_last_name) and (tmp1.c_first_name = tmp2.c_first_name) and 
> (tmp1.d_date = tmp2.d_date) 
>   JOIN
> (
> select distinct c_last_name, c_first_name, d_date
> from web_sales
>  JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>  JOIN customer ON web_sales.ws_bill_customer_sk = 
> customer.c_customer_sk
> where d_month_seq between 1200 and 1200 + 11) tmp3 ON (tmp1.c_last_name = 
> tmp3.c_last_name) and (tmp1.c_first_name = tmp3.c_first_name) and 
> (tmp1.d_date = tmp3.d_date) 
>   limit 100
>  ;
> -- end query 38 in stream 0 using template query38.tpl
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-18 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201828#comment-15201828
 ] 

Dilip Biswal commented on SPARK-13821:
--

[~roycecil] Thanks Roy !!

> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 Fails to compile with the follwing Error Message
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-15 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196464#comment-15196464
 ] 

Dilip Biswal edited comment on SPARK-13821 at 3/16/16 4:41 AM:
---

[~roycecil] Just tried the original query no. 20  against spark 2.0 posted at 
https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 .

I could see the same error that is reported in the JIRA. It seems that there is 
an extra comma in the projection list between two columns, like the following.

{code}
select  i_item_id,
   ,i_item_desc
{code}

Please note that we ran against 2.0 and not 1.6. Can you please re-run to make 
sure ?
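
For reference, that part of the projection parses once the stray comma is removed:

{code}
select  i_item_id
   ,i_item_desc
{code}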


was (Author: dkbiswal):
[~roycecil] Just tried the original query no. 20  against spark 2.0 posted at 
https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 .

I could see the same error that is reported in the JIRA. It seems the there is 
an extra comma
in the projection list between two columns like following.

{code}
select  i_item_id,
   ,i_item_desc
{code}

Please note that we ran against 2.0 and not 1.6. Can you please re-run to make 
sure ?

> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 fails to compile with the following error message:
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13821) TPC-DS Query 20 fails to compile

2016-03-15 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196464#comment-15196464
 ] 

Dilip Biswal commented on SPARK-13821:
--

[~roycecil] Just tried the original query no. 20  against spark 2.0 posted at 
https://ibm.app.box.com/sparksql-tpcds-99-queries/5/6794095390/55341651086/1 .

I could see the same error that is reported in the JIRA. It seems that there is 
an extra comma in the projection list between two columns, like the following.

{code}
select  i_item_id,
   ,i_item_desc
{code}

Please note that we ran against 2.0 and not 1.6. Can you please re-run to make 
sure ?

> TPC-DS Query 20 fails to compile
> 
>
> Key: SPARK-13821
> URL: https://issues.apache.org/jira/browse/SPARK-13821
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Red Hat Enterprise Linux Server release 7.1 (Maipo)
> Linux bigaperf116.svl.ibm.com 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 
> 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Roy Cecil
>
> TPC-DS Query 20 fails to compile with the following error message:
> {noformat}
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> Parsing error: NoViableAltException(10@[127:1: selectItem : ( ( 
> tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( 
> expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA 
> identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) 
> );])
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA17.specialStateTransition(HiveParser_SelectClauseParser.java:11835)
> at org.antlr.runtime.DFA.predict(DFA.java:80)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2853)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1401)
> at 
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13698) Fix Analysis Exceptions when Using Backticks in Generate

2016-03-05 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-13698:


 Summary: Fix Analysis Exceptions when Using Backticks in Generate
 Key: SPARK-13698
 URL: https://issues.apache.org/jira/browse/SPARK-13698
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Dilip Biswal


Analysis exception occurs while running the following query.
{code}
SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`
{code}
{code}
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve 
'`ints`' given input columns: [a, `ints`]; line 1 pos 7
'Project ['ints]
+- Generate explode(a#0.b), true, false, Some(a), [`ints`#8]
   +- SubqueryAlias nestedarray
  +- LocalRelation [a#0], 1,2,3
{code}
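
For reference, a minimal way to set up the {{nestedArray}} table used above (the schema is inferred from the plan, so treat this as a sketch):

{code}
// in spark-shell (sqlContext.implicits in scope)
case class Inner(b: Seq[Int])
case class Outer(a: Inner)
// one-row temp table whose column `a` is a struct containing an array field `b`
Seq(Outer(Inner(Seq(1, 2, 3)))).toDF().registerTempTable("nestedArray")

// running the query above against this table triggers the reported AnalysisException
sqlContext.sql("SELECT ints FROM nestedArray LATERAL VIEW explode(a.b) `a` AS `ints`").show()
{code}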



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13651) Generator outputs are not resolved correctly resulting in runtime error

2016-03-03 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-13651:


 Summary: Generator outputs are not resolved correctly resulting in 
runtime error
 Key: SPARK-13651
 URL: https://issues.apache.org/jira/browse/SPARK-13651
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Dilip Biswal


Seq(("id1", "value1")).toDF("key", "value").registerTempTable("src")
sqlContext.sql("SELECT t1.* FROM src LATERAL VIEW explode(map('key1', 100, 
'key2', 200)) t1 AS key, value")

Running above repro results in :

java.lang.ClassCastException: java.lang.Integer cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:221)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:42)
at 
org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:98)
at 
org.apache.spark.sql.execution.Generate$$anonfun$doExecute$1$$anonfun$apply$9.apply(Generate.scala:96)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:876)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:876)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1794)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1794)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:82)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
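
For reference, instead of the exception above, the repro would be expected to return the two exploded map entries, roughly:

{noformat}
[key1,100]
[key2,200]
{noformat}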



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13427) Support USING clause in JOIN

2016-02-21 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-13427:


 Summary: Support USING clause in JOIN
 Key: SPARK-13427
 URL: https://issues.apache.org/jira/browse/SPARK-13427
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Dilip Biswal


Support queries that JOIN tables with USING clause.

SELECT * from table1 JOIN table2 USING 
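
For illustration (the column name {{c}} below is just a placeholder), the intent is the standard SQL behaviour: an equi-join on the named column, with that column appearing only once in a SELECT * result.

{code}
SELECT * FROM table1 JOIN table2 USING (c)
-- roughly: JOIN ... ON table1.c = table2.c, with a single c column in the output
{code}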




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Welcoming two new committers

2016-02-08 Thread Dilip Biswal
Congratulations Wenchen and Herman !! 

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:   Xiao Li 
To: Corey Nolet 
Cc: Ted Yu , Matei Zaharia 
, dev 
Date:   02/08/2016 09:39 AM
Subject:Re: Welcoming two new committers



Congratulations! Herman and Wenchen!  I am just so happy for you! You 
absolutely deserve it!

2016-02-08 9:35 GMT-08:00 Corey Nolet :
Congrats guys! 

On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu  wrote:
Congratulations, Herman and Wenchen.

On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia  
wrote:
Hi all,

The PMC has recently added two new Spark committers -- Herman van Hovell 
and Wenchen Fan. Both have been heavily involved in Spark SQL and 
Tungsten, adding new features, optimizations and APIs. Please join me in 
welcoming Herman and Wenchen.

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org








[jira] [Comment Edited] (SPARK-12988) Can't drop columns that contain dots

2016-02-02 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128902#comment-15128902
 ] 

Dilip Biswal edited comment on SPARK-12988 at 2/2/16 7:56 PM:
--

The subtle difference between column path and column name may not be very 
obvious to a common user of this API. 

val df = Seq((1, 1)).toDF("a_b", "a.b")
df.select("`a.b`")
df.drop("`a.b`") => the fact that one can not use back tick here , would it be 
that obvious to the user ?

I believe that was the motivation to allow it but then i am not sure of its 
implications.
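
As a point of reference, a literal-name interpretation of {{drop}} (the direction suggested in the description below) would behave like this sketch, which simply filters on the exact schema field name:

{code}
val df = Seq((1, 1)).toDF("a_b", "a.c")
// keep every column whose name is not exactly "a.c"; backticks protect any dot when re-selecting
val dropped = df.select(df.columns.filter(_ != "a.c").map(c => df.col(s"`$c`")): _*)
dropped.printSchema()   // only a_b remains
{code}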


was (Author: dkbiswal):
The shuttle difference between column path and column name may not be very 
obvious to a common user of this API. 

val df = Seq((1, 1)).toDF("a_b", "a.b")
df.select("`a.b`")
df.drop("`a.b`") => the fact that one can not use back tick here , would it be 
that obvious to the user ?

I believe that was the motivation to allow it but then i am not sure of its 
implications.

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-02-02 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128902#comment-15128902
 ] 

Dilip Biswal commented on SPARK-12988:
--

The subtle difference between column path and column name may not be very 
obvious to a common user of this API. 

val df = Seq((1, 1)).toDF("a_b", "a.b")
df.select("`a.b`")
df.drop("`a.b`") => would the fact that one cannot use a backtick here be obvious to the user?

I believe that was the motivation to allow it, but I am not sure of its implications.

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-01-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118775#comment-15118775
 ] 

Dilip Biswal commented on SPARK-12988:
--

[~marmbrus][~rxin] Thanks for your input.

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12988) Can't drop columns that contain dots

2016-01-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118319#comment-15118319
 ] 

Dilip Biswal edited comment on SPARK-12988 at 1/27/16 12:02 AM:


[~marmbrus] Hi Michael, need your input on the semantics.

Say we have a dataframe defined like the following:
val df = Seq((1, 1,1)).toDF("a_b", "a.c", "`a.c`")

df.drop("a.c")  => Should we remove the 2nd column here ? 
df.drop("`a.c`") => Should we remove the 3rd column here ?

Regards,
-- Dilip


was (Author: dkbiswal):
[~marmbrus] Hi Michael, need your input on the semantics.

Say we have a dataframe defined like following :
val df = Seq((1, 1,1,1,1,1)).toDF("a_b", "a.c", "`a.c`")

df.drop("a.c")  => Should we remove the 2nd column here ? 
df.drop("`a.c`") => Should we remove the 3rd column here ?

Regards,
-- Dilip

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-01-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118319#comment-15118319
 ] 

Dilip Biswal commented on SPARK-12988:
--

[~marmbrus] Hi Michael, need your input on the semantics.

Say we have a dataframe defined like the following:
val df = Seq((1, 1,1,1,1,1)).toDF("a_b", "a.c", "`a.c`")

df.drop("a.c")  => Should we remove the 2nd column here ? 
df.drop("`a.c`") => Should we remove the 3rd column here ?

Regards,
-- Dilip

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-01-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117509#comment-15117509
 ] 

Dilip Biswal commented on SPARK-12988:
--

I would like to work on this one.

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12558) AnalysisException when multiple functions applied in GROUP BY clause

2015-12-29 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074508#comment-15074508
 ] 

Dilip Biswal commented on SPARK-12558:
--

I would like to work on this one.
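
In the meantime, a possible workaround sketch (not verified here) is to alias the expression in a subquery, so the outer GROUP BY only refers to the alias:

{code}
sqlCtx.sql("""
  select d from (
    select date(cast(test_date as timestamp)) as d from df
  ) t
  group by d
""").collect()
{code}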

> AnalysisException when multiple functions applied in GROUP BY clause
> 
>
> Key: SPARK-12558
> URL: https://issues.apache.org/jira/browse/SPARK-12558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following issue when trying to use functions in the group by clause. 
> Example:
> {code}
> sqlCtx = HiveContext(sc)
> rdd = sc.parallelize([{'test_date': 1451400761}])
> df = sqlCtx.createDataFrame(rdd)
> df.registerTempTable("df")
> {code}
> Now, where I'm using a single function it's OK.
> {code}
> sqlCtx.sql("select cast(test_date as timestamp) from df group by 
> cast(test_date as timestamp)").collect()
> [Row(test_date=datetime.datetime(2015, 12, 29, 15, 52, 41))]
> {code}
> Where I'm using more than one function I'm getting an AnalysisException
> {code}
> sqlCtx.sql("select date(cast(test_date as timestamp)) from df group by 
> date(cast(test_date as timestamp))").collect()
> Py4JJavaError: An error occurred while calling o38.sql.
> : org.apache.spark.sql.AnalysisException: expression 'test_date' is neither 
> present in the group by, nor is it an aggregate function. Add to group by or 
> wrap in first() (or first_value) if you don't care which value you get.;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12458) Add ExpressionDescription to datetime functions

2015-12-21 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067278#comment-15067278
 ] 

Dilip Biswal commented on SPARK-12458:
--

I would like to work on this one.

> Add ExpressionDescription to datetime functions
> ---
>
> Key: SPARK-12458
> URL: https://issues.apache.org/jira/browse/SPARK-12458
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12398) Smart truncation of DataFrame / Dataset toString

2015-12-17 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062527#comment-15062527
 ] 

Dilip Biswal commented on SPARK-12398:
--

[~rxin] Hi Reynold, are you working on this? If not, I would like to take a shot at fixing it.
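
A rough standalone sketch of the kind of truncation described below (the field cutoff of 2 is only for illustration):

{code}
// given (name, type) pairs from a schema, show at most `maxFields` and summarize the rest
def truncatedSchemaString(fields: Seq[(String, String)], maxFields: Int = 2): String = {
  val shown = fields.take(maxFields).map { case (n, t) => s"$n: $t" }
  val more = fields.length - maxFields
  val suffix = if (more > 0) s" ... $more more fields" else ""
  shown.mkString("[", ", ", suffix + "]")
}

truncatedSchemaString(Seq("a" -> "int", "b" -> "string", "c" -> "double"))
// -> [a: int, b: string ... 1 more fields]
{code}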

> Smart truncation of DataFrame / Dataset toString
> 
>
> Key: SPARK-12398
> URL: https://issues.apache.org/jira/browse/SPARK-12398
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: starter
>
> When a DataFrame or Dataset has a long schema, we should intelligently 
> truncate to avoid flooding the screen with unreadable information.
> {code}
> // Standard output
> [a: int, b: int]
> // Truncate many top level fields
> [a: int, b: string ... 10 more fields]
> // Truncate long inner structs
> [a: struct]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12359) Add showString() to DataSet API.

2015-12-15 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-12359:


 Summary: Add showString() to DataSet API.
 Key: SPARK-12359
 URL: https://issues.apache.org/jira/browse/SPARK-12359
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Dilip Biswal
Priority: Minor


JIRA 12105 exposed showString and its variants as public API. This adds the two 
APIs into DataSet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12257) Non partitioned insert into a partitioned Hive table doesn't fail

2015-12-10 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050319#comment-15050319
 ] 

Dilip Biswal commented on SPARK-12257:
--

Was able to reproduce this issue. Looking into it.

> Non partitioned insert into a partitioned Hive table doesn't fail
> -
>
> Key: SPARK-12257
> URL: https://issues.apache.org/jira/browse/SPARK-12257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Mark Grover
>Priority: Minor
>
> I am using Spark 1.5.1 but I anticipate this to be a problem with master as 
> well (will check later).
> I have a dataframe, and a partitioned Hive table that I want to insert the 
> contents of the data frame into.
> Let's say mytable is a non-partitioned Hive table and mytable_partitioned is 
> a partitioned Hive table. In Hive, if you try to insert from the 
> non-partitioned mytable table into mytable_partitioned without specifying the 
> partition, the query fails, as expected:
> {quote}
> INSERT INTO mytable_partitioned SELECT * FROM mytable;
> {quote}
> {quote}
> Error: Error while compiling statement: FAILED: SemanticException 1:12 Need 
> to specify partition columns because the destination table is partitioned. 
> Error encountered near token 'mytable_partitioned' (state=42000,code=4)
> {quote}
> However, if I do the same in Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned SELECT * FROM 
> my_df_temp_table")
> {code}
> This appears to succeed but does no insertion. This should fail with an error 
> stating the data is being inserted into a partitioned table without 
> specifying the name of the partition.
> Of course, if the name of the partition is explicitly specified, both Hive and 
> Spark SQL do the right thing and function correctly.
> In hive:
> {code}
> INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * FROM mytable;
> {code}
> In Spark SQL:
> {code}
> val myDfTempTable = myDf.registerTempTable("my_df_temp_table")
> sqlContext.sql("INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * 
> FROM my_df_temp_table")
> {code}
> And, here are the definitions of my tables, as reference:
> {code}
> CREATE TABLE mytable(x INT);
> CREATE TABLE mytable_partitioned (x INT) PARTITIONED BY (y INT);
> {code}
> You will also need to insert some dummy data into mytable to ensure that the 
> insertion is actually not working:
> {code}
> #!/bin/bash
> rm -rf data.txt;
> for i in {0..9}; do
> echo $i >> data.txt
> done
> sudo -u hdfs hadoop fs -put data.txt /user/hive/warehouse/mytable
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11949) Query on DataFrame from cube gives wrong results

2015-11-28 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15030718#comment-15030718
 ] 

Dilip Biswal commented on SPARK-11949:
--

I would like to work on this issue.
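
The direction suggested in the description below (marking the cube's grouping columns as nullable in the output schema) would look roughly like this sketch at the schema level:

{code}
// illustrative only: rebuild the schema with the cube dimensions forced to nullable = true
val groupingCols = Set("date", "hour", "minute", "room_name")
val nullableFields = df0.schema.fields.map { f =>
  if (groupingCols.contains(f.name)) f.copy(nullable = true) else f
}
val nullableSchema = org.apache.spark.sql.types.StructType(nullableFields)
{code}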

> Query on DataFrame from cube gives wrong results
> 
>
> Key: SPARK-11949
> URL: https://issues.apache.org/jira/browse/SPARK-11949
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Veli Kerim Celik
>  Labels: dataframe, sql
>
> {code:title=Reproduce bug|borderStyle=solid}
> case class fact(date: Int, hour: Int, minute: Int, room_name: String, temp: 
> Double)
> val df0 = sc.parallelize(Seq
> (
> fact(20151123, 18, 35, "room1", 18.6),
> fact(20151123, 18, 35, "room2", 22.4),
> fact(20151123, 18, 36, "room1", 17.4),
> fact(20151123, 18, 36, "room2", 25.6)
> )).toDF()
> val cube0 = df0.cube("date", "hour", "minute", "room_name").agg(Map
> (
> "temp" -> "avg"
> ))
> cube0.where("date IS NULL").show()
> {code}
> The query result is empty. It should not be, because cube0 contains the value 
> null several times in column 'date'. The issue arises because the cube 
> function reuses the schema information from df0. If I change the type of 
> parameters in the case class to Option[T] the query gives correct results.
> Solution: The cube function should change the schema by changing the nullable 
> property to true, for the columns (dimensions) specified in the method call 
> parameters.
> I am new at Scala and Spark. I don't know how to implement this. Somebody 
> please do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11997) NPE when save a DataFrame as parquet and partitioned by long column

2015-11-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028334#comment-15028334
 ] 

Dilip Biswal edited comment on SPARK-11997 at 11/26/15 8:32 AM:


I would like to work on this issue. Currently testing the patch.


was (Author: dkbiswal):
I would like to work on this issue.

> NPE when save a DataFrame as parquet and partitioned by long column
> ---
>
> Key: SPARK-11997
> URL: https://issues.apache.org/jira/browse/SPARK-11997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Priority: Blocker
>
> {code}
> >>> sqlContext.range(1<<20).selectExpr("if(id % 10 = 0, null, (id % 111) - 
> >>> 50) AS n", "id").write.partitionBy("n").parquet("myid3")
> 15/11/25 12:05:57 ERROR InsertIntoHadoopFsRelation: Aborting job.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:32)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:610)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:608)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1(interfaces.scala:608)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:616)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:615)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:615)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.refresh(interfaces.scala:590)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh(ParquetRelation.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
>   at 
> org.ap

[jira] [Commented] (SPARK-11997) NPE when save a DataFrame as parquet and partitioned by long column

2015-11-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15028334#comment-15028334
 ] 

Dilip Biswal commented on SPARK-11997:
--

I would like to work on this issue.

> NPE when save a DataFrame as parquet and partitioned by long column
> ---
>
> Key: SPARK-11997
> URL: https://issues.apache.org/jira/browse/SPARK-11997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Priority: Blocker
>
> {code}
> >>> sqlContext.range(1<<20).selectExpr("if(id % 10 = 0, null, (id % 111) - 
> >>> 50) AS n", "id").write.partitionBy("n").parquet("myid3")
> 15/11/25 12:05:57 ERROR InsertIntoHadoopFsRelation: Aborting job.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.InternalRow.getString(InternalRow.scala:32)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:610)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1$1.apply(interfaces.scala:608)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$castPartitionValuesToUserSchema$1(interfaces.scala:608)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:616)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions$1.apply(interfaces.scala:615)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:615)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.refresh(interfaces.scala:590)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.refresh(ParquetRelation.scala:204)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:152)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.

[jira] [Created] (SPARK-11863) Unable to resolve order by if it contains mixture of aliases and real columns.

2015-11-19 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-11863:


 Summary: Unable to resolve order by if it contains mixture of 
aliases and real columns.
 Key: SPARK-11863
 URL: https://issues.apache.org/jira/browse/SPARK-11863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Dilip Biswal


The analyzer is unable to resolve the order by clause if it contains a mixture of 
aliases and real column names.

Example :
var var3 = sqlContext.sql("select c1 as a, c2 as b from inttab group by c1, c2 
order by  b, c1")

This used to work in 1.4 and started failing in 1.5; it affects some TPC-DS 
queries (19, 55, 71).
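
For reference, ordering purely by the aliases or purely by the underlying columns appears to resolve; only the mixed form fails. A likely workaround (not verified here) is therefore to avoid mixing the two, e.g.:

{code}
sqlContext.sql("select c1 as a, c2 as b from inttab group by c1, c2 order by b, a")
sqlContext.sql("select c1 as a, c2 as b from inttab group by c1, c2 order by c2, c1")
{code}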



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11584) The attribute of temporay table shows false

2015-11-09 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997265#comment-14997265
 ] 

Dilip Biswal commented on SPARK-11584:
--

I would like to work on this issue.
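
A minimal repro sketch of the behaviour described below (assuming a HiveContext-backed sqlContext):

{code}
sqlContext.sql("create temporary table t1 (c1 int)")
sqlContext.sql("show tables").show()
// reported behaviour: the isTemporary column shows false for t1
{code}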

> The attribute of temporay table shows false 
> 
>
> Key: SPARK-11584
> URL: https://issues.apache.org/jira/browse/SPARK-11584
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jay
>Priority: Minor
>
> After using the command "create temporary table tableName" to create a table, 
> I would expect the isTemporary attribute of that table to be true, but "show 
> tables" reports it as false. So, I am confused and hope somebody can solve 
> this problem.
> Note: The command is just "create temporary table tableName" without "using". 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11577) Handle code review comments for SPARK-11188

2015-11-08 Thread Dilip Biswal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-11577:
-
Summary: Handle code review comments for SPARK-11188  (was: Suppress 
stacktraces in bin/spark-sql for AnalysisExceptions)

> Handle code review comments for SPARK-11188
> ---
>
> Key: SPARK-11577
> URL: https://issues.apache.org/jira/browse/SPARK-11577
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
>    Reporter: Dilip Biswal
>Priority: Minor
> Fix For: 1.5.2
>
>
> The fix for SPARK-11188 suppressed printing stack traces to the console when 
> encountering an AnalysisException. This JIRA completes SPARK-11188 by 
> addressing the code review comments from Michael.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11577) Suppress stacktraces in bin/spark-sql for AnalysisExceptions

2015-11-08 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-11577:


 Summary: Suppress stacktraces in bin/spark-sql for 
AnalysisExceptions
 Key: SPARK-11577
 URL: https://issues.apache.org/jira/browse/SPARK-11577
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.2
Reporter: Dilip Biswal
Priority: Minor
 Fix For: 1.5.2


The fix for SPARK-11188 suppressed printing stack traces to the console when 
encountering an AnalysisException. This JIRA completes SPARK-11188 by addressing 
the code review comments from Michael.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11544) sqlContext doesn't use PathFilter

2015-11-07 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995476#comment-14995476
 ] 

Dilip Biswal edited comment on SPARK-11544 at 11/8/15 4:52 AM:
---

I would like to work on this issue.


was (Author: dkbiswal):
I am looking into this issue.

> sqlContext doesn't use PathFilter
> -
>
> Key: SPARK-11544
> URL: https://issues.apache.org/jira/browse/SPARK-11544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: AWS EMR 4.1.0, Spark 1.5.0
>Reporter: Frank Dai
>
> When sqlContext reads JSON files, it doesn't use {{PathFilter}} in the 
> underlying SparkContext
> {code:java}
> val sc = new SparkContext(conf)
> sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", 
> classOf[TmpFileFilter], classOf[PathFilter])
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> {code}
> The definition of {{TmpFileFilter}} is:
> {code:title=TmpFileFilter.scala|borderStyle=solid}
> import org.apache.hadoop.fs.{Path, PathFilter}
> class TmpFileFilter  extends PathFilter {
>   override def accept(path : Path): Boolean = !path.getName.endsWith(".tmp")
> }
> {code}
> When using {{sqlContext}} to read JSON files, e.g., 
> {{sqlContext.read.schema(mySchema).json(s3Path)}}, Spark will throw out an 
> exception:
> {quote}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> s3://chef-logstash-access-backup/2015/10/21/00/logstash-172.18.68.59-s3.1445388158944.gz.tmp
> {quote}
> It seems {{sqlContext}} can see {{.tmp}} files while {{sc}} can not, which 
> causes the above exception



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11544) sqlContext doesn't use PathFilter

2015-11-07 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995476#comment-14995476
 ] 

Dilip Biswal commented on SPARK-11544:
--

I am looking into this issue.

> sqlContext doesn't use PathFilter
> -
>
> Key: SPARK-11544
> URL: https://issues.apache.org/jira/browse/SPARK-11544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: AWS EMR 4.1.0, Spark 1.5.0
>Reporter: Frank Dai
>
> When sqlContext reads JSON files, it doesn't use {{PathFilter}} in the 
> underlying SparkContext
> {code:java}
> val sc = new SparkContext(conf)
> sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", 
> classOf[TmpFileFilter], classOf[PathFilter])
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> {code}
> The definition of {{TmpFileFilter}} is:
> {code:title=TmpFileFilter.scala|borderStyle=solid}
> import org.apache.hadoop.fs.{Path, PathFilter}
> class TmpFileFilter  extends PathFilter {
>   override def accept(path : Path): Boolean = !path.getName.endsWith(".tmp")
> }
> {code}
> When using {{sqlContext}} to read JSON files, e.g., 
> {{sqlContext.read.schema(mySchema).json(s3Path)}}, Spark will throw out an 
> exception:
> {quote}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> s3://chef-logstash-access-backup/2015/10/21/00/logstash-172.18.68.59-s3.1445388158944.gz.tmp
> {quote}
> It seems {{sqlContext}} can see {{.tmp}} files while {{sc}} can not, which 
> causes the above exception



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Master build fails ?

2015-11-05 Thread Dilip Biswal
Hello Ted,

Thanks for your response.

Here is the command i used :

build/sbt clean
build/sbt -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
-Dhadoop.version=2.6.0 -DskipTests assembly

I am building on CentOS and on master branch.

One other thing: I was able to build fine with the above command up until 
recently. I think I started having problems after SPARK-11073, where the 
HashCodes import was added.

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:   Ted Yu 
To: Dilip Biswal/Oakland/IBM@IBMUS
Cc: Jean-Baptiste Onofré , "dev@spark.apache.org" 

Date:   11/05/2015 10:46 AM
Subject:Re: Master build fails ?



Dilip:
Can you give the command you used ?

Which release were you building ?
What OS did you build on ?

Cheers

On Thu, Nov 5, 2015 at 10:21 AM, Dilip Biswal  wrote:
Hello,

I am getting the same build error about not being able to find 
com.google.common.hash.HashCodes.

Is there a solution to this ?

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:Jean-Baptiste Onofré 
To:Ted Yu 
Cc:"dev@spark.apache.org" 
Date:11/03/2015 07:20 AM
Subject:Re: Master build fails ?



Hi Ted,

thanks for the update. The build with sbt is in progress on my box.

Regards
JB

On 11/03/2015 03:31 PM, Ted Yu wrote:
> Interesting, Sbt builds were not all failing:
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/
>
> FYI
>
> On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré  <mailto:j...@nanthrax.net>> wrote:

>
> Hi Jacek,
>
> it works fine with mvn: the problem is with sbt.
>
> I suspect a different reactor order in sbt compared to mvn.
>
> Regards
> JB
>
> On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
>
> Hi,
>
> Just built the sources using the following command and it worked
> fine.
>
> ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
> -DskipTests clean install
> ...
> [INFO]
> 

> [INFO] BUILD SUCCESS
> [INFO]
> 

> [INFO] Total time: 14:15 min
> [INFO] Finished at: 2015-11-03T14:40:40+01:00
> [INFO] Final Memory: 438M/1972M
> [INFO]
> 

>
> ➜  spark git:(master) ✗ java -version
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>
> I'm on Mac OS.
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | 
http://blog.japila.pl|
> http://blog.jaceklaskowski.pl

> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski

>
>
> On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré
> mailto:j...@nanthrax.net>> wrote:
>
> Thanks for the update, I used mvn to build but without hive
> profile.
>
> Let me try with mvn with the same options as you and sbt 
also.
>
> I keep you posted.
>
> Regards
> JB
>
> On 11/03/2015 12:55 PM, Jeff Zhang wrote:
>
>
> I found it is due to SPARK-11073.
>
> Here's the command I used to build
>
> build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive
> -Phive-thriftserver
> -Psparkr
>
> On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré
> mailto:j...@nanthrax.net>
> <mailto:j...@nanthrax.net<mailto:j...@nanthrax.net>>> wrote:

>
>   Hi Jeff,
>
>   it works for me (with skipping the tests).
>
>   Let me try again, just to be sure.
>
>   Regards
>   JB
>
>
>   On 11/03/2015 11:50 AM, Jeff Zhang wrote:
>
>   Looks like it's due to guava version
> conflicts, I see both guava
>   14.0.1
>   and 16.0.1 under lib_managed/bundles. Anyone
> meet this issue too ?
>
>   [error]
>
> 
/Users/jzhang/github/spark_apache/cor

Re: Master build fails ?

2015-11-05 Thread Dilip Biswal
Hello,

I am getting the same build error about not being able to find 
com.google.common.hash.HashCodes.

Is there a solution to this ?

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:   Jean-Baptiste Onofré 
To: Ted Yu 
Cc: "dev@spark.apache.org" 
Date:   11/03/2015 07:20 AM
Subject:Re: Master build fails ?



Hi Ted,

thanks for the update. The build with sbt is in progress on my box.

Regards
JB

On 11/03/2015 03:31 PM, Ted Yu wrote:
> Interesting, Sbt builds were not all failing:
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/
>
> FYI
>
> On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré  <mailto:j...@nanthrax.net>> wrote:
>
> Hi Jacek,
>
> it works fine with mvn: the problem is with sbt.
>
> I suspect a different reactor order in sbt compared to mvn.
>
> Regards
> JB
>
> On 11/03/2015 02:44 PM, Jacek Laskowski wrote:
>
> Hi,
>
> Just built the sources using the following command and it worked
> fine.
>
> ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
> -DskipTests clean install
> ...
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 14:15 min
> [INFO] Finished at: 2015-11-03T14:40:40+01:00
> [INFO] Final Memory: 438M/1972M
> [INFO]
> 
>
> ➜  spark git:(master) ✗ java -version
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>
> I'm on Mac OS.
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | http://blog.japila.pl |
> http://blog.jaceklaskowski.pl
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré
> mailto:j...@nanthrax.net>> wrote:
>
> Thanks for the update, I used mvn to build but without hive
> profile.
>
> Let me try with mvn with the same options as you and sbt 
also.
>
> I keep you posted.
>
> Regards
> JB
>
> On 11/03/2015 12:55 PM, Jeff Zhang wrote:
>
>
> I found it is due to SPARK-11073.
>
> Here's the command I used to build
>
> build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive
> -Phive-thriftserver
> -Psparkr
>
> On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré
> mailto:j...@nanthrax.net>
> <mailto:j...@nanthrax.net <mailto:j...@nanthrax.net>>> 
wrote:
>
>   Hi Jeff,
>
>   it works for me (with skipping the tests).
>
>   Let me try again, just to be sure.
>
>   Regards
>   JB
>
>
>   On 11/03/2015 11:50 AM, Jeff Zhang wrote:
>
>   Looks like it's due to guava version
> conflicts, I see both guava
>   14.0.1
>   and 16.0.1 under lib_managed/bundles. Anyone
> meet this issue too ?
>
>   [error]
>
> 
/Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:26:
>   object HashCodes is not a member of package
> com.google.common.hash
>   [error] import 
com.google.common.hash.HashCodes
>   [error]^
>   [info] Resolving
> org.apache.commons#commons-math;2.2 ...
>   [error]
>
> 
/Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:384:
>   not found: value HashCodes
>   [error] val cookie =
> HashCodes.fromBytes(secret).toString()
>   [error]  ^
>
>
>
>
>   --
>   Best Regards
>
>

Re: SPARK SQL Error

2015-10-15 Thread Dilip Biswal
Hi Giri,

You are perhaps  missing the "--files" option before the supplied hdfs 
file name ?

spark-submit --master yarn --class org.spark.apache.CsvDataSource
/home/cloudera/Desktop/TestMain.jar  --files 
hdfs://quickstart.cloudera:8020/people_csv

Please refer to Ritchard's comments on why the --files option may be 
redundant in 
your case. 

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:   Giri 
To: user@spark.apache.org
Date:   10/15/2015 02:44 AM
Subject:Re: SPARK SQL Error



Hi Ritchard,

Thank you so much again for your input. This time I ran the command in the 
following way:
spark-submit --master yarn --class org.spark.apache.CsvDataSource
/home/cloudera/Desktop/TestMain.jar 
hdfs://quickstart.cloudera:8020/people_csv
But I am facing the new error "Could not parse Master URL:
'hdfs://quickstart.cloudera:8020/people_csv'"
file path is correct
 
hadoop fs -ls hdfs://quickstart.cloudera:8020/people_csv
-rw-r--r--   1 cloudera supergroup 29 2015-10-10 00:02
hdfs://quickstart.cloudera:8020/people_csv

Can you help me to fix this new error

15/10/15 02:24:39 INFO spark.SparkContext: Added JAR
file:/home/cloudera/Desktop/TestMain.jar at
http://10.0.2.15:40084/jars/TestMain.jar with timestamp 1444901079484
Exception in thread "main" org.apache.spark.SparkException: Could not parse
Master URL: 'hdfs://quickstart.cloudera:8020/people_csv'
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2244)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:361)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:154)
        at org.spark.apache.CsvDataSource$.main(CsvDataSource.scala:10)
        at org.spark.apache.CsvDataSource.main(CsvDataSource.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Thanks & Regards,
Giri.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-Error-tp25050p25075.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet

2015-10-14 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958109#comment-14958109
 ] 

Dilip Biswal commented on SPARK-10943:
--

Hi Jason,

From the Parquet format page, here are the data types that are supported in
Parquet:

BOOLEAN: 1 bit boolean
INT32: 32 bit signed ints
INT64: 64 bit signed ints
INT96: 96 bit signed ints
FLOAT: IEEE 32-bit floating point values
DOUBLE: IEEE 64-bit floating point values
BYTE_ARRAY: arbitrarily long byte arrays.

In your test case, you are trying to write an un-typed null value, and there is
no mapping from this type (NullType) to the built-in types supported by Parquet.

[~marmbrus] is this a valid scenario?
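
As a possible workaround (just a sketch of one option, not a confirmed fix),
the un-typed NULL could be cast to a concrete type before writing, so the
Parquet schema converter sees a supported type such as string instead of
NullType:

{code}
// Cast the untyped NULL to a concrete type so the column becomes
// StringType (nullable) rather than NullType before writing to Parquet.
val data02 = sqlContext.sql(
  "select 1 as id, \"cat in the hat\" as text, cast(null as string) as comments")
data02.write.parquet("/tmp/celtra-test/dataset2")

// Equivalent DataFrame form for an existing NullType column:
// import org.apache.spark.sql.types.StringType
// import org.apache.spark.sql.functions.col
// data02.withColumn("comments", col("comments").cast(StringType))
{code}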

Regards,
-- Dilip


> NullType Column cannot be written to Parquet
> 
>
> Key: SPARK-10943
> URL: https://issues.apache.org/jira/browse/SPARK-10943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jason Pohl
>
> var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null 
> as comments")
> //FAIL - Try writing a NullType column (where all the values are NULL)
> data02.write.parquet("/tmp/celtra-test/dataset2")
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 179.0 (TID 39924, 10.0.196.208): 
> org.apache.spark.sql.AnalysisException: Unsupported data type 
> StructField(comments,NullType,true).dataType;
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at org.apache.spark.sql.types.Stru

[jira] [Created] (SPARK-11024) Optimize NULL in by folding it to Literal(null)

2015-10-09 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-11024:


 Summary: Optimize NULL in  by folding it to 
Literal(null)
 Key: SPARK-11024
 URL: https://issues.apache.org/jira/browse/SPARK-11024
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Dilip Biswal
Priority: Minor


Add a rule in the optimizer to convert NULL [NOT] IN (expr1, ..., expr2) to
Literal(null).

This is a follow-up to SPARK-8654, suggested by Wenchen Fan.
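
Roughly, the rule could take a shape like the following (a sketch only, not
the final patch; the object name is a placeholder):

{code}
import org.apache.spark.sql.catalyst.expressions.{In, Literal, Not}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// Fold NULL [NOT] IN (...) to a null boolean literal: under SQL three-valued
// logic the predicate is always NULL, whatever the list contains.
object NullInToLiteral extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case In(Literal(null, _), _)      => Literal.create(null, BooleanType)
    case Not(In(Literal(null, _), _)) => Literal.create(null, BooleanType)
  }
}
{code}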






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11024) Optimize NULL in by folding it to Literal(null)

2015-10-09 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950124#comment-14950124
 ] 

Dilip Biswal commented on SPARK-11024:
--

I am currently working on a PR for this issue.

> Optimize NULL in  by folding it to Literal(null)
> 
>
> Key: SPARK-11024
> URL: https://issues.apache.org/jira/browse/SPARK-11024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dilip Biswal
>Priority: Minor
>
> Add a rule in optimizer to convert NULL [NOT] IN (expr1,...,expr2) to
> Literal(null). 
> This is a follow up defect to SPARK-8654 and suggested by Wenchen Fan.
> 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10534) ORDER BY clause allows only columns that are present in SELECT statement

2015-10-05 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944462#comment-14944462
 ] 

Dilip Biswal commented on SPARK-10534:
--

I would like to work on this.

> ORDER BY clause allows only columns that are present in SELECT statement
> 
>
> Key: SPARK-10534
> URL: https://issues.apache.org/jira/browse/SPARK-10534
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michal Cwienczek
>
> When invoking query SELECT EmployeeID from Employees order by YEAR(HireDate) 
> Spark 1.5 throws exception:
> {code}
> cannot resolve 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given 
> input columns EmployeeID; line 2 pos 14 StackTrace: 
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'MsSqlNorthwindJobServerTested_dbo_Employees.HireDate' given input columns 
> EmployeeID; line 2 pos 14
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$7.apply(TreeNode.scala:268)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

[jira] [Commented] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast

2015-09-28 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934731#comment-14934731
 ] 

Dilip Biswal commented on SPARK-8654:
-

I would like to work on this issue.

> Analysis exception when using "NULL IN (...)": invalid cast
> ---
>
> Key: SPARK-8654
> URL: https://issues.apache.org/jira/browse/SPARK-8654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Santiago M. Mola
>Priority: Minor
>
> The following query throws an analysis exception:
> {code}
> SELECT * FROM t WHERE NULL NOT IN (1, 2, 3);
> {code}
> The exception is:
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from int to null;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
> {code}
> Here is a test that can be added to AnalysisSuite to check the issue:
> {code}
>   test("SPARK- regression test") {
> val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), 
> "a")() :: Nil,
>   LocalRelation()
> )
> caseInsensitiveAnalyze(plan)
>   }
> {code}
> Note that this kind of query is a corner case, but it is still valid SQL. An 
> expression such as "NULL IN (...)" or "NULL NOT IN (...)" always gives NULL 
> as a result, even if the list contains NULL. So it is safe to translate these 
> expressions to Literal(null) during analysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


