[jira] [Assigned] (SPARK-17736) Update R README for rmarkdown, pandoc

2016-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17736:


Assignee: (was: Apache Spark)

> Update R README for rmarkdown, pandoc
> -
>
> Key: SPARK-17736
> URL: https://issues.apache.org/jira/browse/SPARK-17736
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> To build R docs (which are built when R tests are run), users need to install 
> pandoc and rmarkdown.  This was done for Jenkins in [SPARK-17420].
> We should document these dependencies here:
> [https://github.com/apache/spark/blob/master/docs/README.md#prerequisites]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17736) Update R README for rmarkdown, pandoc

2016-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535081#comment-15535081
 ] 

Apache Spark commented on SPARK-17736:
--

User 'jagadeesanas2' has created a pull request for this issue:
https://github.com/apache/spark/pull/15309

> Update R README for rmarkdown, pandoc
> -
>
> Key: SPARK-17736
> URL: https://issues.apache.org/jira/browse/SPARK-17736
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> To build R docs (which are built when R tests are run), users need to install 
> pandoc and rmarkdown.  This was done for Jenkins in [SPARK-17420].
> We should document these dependencies here:
> [https://github.com/apache/spark/blob/master/docs/README.md#prerequisites]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17736) Update R README for rmarkdown, pandoc

2016-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17736:


Assignee: Apache Spark

> Update R README for rmarkdown, pandoc
> -
>
> Key: SPARK-17736
> URL: https://issues.apache.org/jira/browse/SPARK-17736
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> To build R docs (which are built when R tests are run), users need to install 
> pandoc and rmarkdown.  This was done for Jenkins in [SPARK-17420].
> We should document these dependencies here:
> [https://github.com/apache/spark/blob/master/docs/README.md#prerequisites]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17717) Add existence checks to user facing catalog

2016-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535062#comment-15535062
 ] 

Apache Spark commented on SPARK-17717:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15308

> Add existence checks to user facing catalog
> ---
>
> Key: SPARK-17717
> URL: https://issues.apache.org/jira/browse/SPARK-17717
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.1.0
>
>
> It is currently quite cumbersome to check whether an object exists using the 
> (user facing) Catalog. Let's add a few methods for this.
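
For reference, a minimal sketch of the kind of existence check that is currently 
cumbersome, written against the existing Spark 2.0 catalog listings only (the 
object and helper names below are illustrative, not the API proposed by this 
ticket):

{code}
import org.apache.spark.sql.SparkSession

// Sketch only: emulate "does it exist?" checks by scanning catalog listings.
// The proposed API would expose these checks directly on the Catalog.
object CatalogExistenceSketch {
  def databaseExists(spark: SparkSession, db: String): Boolean =
    spark.catalog.listDatabases().collect().exists(_.name == db)

  def tableExists(spark: SparkSession, db: String, table: String): Boolean =
    databaseExists(spark, db) &&
      spark.catalog.listTables(db).collect().exists(_.name == table)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("catalog-existence-sketch").getOrCreate()
    println(databaseExists(spark, "default"))      // true on a fresh session
    println(tableExists(spark, "default", "t_x"))  // false unless t_x was created
    spark.stop()
  }
}
{code}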



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17737) cannot import name accumulators error

2016-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17737:
--
Flags:   (was: Important)

I suspect it's a problem with how you've got ipython set up to run pyspark. If 
pyspark by itself works, then that should confirm it. That wouldn't be a Spark 
issue per se.

> cannot import name accumulators error
> -
>
> Key: SPARK-17737
> URL: https://issues.apache.org/jira/browse/SPARK-17737
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
> Environment: unix
> python 2.7
>Reporter: Pruthveej Reddy Kasarla
>
> Hi, I am trying to set up my SparkContext using the code below:
> import sys
> sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
> sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
> from pyspark import SparkConf, SparkContext
> sconf = SparkConf()
> sc = SparkContext(conf=sconf)
> print sc
> I got the error below:
> ImportError   Traceback (most recent call last)
> <ipython-input> in <module>()
>   2 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
>   3 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
> > 4 from pyspark import SparkConf, SparkContext
>   5 sconf = SparkConf()
>   6 sc = SparkContext(conf=sconf)
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/__init__.py in <module>()
>  39 
>  40 from pyspark.conf import SparkConf
> ---> 41 from pyspark.context import SparkContext
>  42 from pyspark.rdd import RDD
>  43 from pyspark.files import SparkFiles
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in <module>()
>  26 from tempfile import NamedTemporaryFile
>  27 
> ---> 28 from pyspark import accumulators
>  29 from pyspark.accumulators import Accumulator
>  30 from pyspark.broadcast import Broadcast
> ImportError: cannot import name accumulators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535023#comment-15535023
 ] 

Herman van Hovell commented on SPARK-17728:
---

I think calling {{explain(true)}} on your plans helps to understand what is 
going on.

Spark executes the UDF three times because the optimizer collapses subsequent 
projects (a project normally being much more expensive than a UDF call). In 
your case the three projects get rewritten into one project, and the 
expressions are rewritten in the following form:
{noformat}
structured_information -> fUdf('a)
plus_one -> fUdf('a).get("plusOne")
squared -> fUdf('a).get("squared")
{noformat}

It is a bit tricky to get around this, but the following might work:
{noformat}
val exploded = as
  .withColumn("structured_information", explode(array(fUdf('a))))
  .withColumn("plus_one", 'structured_information("plusOne"))
  .withColumn("squared", 'structured_information("squared"))
{noformat}
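
A self-contained variant of that workaround, runnable as a small local app; the 
case class, sample data, and the fUdf body below are illustrative assumptions, 
not taken from the attached notebook:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, explode, udf}

object UdfOnceSketch {
  // Illustrative stand-in for the structured result returned by the real UDF.
  case class Info(plusOne: Int, squared: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("udf-once-sketch").getOrCreate()
    import spark.implicits._

    val as = Seq(1, 2, 3).toDF("a")

    // A cheap stand-in for the expensive, pure UDF discussed above.
    val fUdf = udf((a: Int) => Info(a + 1, a * a))

    // Wrapping the UDF result in explode(array(...)) keeps the collapsed
    // project from re-evaluating fUdf once per derived column.
    val exploded = as
      .withColumn("structured_information", explode(array(fUdf('a))))
      .withColumn("plus_one", $"structured_information"("plusOne"))
      .withColumn("squared", $"structured_information"("squared"))

    exploded.explain(true) // the optimized plan should contain a single fUdf call
    exploded.show()
    spark.stop()
  }
}
{code}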

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external services 
> from UDFs. The response might not just be a field, but instead a structure of 
> information. When attempting to break out this information, it is critical 
> that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534923#comment-15534923
 ] 

Jacob Eisinger commented on SPARK-17728:


Thanks for the explanation, but I still think this is an issue.

If Spark assumed there were no side effects and optimized accordingly, there 
would be no issue: the UDF would be called once per row (1).  However, Spark 
calls a costly function many times, leading to inefficiency.

In our production code, we have a function that takes in a long string and 
classifies it under a number of different dimensions.  This is a very 
CPU-intensive operation and a pure function.  Obviously, if Spark's optimizer 
calls the function multiple times, this is _not_ optimal in this scenario.

I think it is intuitive to most that the following code would call the UDF once 
per row (1):
{code}
val exploded = as
  .withColumn("structured_information", fUdf('a))
  .withColumn("plus_one", 'structured_information("plusOne"))
  .withColumn("squared", 'structured_information("squared"))
{code}
However, Spark calls the UDF three times per row!  Is this what you would 
expect?  What am I missing?

(1) - "Once per row" - except when the row needs to recomputed such as when 
workers are lost. 
(2) - I attempted to model the long operation via Thread.sleep(); as you 
mentioned this does have a slight side effect.  Maybe I should have summed the 
first billion counting numbers to illustrate the slow down?

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external services 
> from UDFs. The response might not just be a field, but instead a structure of 
> information. When attempting to break out this information, it is critical 
> that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-29 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534910#comment-15534910
 ] 

Xiao Li commented on SPARK-17709:
-

[~ashrowty] Can you share exactly how you load the external table? 

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using a spark.sql() call, it results in an error:
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from <table>")
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same DataFrame is initialized via spark.read.parquet(), the above code 
> works. The same code worked with Spark 1.6.2.
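
A self-contained variant of the snippet above, with an in-memory DataFrame in 
place of the Hive table (so it only mirrors the shape of the failing code, not 
the Hive/Parquet code path that triggers the error; the sample rows are made up):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object JoinResolutionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("join-resolution-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the Hive-backed d1 in the report.
    val d1 = Seq(
      ("k1", "k2", 10.0, 2L),
      ("k1", "k2", 20.0, 4L)
    ).toDF("key1", "key2", "totalprice", "itemcount")

    val df1 = d1.groupBy("key1", "key2").agg(avg("totalprice").as("avgtotalprice"))
    val df2 = d1.groupBy("key1", "key2").agg(avg("itemcount").as("avgqty"))

    // Reported to throw an AnalysisException when d1 comes from spark.sql(...)
    // over a Hive table; it resolves fine for in-memory or parquet-read inputs.
    df1.join(df2, Seq("key1", "key2")).show()
    spark.stop()
  }
}
{code}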



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17731) Metrics for Structured Streaming

2016-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534905#comment-15534905
 ] 

Apache Spark commented on SPARK-17731:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/15307

> Metrics for Structured Streaming
> 
>
> Key: SPARK-17731
> URL: https://issues.apache.org/jira/browse/SPARK-17731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Metrics are needed for monitoring structured streaming apps. Here is the 
> design doc for implementing the necessary metrics.
> https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17741) Grammar to parse top level and nested data fields separately

2016-09-29 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-17741:
---

 Summary: Grammar to parse top level and nested data fields 
separately
 Key: SPARK-17741
 URL: https://issues.apache.org/jira/browse/SPARK-17741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Tejas Patil
Priority: Trivial


Based on a discussion on the dev list:

{noformat}
Is there any reason why Spark SQL supports "<column name>" ":" "<data type>" 
while specifying columns?
e.g. sql("CREATE TABLE t1 (column1:INT)") works fine. 
Here is relevant snippet in the grammar [0]:

```
colType
: identifier ':'? dataType (COMMENT STRING)?
;
```

I do not see MySQL[1], Hive[2], Presto[3] and PostgreSQL [4] supporting ":" 
while specifying columns.
They all use space as a delimiter.

[0] : 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L596
[1] : http://dev.mysql.com/doc/refman/5.7/en/create-table.html
[2] : 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
[3] : https://prestodb.io/docs/current/sql/create-table.html
[4] : https://www.postgresql.org/docs/9.1/static/sql-createtable.html
{noformat}

Herman's response:

{noformat}
This is because we use the same rule to parse top level and nested data fields. 
For example:

create table tbl_x(
  id bigint,
  nested struct<field: dataType>
)

This shows both syntaxes. We should split this rule into a top-level rule and a 
nested rule.
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17697) BinaryLogisticRegressionSummary, GLM Summary should handle non-Double numeric types

2016-09-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17697:
--
Summary: BinaryLogisticRegressionSummary, GLM Summary should handle 
non-Double numeric types  (was: BinaryLogisticRegressionSummary should handle 
non-Double numeric types)

> BinaryLogisticRegressionSummary, GLM Summary should handle non-Double numeric 
> types
> ---
>
> Key: SPARK-17697
> URL: https://issues.apache.org/jira/browse/SPARK-17697
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>
> Say you have a DataFrame with a label column of Integer type.  You can fit a 
> LogisticRegressionModel since LR handles casting to DoubleType internally.
> However, if you call evaluate() on it, then this line does not handle casting 
> properly, so you get a runtime error (MatchError) for an invalid schema: 
> [https://github.com/apache/spark/blob/2cd327ef5e4c3f6b8468ebb2352479a1686b7888/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L863]
> We should handle casting, and test evaluate() with other numeric types.
> We should also check logreg and other algorithms to see whether we can catch 
> the same issue elsewhere.
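
Until evaluate() handles the cast internally, a user-side workaround sketch is 
to cast the label column up front; the default column name "label" is an 
assumption, not taken from the ticket:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast a non-Double numeric label column to DoubleType before handing the
// DataFrame to evaluate()/summary code that pattern-matches on DoubleType.
// The column name default is illustrative only.
def withDoubleLabel(df: DataFrame, labelCol: String = "label"): DataFrame =
  df.withColumn(labelCol, col(labelCol).cast(DoubleType))
{code}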



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17697) BinaryLogisticRegressionSummary, GLM Summary should handle non-Double numeric types

2016-09-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534847#comment-15534847
 ] 

Joseph K. Bradley commented on SPARK-17697:
---

Leaving this open for follow-up to fix the remaining GLM issues.

> BinaryLogisticRegressionSummary, GLM Summary should handle non-Double numeric 
> types
> ---
>
> Key: SPARK-17697
> URL: https://issues.apache.org/jira/browse/SPARK-17697
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>
> Say you have a DataFrame with a label column of Integer type.  You can fit a 
> LogisticRegressionModel since LR handles casting to DoubleType internally.
> However, if you call evaluate() on it, then this line does not handle casting 
> properly, so you get a runtime error (MatchError) for an invalid schema: 
> [https://github.com/apache/spark/blob/2cd327ef5e4c3f6b8468ebb2352479a1686b7888/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L863]
> We should handle casting, and test evaluate() with other numeric types.
> We should also check logreg and other algorithms to see whether we can catch 
> the same issue elsewhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17737) cannot import name accumulators error

2016-09-29 Thread Pruthveej Reddy Kasarla (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534839#comment-15534839
 ] 

Pruthveej Reddy Kasarla commented on SPARK-17737:
-

I am trying to set up a SparkContext.
I just started setting up an IPython notebook.
I got the error message while trying to run the code to set up a SparkContext 
in the IPython notebook.




> cannot import name accumulators error
> -
>
> Key: SPARK-17737
> URL: https://issues.apache.org/jira/browse/SPARK-17737
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
> Environment: unix
> python 2.7
>Reporter: Pruthveej Reddy Kasarla
>
> Hi, I am trying to set up my SparkContext using the code below:
> import sys
> sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
> sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
> from pyspark import SparkConf, SparkContext
> sconf = SparkConf()
> sc = SparkContext(conf=sconf)
> print sc
> I got the error below:
> ImportError   Traceback (most recent call last)
> <ipython-input> in <module>()
>   2 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
>   3 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
> > 4 from pyspark import SparkConf, SparkContext
>   5 sconf = SparkConf()
>   6 sc = SparkContext(conf=sconf)
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/__init__.py in <module>()
>  39 
>  40 from pyspark.conf import SparkConf
> ---> 41 from pyspark.context import SparkContext
>  42 from pyspark.rdd import RDD
>  43 from pyspark.files import SparkFiles
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in <module>()
>  26 from tempfile import NamedTemporaryFile
>  27 
> ---> 28 from pyspark import accumulators
>  29 from pyspark.accumulators import Accumulator
>  30 from pyspark.broadcast import Broadcast
> ImportError: cannot import name accumulators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-17728.
-
Resolution: Not A Problem

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external services 
> from UDFs. The response might not just be a field, but instead a structure of 
> information. When attempting to break out this information, it is critical 
> that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534792#comment-15534792
 ] 

Herman van Hovell commented on SPARK-17728:
---

I am going to close this as not a problem, but feel free to follow up.

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external services 
> from UDFs. The response might not just be a field, but instead a structure of 
> information. When attempting to break out this information, it is critical 
> that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534790#comment-15534790
 ] 

Herman van Hovell commented on SPARK-17728:
---

Spark assumes UDFs are pure functions; we do not guarantee that a function is 
only executed once. This is due to the way the optimizer works, and the fact 
that stages are sometimes retried. We could add a flag to UDFs to prevent this 
from happening, but this would be a considerable engineering effort.

The example you give is not really a pure function, as its side effect makes 
the thread stop (it changes state). 

If you are connecting to an external service, then I would suggest using 
{{Dataset.mapPartitions(...)}} (similar to a generator). This will allow you to 
set up one connection per partition, and you can call a method as much or as 
little as you like.
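
A minimal sketch of that mapPartitions pattern, with a made-up ServiceClient 
standing in for the external service:

{code}
import org.apache.spark.sql.SparkSession

object MapPartitionsSketch {
  // Purely illustrative client; a real one would open a network connection.
  class ServiceClient {
    def classify(s: String): String = s.toUpperCase
    def close(): Unit = ()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("map-partitions-sketch").getOrCreate()
    import spark.implicits._

    val ds = Seq("a", "b", "c").toDS()

    val classified = ds.mapPartitions { rows =>
      val client = new ServiceClient()             // one client per partition
      val out = rows.map(client.classify).toList   // one call per row
      client.close()
      out.iterator
    }

    classified.show()
    spark.stop()
  }
}
{code}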


> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Longer-running processes might run analytics or contact external services 
> from UDFs. The response might not just be a field, but instead a structure of 
> information. When attempting to break out this information, it is critical 
> that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: (was: executor-side-broadcast.pdf)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.
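
For reference, a small sketch of how a broadcast join is requested today via the 
driver-side mechanism this ticket wants to avoid (table contents are made up):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object DriverSideBroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("driver-side-broadcast").getOrCreate()
    import spark.implicits._

    val big   = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
    val small = Seq((1, "x"), (2, "y")).toDF("id", "w")

    // The small side is collected on the driver and broadcast to executors;
    // the proposal here is to build that broadcast on the executors instead.
    big.join(broadcast(small), "id").explain() // look for BroadcastHashJoin
    spark.stop()
  }
}
{code}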



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: executor-side-broadcast.pdf

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: executor-side-broadcast.pdf

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: (was: executor-side-broadcast.pdf)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: (was: executor-side-broadcast.pdf)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: executor-side-broadcast.pdf

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534774#comment-15534774
 ] 

Liang-Chi Hsieh commented on SPARK-17556:
-

Updated the design document to add more description of how to use this feature 
and its new config.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: (was: executor-side-broadcast.pdf)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-29 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: executor-side-broadcast.pdf

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf, 
> executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17740) Spark tests should mock / interpose HDFS to ensure that streams are closed

2016-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534730#comment-15534730
 ] 

Apache Spark commented on SPARK-17740:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/15306

> Spark tests should mock / interpose HDFS to ensure that streams are closed
> --
>
> Key: SPARK-17740
> URL: https://issues.apache.org/jira/browse/SPARK-17740
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Reporter: Eric Liang
>
> As a followup to SPARK-17666, we should add a test to ensure filesystem 
> connections are not leaked at least in unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17740) Spark tests should mock / interpose HDFS to ensure that streams are closed

2016-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17740:


Assignee: (was: Apache Spark)

> Spark tests should mock / interpose HDFS to ensure that streams are closed
> --
>
> Key: SPARK-17740
> URL: https://issues.apache.org/jira/browse/SPARK-17740
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Reporter: Eric Liang
>
> As a followup to SPARK-17666, we should add a test to ensure filesystem 
> connections are not leaked at least in unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17740) Spark tests should mock / interpose HDFS to ensure that streams are closed

2016-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17740:


Assignee: Apache Spark

> Spark tests should mock / interpose HDFS to ensure that streams are closed
> --
>
> Key: SPARK-17740
> URL: https://issues.apache.org/jira/browse/SPARK-17740
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> As a followup to SPARK-17666, we should add a test to ensure filesystem 
> connections are not leaked at least in unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17740) Spark tests should mock / interpose HDFS to ensure that streams are closed

2016-09-29 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17740:
--

 Summary: Spark tests should mock / interpose HDFS to ensure that 
streams are closed
 Key: SPARK-17740
 URL: https://issues.apache.org/jira/browse/SPARK-17740
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Reporter: Eric Liang


As a followup to SPARK-17666, we should add a test to ensure filesystem 
connections are not leaked at least in unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17717) Add existence checks to user facing catalog

2016-09-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17717.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Add existence checks to user facing catalog
> ---
>
> Key: SPARK-17717
> URL: https://issues.apache.org/jira/browse/SPARK-17717
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.1.0
>
>
> It is currently quite cumbersome if an object exists using the (user facing) 
> Catalog. Lets add a few methods for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-09-29 Thread Jo Desmet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534603#comment-15534603
 ] 

Jo Desmet edited comment on SPARK-15343 at 9/30/16 12:53 AM:
-

I think this issue has not been properly addressed *with* Spark, and should be 
reopened.
We are bound by current versions of Hadoop, and cannot simply ignore users 
running on top of a Yarn framework. Compatibility with Yarn should take 
precedence over unleashing new features. Coordination and compatibility with 
Yarn/Hadoop is paramount. We are already building against specific versions of 
Hadoop; this should be the case here too, and once the opportunity arises, we 
can start supporting Jersey 2 *by choice*. The base of existing installed Hadoop 
frameworks is just too big.


was (Author: jo_des...@yahoo.com):
I think this issue has not been properly addressed *with* Spark, and should be 
reopened.
We are bound by current versions of Hadoop, and can just not simply ignore 
users running on top of a Yarn framework. Compatibility with Yarn should take 
precedence over unleashing new features. Coordination and compatibility with 
Yarn/Hadoop is paramount. What possibly could happen is pushing the hadoop 
libraries for jersey in a different namespace - a custom repackaging of the 
library. But I guess once you start that, you can as well up the Jersey version.
We are already building against specific versions of Hadoop, this should be the 
case here too, and once the opportunity arises, we can start supporting Jersey 
2, but Now is not the time.

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:502)
> 

[jira] [Comment Edited] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-09-29 Thread Jo Desmet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534603#comment-15534603
 ] 

Jo Desmet edited comment on SPARK-15343 at 9/30/16 12:51 AM:
-

I think this issue has not been properly addressed *with* Spark, and should be 
reopened.
We are bound by current versions of Hadoop, and cannot simply ignore users 
running on top of a Yarn framework. Compatibility with Yarn should take 
precedence over unleashing new features. Coordination and compatibility with 
Yarn/Hadoop is paramount. What possibly could happen is pushing the Hadoop 
libraries for Jersey into a different namespace - a custom repackaging of the 
library. But I guess once you start that, you might as well bump the Jersey 
version.
We are already building against specific versions of Hadoop; this should be the 
case here too, and once the opportunity arises, we can start supporting Jersey 
2, but now is not the time.


was (Author: jo_des...@yahoo.com):
I think this issue has not been properly addressed, and should be reopened.
We are bound by current versions of Hadoop, and can just not simply ignore 
users running on top of a Yarn framework. Compatibility with Yarn should take 
precedence over unleashing new features. Coordination and compatibility with 
Yarn/Hadoop is paramount. What possibly could happen is pushing the hadoop 
libraries for jersey in a different namespace - a custom repackaging of the 
library. But I guess once you start that, you can as well up the Jersey version.
We are already building against specific versions of Hadoop, this should be the 
case here too, and once the opportunity arises, we can start supporting Jersey 
2, but Now is not the time.

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> 

[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-09-29 Thread Jo Desmet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534607#comment-15534607
 ] 

Jo Desmet commented on SPARK-15343:
---

Hadoop YARN is not 'just' a third-party framework. It is an important framework 
to run Spark on. Compatibility is paramount. 

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at 

[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-09-29 Thread Jo Desmet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534603#comment-15534603
 ] 

Jo Desmet commented on SPARK-15343:
---

I think this issue has not been properly addressed, and should be reopened.
We are bound by current versions of Hadoop, and cannot simply ignore users 
running on top of a Yarn framework. Compatibility with Yarn should take 
precedence over unleashing new features. Coordination and compatibility with 
Yarn/Hadoop is paramount. What possibly could happen is pushing the Hadoop 
libraries for Jersey into a different namespace - a custom repackaging of the 
library. But I guess once you start that, you might as well bump the Jersey 
version.
We are already building against specific versions of Hadoop; this should be the 
case here too, and once the opportunity arises, we can start supporting Jersey 
2, but now is not the time.

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> 

[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Jacob Eisinger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534583#comment-15534583
 ] 

Jacob Eisinger commented on SPARK-17728:


I am a little confused.
# Could you explain how a generator would apply here?
# You mentioned that UDFs should be pure functions.  Is Spark optimizing the 
function calls as if they are pure functions?

(Also, please check out my example --- the UDF there _should_ be a pure 
function.)
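
For concreteness, here is a minimal sketch of the pattern in question; the 
column names, struct fields, and the sleep are illustrative stand-ins, not the 
attached notebook:

{code}
import org.apache.spark.sql.functions._
import spark.implicits._

// A pure UDF that nevertheless does non-trivial work per call and returns a struct.
case class Result(a: Long, b: Long)
val expensive = udf { id: Long =>
  Thread.sleep(10)          // stand-in for analytics or an external service call
  Result(id * 2, id * 3)
}

val withStruct = spark.range(100).withColumn("res", expensive($"id"))

// Breaking the struct out into separate columns; per the report, the UDF can be
// evaluated once per extracted column instead of once per row.
withStruct.select($"id", $"res.a".as("a"), $"res.b".as("b")).count()

// Workaround from the description: cache right after the UDF is applied.
withStruct.cache()
withStruct.select($"id", $"res.a", $"res.b").count()
{code}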

> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Consider longer-running processes that might run analytics or contact external 
> services from UDFs. The response might not just be a single field, but instead a 
> structure of information. When attempting to break out this information, it 
> is critical that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-29 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534518#comment-15534518
 ] 

Ashish Shrowty commented on SPARK-17709:


Dilip,

I tried your code and it works on my end too. The error only occurs when I try 
to load an external table stored as Parquet (in my case it is stored in S3). 
Attaching the stack trace in case it helps (this time I tried a different table, 
hence the difference in column names) -

org.apache.spark.sql.AnalysisException: using columns ['productid] can not be 
resolved given input columns: [productid, name1, name2] ;
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT

2016-09-29 Thread Alexander Temerev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534475#comment-15534475
 ] 

Alexander Temerev commented on SPARK-12666:
---

If you came here for a workaround for 2.0.0 (like I did), here it is: publish 
your artifacts using sbt publishM2 instead of publish-local. This way they will 
end up in the local Maven (not Ivy2) repository, which works fine.
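
For reference, the two steps look roughly like this (run publishM2 from the 
spark-redshift checkout, then launch the shell from the Spark directory; the 
coordinates are the ones from the original report):

{noformat}
sbt publishM2
./bin/spark-shell --packages com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT
{noformat}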

> spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
> 
>
> Key: SPARK-12666
> URL: https://issues.apache.org/jira/browse/SPARK-12666
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Bryan Cutler
> Fix For: 2.0.1, 2.1.0
>
>
> Symptom:
> I cloned the latest master of {{spark-redshift}}, then used {{sbt 
> publishLocal}} to publish it to my Ivy cache. When I tried running 
> {{./bin/spark-shell --packages 
> com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency 
> into {{spark-shell}}, I received the following cryptic error:
> {code}
> Exception in thread "main" java.lang.RuntimeException: [unresolved 
> dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration 
> not found in com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It 
> was required from org.apache.spark#spark-submit-parent;1.0 default]
>   at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009)
>   at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> I think the problem here is that Spark is declaring a dependency on the 
> spark-redshift artifact using the {{default}} Ivy configuration. Based on my 
> admittedly limited understanding of Ivy, the default configuration will be 
> the only configuration defined in an Ivy artifact if that artifact defines no 
> other configurations. Thus, for Maven artifacts I think the default 
> configuration will end up mapping to Maven's regular JAR dependency (i.e. 
> Maven artifacts don't declare Ivy configurations so they implicitly have the 
> {{default}} configuration) but for Ivy artifacts I think we can run into 
> trouble when loading artifacts which explicitly define their own 
> configurations, since those artifacts might not have a configuration named 
> {{default}}.
> I spent a bit of time playing around with the SparkSubmit code to see if I 
> could fix this but wasn't able to completely resolve the issue.
> /cc [~brkyvz] (ping me offline and I can walk you through the repo in person, 
> if you'd like)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-29 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534434#comment-15534434
 ] 

Dilip Biswal commented on SPARK-17709:
--

[~smilegator] Sure.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-29 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534429#comment-15534429
 ] 

Xiao Li commented on SPARK-17709:
-

Can you try it in the latest 2.0?

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract

2016-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17738:


Assignee: Apache Spark  (was: Davies Liu)

> Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP 
> append/extract
> --
>
> Key: SPARK-17738
> URL: https://issues.apache.org/jira/browse/SPARK-17738
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract

2016-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17738:


Assignee: Davies Liu  (was: Apache Spark)

> Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP 
> append/extract
> --
>
> Key: SPARK-17738
> URL: https://issues.apache.org/jira/browse/SPARK-17738
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract

2016-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534426#comment-15534426
 ] 

Apache Spark commented on SPARK-17738:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/15305

> Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP 
> append/extract
> --
>
> Key: SPARK-17738
> URL: https://issues.apache.org/jira/browse/SPARK-17738
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-29 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534417#comment-15534417
 ] 

Dilip Biswal commented on SPARK-17709:
--

[~smilegator] Hi Sean, I tried it on my master branch and don't see the 
exception.

{code}
test("join issue") {
   withTable("tbl") {
 sql("CREATE TABLE tbl(key1 int, key2 int, totalprice int, itemcount int)")
 sql("insert into tbl values (1, 1, 1, 1)")
 val d1 = sql("select * from tbl")
 val df1 = d1.groupBy("key1","key2")
   .agg(avg("totalprice").as("avgtotalprice"))
 val df2 = d1.groupBy("key1","key2")
   .agg(avg("itemcount").as("avgqty"))
 df1.join(df2, Seq("key1","key2")).show()
   }
 }

Output

+++-+--+
|key1|key2|avgtotalprice|avgqty|
+++-+--+
|   1|   1|  1.0|   1.0|
+++-+--+
{code}



> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17737) cannot import name accumulators error

2016-09-29 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534383#comment-15534383
 ] 

Bryan Cutler commented on SPARK-17737:
--

What exactly are you trying to do?  The recommended way to create an 
Accumulator in PySpark is through the SparkContext as follows:

{noformat}
Using Python version 2.7.12 (default, Jul  1 2016 15:12:24)
SparkSession available as 'spark'.
>>> spark.sparkContext.accumulator(0)
>>> ac = spark.sparkContext.accumulator(0)
>>> ac

[jira] [Commented] (SPARK-17721) Erroneous computation in multiplication of transposed SparseMatrix with SparseVector

2016-09-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534375#comment-15534375
 ] 

Joseph K. Bradley commented on SPARK-17721:
---

OK I did an audit, and this will not have affected any algorithms in 2.0 or 
before.  But it will affect sparse logistic regression in 2.1!  Thanks for 
finding this bug.

If users have called Matrix.multiply directly, then they could be affected.
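
For anyone wanting to check whether their own code is affected, the snippet 
from the description doubles as a quick test; as noted there, densifying the 
vector sidesteps the broken transposed-sparse path until the fix is picked up:

{code:java}
import org.apache.spark.mllib.linalg._

val A = Matrices.dense(3, 2, Array[Double](0, 2, 1, 1, 2, 0))
  .asInstanceOf[DenseMatrix].toSparse.transpose
val b = Vectors.sparse(3, Seq[(Int, Double)]((1, 2), (2, 1))).asInstanceOf[SparseVector]

A.multiply(b)         // affected versions return the incorrect [5.0,0.0]
A.multiply(b.toDense) // returns the correct [5.0,4.0]
{code}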

> Erroneous computation in multiplication of transposed SparseMatrix with 
> SparseVector
> 
>
> Key: SPARK-17721
> URL: https://issues.apache.org/jira/browse/SPARK-17721
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0
> Environment: Verified on OS X with Spark 1.6.1 and on Databricks 
> running Spark 1.6.1
>Reporter: Bjarne Fruergaard
>Assignee: Bjarne Fruergaard
>Priority: Critical
>  Labels: correctness
> Fix For: 2.0.2, 2.1.0
>
>
> There is a bug in how a transposed SparseMatrix (isTransposed=true) does 
> multiplication with a SparseVector. The bug is present (for v. > 2.0.0) in 
> both org.apache.spark.mllib.linalg.BLAS (mllib) and 
> org.apache.spark.ml.linalg.BLAS (mllib-local) in the private gemv method with 
> signature:
> bq. gemv(alpha: Double, A: SparseMatrix, x: SparseVector, beta: Double, y: 
> DenseVector).
> This bug can be verified by running the following snippet in a Spark shell 
> (here using v1.6.1):
> {code:java}
> import com.holdenkarau.spark.testing.SharedSparkContext
> import org.apache.spark.mllib.linalg._
> val A = Matrices.dense(3, 2, Array[Double](0, 2, 1, 1, 2, 
> 0)).asInstanceOf[DenseMatrix].toSparse.transpose
> val b = Vectors.sparse(3, Seq[(Int, Double)]((1, 2), (2, 
> 1))).asInstanceOf[SparseVector]
> A.multiply(b)
> A.multiply(b.toDense)
> {code}
> The first multiply with the SparseMatrix returns the incorrect result:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,0.0]
> {code}
> whereas the correct result is returned by the second multiply:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,4.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17412) FsHistoryProviderSuite fails if `root` user runs it

2016-09-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17412.

   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.1.0

> FsHistoryProviderSuite fails if `root` user runs it
> ---
>
> Key: SPARK-17412
> URL: https://issues.apache.org/jira/browse/SPARK-17412
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, Tests
>Affects Versions: 1.6.1, 2.0.0
> Environment: jdk1.7, maven-3.3.9
>Reporter: Amita Chaudhary
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.1.0
>
>
> Failed at Spark Project Core :
> FsHistoryProviderSuite:
> - Parse new and old application logs
> - Parse legacy logs with compression codec set
> - SPARK-3697: ignore directories that cannot be read. *** FAILED ***
>   2 was not equal to 1 (FsHistoryProviderSuite.scala:198)
> - sendWithReply: remotely error
> - network events *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 800 times 
> over 5.004326759 seconds. Last failure message: 2 did not equal 1. 
> (RpcEnvSuite.scala:517)
> - network events between non-client-mode RpcEnvs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17412) FsHistoryProviderSuite fails if `root` user runs it

2016-09-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-17412:
---
Component/s: Documentation

> FsHistoryProviderSuite fails if `root` user runs it
> ---
>
> Key: SPARK-17412
> URL: https://issues.apache.org/jira/browse/SPARK-17412
> Project: Spark
>  Issue Type: Test
>  Components: Documentation, Tests
>Affects Versions: 1.6.1, 2.0.0
> Environment: jdk1.7, maven-3.3.9
>Reporter: Amita Chaudhary
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.1.0
>
>
> Failed at Spark Project Core :
> FsHistoryProviderSuite:
> - Parse new and old application logs
> - Parse legacy logs with compression codec set
> - SPARK-3697: ignore directories that cannot be read. *** FAILED ***
>   2 was not equal to 1 (FsHistoryProviderSuite.scala:198)
> - sendWithReply: remotely error
> - network events *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 800 times 
> over 5.004326759 seconds. Last failure message: 2 did not equal 1. 
> (RpcEnvSuite.scala:517)
> - network events between non-client-mode RpcEnvs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-29 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534338#comment-15534338
 ] 

Xiao Li commented on SPARK-17709:
-

Let me try to reproduce it. Thanks!

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17739) Collapse adjacent similar Window operators

2016-09-29 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17739:
--
Summary: Collapse adjacent similar Window operators  (was: Collapse 
adjacent similar Window operations.)

> Collapse adjacent similar Window operators
> --
>
> Key: SPARK-17739
> URL: https://issues.apache.org/jira/browse/SPARK-17739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>
> Spark currently does not collapse adjacent windows with the same partitioning 
> and (similar) sorting. For example:
> {noformat}
> val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as 
> "col1", rand() as "col2")
> // Add summary statistics for all columns
> import org.apache.spark.sql.expressions.Window
> val cols = Seq("id", "col1", "col2")
> val window = Window.partitionBy($"grp").orderBy($"id")
> val result = cols.foldLeft(df) { (base, name) =>
>   base.withColumn(s"${name}_avg", avg(col(name)).over(window))
>   .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
>   .withColumn(s"${name}_min", min(col(name)).over(window))
>   .withColumn(s"${name}_max", max(col(name)).over(window))
> }
> {noformat}
> Leads to following plan:
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*, max('col2) windowspecdefinition('grp, 'id ASC NULLS FIRST, 
> UnspecifiedFrame) AS col2_max#10313]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295, col2_min#10295]
>   +- Window [min(col2#10098) windowspecdefinition(grp#10096L, id#10093L 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> col2_min#10295], [grp#10096L], [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_stddev#10270]
>   +- Window [stddev_samp(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#10270], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
> +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
>+- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246, col2_avg#10246]
>   +- Window [avg(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#10246], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
> +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
>+- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231, col1_max#10231]

[jira] [Created] (SPARK-17739) Collapse adjacent similar Window operations.

2016-09-29 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-17739:
-

 Summary: Collapse adjacent similar Window operations.
 Key: SPARK-17739
 URL: https://issues.apache.org/jira/browse/SPARK-17739
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Herman van Hovell


Spark currently does not collapse adjacent windows with the same partitioning 
and (similar) sorting. For example:
{noformat}
val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as 
"col1", rand() as "col2")

// Add summary statistics for all columns
import org.apache.spark.sql.expressions.Window
val cols = Seq("id", "col1", "col2")
val window = Window.partitionBy($"grp").orderBy($"id")
val result = cols.foldLeft(df) { (base, name) =>
  base.withColumn(s"${name}_avg", avg(col(name)).over(window))
  .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
  .withColumn(s"${name}_min", min(col(name)).over(window))
  .withColumn(s"${name}_max", max(col(name)).over(window))
}
{noformat}
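
For comparison, the same statistics can also be computed in a single select 
over the one shared window, which should end up in a single Window operator 
instead of one per withColumn call (sketch only, reusing the names defined 
above):

{noformat}
// Sketch: build all the window columns first, then select them in one go.
import org.apache.spark.sql.functions._
val statCols = cols.flatMap { name =>
  Seq(
    avg(col(name)).over(window).as(s"${name}_avg"),
    stddev(col(name)).over(window).as(s"${name}_stddev"),
    min(col(name)).over(window).as(s"${name}_min"),
    max(col(name)).over(window).as(s"${name}_max"))
}
val singleWindow = df.select(Seq($"grp", $"id", $"col1", $"col2") ++ statCols: _*)
{noformat}

The foldLeft version above is the one that leads to the plan below.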

Leads to following plan:
{noformat}
== Parsed Logical Plan ==
'Project [*, max('col2) windowspecdefinition('grp, 'id ASC NULLS FIRST, 
UnspecifiedFrame) AS col2_max#10313]
+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
col2_stddev#10270, col2_min#10295]
   +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
col2_stddev#10270, col2_min#10295, col2_min#10295]
  +- Window [min(col2#10098) windowspecdefinition(grp#10096L, id#10093L ASC 
NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
col2_min#10295], [grp#10096L], [id#10093L ASC NULLS FIRST]
 +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
col2_stddev#10270]
+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
col2_stddev#10270]
   +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
col2_stddev#10270, col2_stddev#10270]
  +- Window [stddev_samp(col2#10098) 
windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#10270], [grp#10096L], 
[id#10093L ASC NULLS FIRST]
 +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246]
+- Project [grp#10096L, id#10093L, col1#10097, 
col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
col2_avg#10246]
   +- Project [grp#10096L, id#10093L, col1#10097, 
col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
col2_avg#10246, col2_avg#10246]
  +- Window [avg(col2#10098) 
windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#10246], [grp#10096L], 
[id#10093L ASC NULLS FIRST]
 +- Project [grp#10096L, id#10093L, col1#10097, 
col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231]
+- Project [grp#10096L, id#10093L, 
col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
col1_max#10231]
   +- Project [grp#10096L, id#10093L, 
col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
col1_max#10231, col1_max#10231]
  +- Window [max(col1#10097) 
windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_max#10231], [grp#10096L], 
[id#10093L ASC NULLS FIRST]
 +- Project [grp#10096L, id#10093L, 
col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
id_max#10165L, 

[jira] [Updated] (SPARK-17721) Erroneous computation in multiplication of transposed SparseMatrix with SparseVector

2016-09-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17721:
--
Assignee: Bjarne Fruergaard

> Erroneous computation in multiplication of transposed SparseMatrix with 
> SparseVector
> 
>
> Key: SPARK-17721
> URL: https://issues.apache.org/jira/browse/SPARK-17721
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0
> Environment: Verified on OS X with Spark 1.6.1 and on Databricks 
> running Spark 1.6.1
>Reporter: Bjarne Fruergaard
>Assignee: Bjarne Fruergaard
>Priority: Critical
>  Labels: correctness
> Fix For: 2.0.2, 2.1.0
>
>
> There is a bug in how a transposed SparseMatrix (isTransposed=true) does 
> multiplication with a SparseVector. The bug is present (for v. > 2.0.0) in 
> both org.apache.spark.mllib.linalg.BLAS (mllib) and 
> org.apache.spark.ml.linalg.BLAS (mllib-local) in the private gemv method with 
> signature:
> bq. gemv(alpha: Double, A: SparseMatrix, x: SparseVector, beta: Double, y: 
> DenseVector).
> This bug can be verified by running the following snippet in a Spark shell 
> (here using v1.6.1):
> {code:java}
> import com.holdenkarau.spark.testing.SharedSparkContext
> import org.apache.spark.mllib.linalg._
> val A = Matrices.dense(3, 2, Array[Double](0, 2, 1, 1, 2, 
> 0)).asInstanceOf[DenseMatrix].toSparse.transpose
> val b = Vectors.sparse(3, Seq[(Int, Double)]((1, 2), (2, 
> 1))).asInstanceOf[SparseVector]
> A.multiply(b)
> A.multiply(b.toDense)
> {code}
> The first multiply with the SparseMatrix returns the incorrect result:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,0.0]
> {code}
> whereas the correct result is returned by the second multiply:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,4.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract

2016-09-29 Thread Davies Liu (JIRA)
Davies Liu created SPARK-17738:
--

 Summary: Flaky test: 
org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
 Key: SPARK-17738
 URL: https://issues.apache.org/jira/browse/SPARK-17738
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu
Assignee: Davies Liu


https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17721) Erroneous computation in multiplication of transposed SparseMatrix with SparseVector

2016-09-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17721:
--
Fix Version/s: 2.1.0
   2.0.2

> Erroneous computation in multiplication of transposed SparseMatrix with 
> SparseVector
> 
>
> Key: SPARK-17721
> URL: https://issues.apache.org/jira/browse/SPARK-17721
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0
> Environment: Verified on OS X with Spark 1.6.1 and on Databricks 
> running Spark 1.6.1
>Reporter: Bjarne Fruergaard
>Priority: Critical
>  Labels: correctness
> Fix For: 2.0.2, 2.1.0
>
>
> There is a bug in how a transposed SparseMatrix (isTransposed=true) does 
> multiplication with a SparseVector. The bug is present (for v. > 2.0.0) in 
> both org.apache.spark.mllib.linalg.BLAS (mllib) and 
> org.apache.spark.ml.linalg.BLAS (mllib-local) in the private gemv method with 
> signature:
> bq. gemv(alpha: Double, A: SparseMatrix, x: SparseVector, beta: Double, y: 
> DenseVector).
> This bug can be verified by running the following snippet in a Spark shell 
> (here using v1.6.1):
> {code:java}
> import com.holdenkarau.spark.testing.SharedSparkContext
> import org.apache.spark.mllib.linalg._
> val A = Matrices.dense(3, 2, Array[Double](0, 2, 1, 1, 2, 
> 0)).asInstanceOf[DenseMatrix].toSparse.transpose
> val b = Vectors.sparse(3, Seq[(Int, Double)]((1, 2), (2, 
> 1))).asInstanceOf[SparseVector]
> A.multiply(b)
> A.multiply(b.toDense)
> {code}
> The first multiply with the SparseMatrix returns the incorrect result:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,0.0]
> {code}
> whereas the correct result is returned by the second multiply:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,4.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17676) FsHistoryProvider should ignore hidden files

2016-09-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17676.

   Resolution: Fixed
Fix Version/s: 2.1.0

> FsHistoryProvider should ignore hidden files
> 
>
> Key: SPARK-17676
> URL: https://issues.apache.org/jira/browse/SPARK-17676
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.1.0
>
>
> FsHistoryProvider currently reads hidden files (beginning with ".") from the 
> log dir.  However, it is writing a hidden file *itself* to that dir, which 
> cannot be parsed, as part of a trick to find the scan time according to the 
> file system:
> {code}
> val fileName = "." + UUID.randomUUID().toString
> val path = new Path(logDir, fileName)
> val fos = fs.create(path)
> {code}
> It does delete the tmp file immediately, but we've seen cases where that race 
> ends badly, and there is a logged error.  The error is harmless (the log file 
> is ignored and spark moves on to the other log files), but the logged error 
> is very confusing for users, so we should avoid it.
> {noformat}
> 2016-09-26 09:10:03,016 ERROR 
> org.apache.spark.deploy.history.FsHistoryProvider: Exception encountered when 
> attempting to load application log 
> hdfs://XXX/user/spark/applicationHistory/.3a5e987c-ace5-4568-9ccd-6285010e399a
>  
> java.lang.IllegalArgumentException: Codec 
> [3a5e987c-ace5-4568-9ccd-6285010e399a] is not available. Consider setting 
> spark.io.compression.codec=lzf 
> at 
> org.apache.spark.io.CompressionCodec$$anonfun$createCodec$1.apply(CompressionCodec.scala:72)
>  
> at 
> org.apache.spark.io.CompressionCodec$$anonfun$createCodec$1.apply(CompressionCodec.scala:72)
>  
> at scala.Option.getOrElse(Option.scala:120) 
> at 
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:72) 
> at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$8$$anonfun$apply$1.apply(EventLoggingListener.scala:309)
>  
> at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$8$$anonfun$apply$1.apply(EventLoggingListener.scala:309)
>  
> at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189) 
> at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) 
> at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$8.apply(EventLoggingListener.scala:309)
>  
> at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$8.apply(EventLoggingListener.scala:308)
>  
> at scala.Option.map(Option.scala:145) 
> at 
> org.apache.spark.scheduler.EventLoggingListener$.openEventLog(EventLoggingListener.scala:308)
>  
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:518)
>  
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:359)
>  
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:356)
>  
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) 
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) 
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:356)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$4.run(FsHistoryProvider.scala:277)
>  
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
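
The change presumably amounts to skipping dot-prefixed entries when scanning 
the log directory. A minimal sketch of such a filter using the Hadoop 
FileSystem API (an illustration, not the actual patch):

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// List candidate event logs, ignoring hidden (dot-prefixed) files such as the
// provider's own temporary scan-time marker.
def visibleLogs(fs: FileSystem, logDir: String) =
  fs.listStatus(new Path(logDir)).filter(s => !s.getPath.getName.startsWith("."))
{code}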



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17721) Erroneous computation in multiplication of transposed SparseMatrix with SparseVector

2016-09-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17721:
--
Target Version/s: 1.5.3, 1.6.3, 2.0.2, 2.1.0

> Erroneous computation in multiplication of transposed SparseMatrix with 
> SparseVector
> 
>
> Key: SPARK-17721
> URL: https://issues.apache.org/jira/browse/SPARK-17721
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0
> Environment: Verified on OS X with Spark 1.6.1 and on Databricks 
> running Spark 1.6.1
>Reporter: Bjarne Fruergaard
>Priority: Critical
>  Labels: correctness
>
> There is a bug in how a transposed SparseMatrix (isTransposed=true) does 
> multiplication with a SparseVector. The bug is present (for v. > 2.0.0) in 
> both org.apache.spark.mllib.linalg.BLAS (mllib) and 
> org.apache.spark.ml.linalg.BLAS (mllib-local) in the private gemv method with 
> signature:
> bq. gemv(alpha: Double, A: SparseMatrix, x: SparseVector, beta: Double, y: 
> DenseVector).
> This bug can be verified by running the following snippet in a Spark shell 
> (here using v1.6.1):
> {code:java}
> import com.holdenkarau.spark.testing.SharedSparkContext
> import org.apache.spark.mllib.linalg._
> val A = Matrices.dense(3, 2, Array[Double](0, 2, 1, 1, 2, 
> 0)).asInstanceOf[DenseMatrix].toSparse.transpose
> val b = Vectors.sparse(3, Seq[(Int, Double)]((1, 2), (2, 
> 1))).asInstanceOf[SparseVector]
> A.multiply(b)
> A.multiply(b.toDense)
> {code}
> The first multiply with the SparseMatrix returns the incorrect result:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,0.0]
> {code}
> whereas the correct result is returned by the second multiply:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,4.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17721) Erroneous computation in multiplication of transposed SparseMatrix with SparseVector

2016-09-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17721:
--
Affects Version/s: (was: 1.6.1)
   (was: 1.4.0)
   1.4.1
   1.5.2
   1.6.2

> Erroneous computation in multiplication of transposed SparseMatrix with 
> SparseVector
> 
>
> Key: SPARK-17721
> URL: https://issues.apache.org/jira/browse/SPARK-17721
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0
> Environment: Verified on OS X with Spark 1.6.1 and on Databricks 
> running Spark 1.6.1
>Reporter: Bjarne Fruergaard
>Priority: Critical
>  Labels: correctness
>
> There is a bug in how a transposed SparseMatrix (isTransposed=true) does 
> multiplication with a SparseVector. The bug is present (for v. > 2.0.0) in 
> both org.apache.spark.mllib.linalg.BLAS (mllib) and 
> org.apache.spark.ml.linalg.BLAS (mllib-local) in the private gemv method with 
> signature:
> bq. gemv(alpha: Double, A: SparseMatrix, x: SparseVector, beta: Double, y: 
> DenseVector).
> This bug can be verified by running the following snippet in a Spark shell 
> (here using v1.6.1):
> {code:java}
> import com.holdenkarau.spark.testing.SharedSparkContext
> import org.apache.spark.mllib.linalg._
> val A = Matrices.dense(3, 2, Array[Double](0, 2, 1, 1, 2, 
> 0)).asInstanceOf[DenseMatrix].toSparse.transpose
> val b = Vectors.sparse(3, Seq[(Int, Double)]((1, 2), (2, 
> 1))).asInstanceOf[SparseVector]
> A.multiply(b)
> A.multiply(b.toDense)
> {code}
> The first multiply with the SparseMatrix returns the incorrect result:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,0.0]
> {code}
> whereas the correct result is returned by the second multiply:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,4.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17612) Support `DESCRIBE table PARTITION` SQL syntax

2016-09-29 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17612.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.1.0

Resolved per Dongjoon's PR

> Support `DESCRIBE table PARTITION` SQL syntax
> -
>
> Key: SPARK-17612
> URL: https://issues.apache.org/jira/browse/SPARK-17612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> This issue implements the `DESC PARTITION` SQL syntax again; it was dropped in 
> Spark 2.0.0.
> h4. Spark 2.0.0
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res1: org.apache.spark.sql.DataFrame = []
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> org.apache.spark.sql.catalyst.parser.ParseException:
> Unsupported SQL statement
> == SQL ==
> DESC partitioned_table PARTITION (c='Us', d=1)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:58)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 48 elided
> {code}
> h4. Spark 1.6.2
> {code}
> scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
> (c STRING, d STRING)")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
> res2: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
> 16/09/20 12:48:36 WARN LazyStruct: Extra bytes detected at the end of the 
> row! Ignoring similar problems.
> +-----------------------------------+
> |result                             |
> +-----------------------------------+
> |a                      string      |
> |b                      int         |
> |c                      string      |
> |d                      string      |
> |                                   |
> |# Partition Information            |
> |# col_name    data_type    comment |
> |                                   |
> |c                      string      |
> |d                      string      |
> +-----------------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17725) Spark should not write out parquet files with schema containing non-nullable fields

2016-09-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-17725.
---
Resolution: Later

> Spark should not write out parquet files with schema containing non-nullable 
> fields
> ---
>
> Key: SPARK-17725
> URL: https://issues.apache.org/jira/browse/SPARK-17725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> Since Spark 1.3, after PR https://github.com/apache/spark/pull/4826, Spark 
> SQL has always set all schema fields to nullable before writing out parquet 
> files, to make the data pipeline more robust.
> However, this behaviour was accidentally changed in 2.0 by PR 
> https://github.com/apache/spark/pull/11509



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17725) Spark should not write out parquet files with schema containing non-nullable fields

2016-09-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17725:

Target Version/s:   (was: 2.0.1)

> Spark should not write out parquet files with schema containing non-nullable 
> fields
> ---
>
> Key: SPARK-17725
> URL: https://issues.apache.org/jira/browse/SPARK-17725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> Since Spark 1.3, after PR https://github.com/apache/spark/pull/4826, Spark 
> SQL has always set all schema fields to nullable before writing out parquet 
> files, to make the data pipeline more robust.
> However, this behaviour was accidentally changed in 2.0 by PR 
> https://github.com/apache/spark/pull/11509



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17725) Spark should not write out parquet files with schema containing non-nullable fields

2016-09-29 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534238#comment-15534238
 ] 

Dongjoon Hyun commented on SPARK-17725:
---

After sending the RC4 voting email, I found that this issue is targeted at '2.0.1'.
Does this mean RC5?

> Spark should not write out parquet files with schema containing non-nullable 
> fields
> ---
>
> Key: SPARK-17725
> URL: https://issues.apache.org/jira/browse/SPARK-17725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>
> Since Spark 1.3, after PR https://github.com/apache/spark/pull/4826, Spark 
> SQL has always set all schema fields to nullable before writing out parquet 
> files, to make the data pipeline more robust.
> However, this behaviour was accidentally changed in 2.0 by PR 
> https://github.com/apache/spark/pull/11509



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17737) cannot import name accumulators error

2016-09-29 Thread Pruthveej Reddy Kasarla (JIRA)
Pruthveej Reddy Kasarla created SPARK-17737:
---

 Summary: cannot import name accumulators error
 Key: SPARK-17737
 URL: https://issues.apache.org/jira/browse/SPARK-17737
 Project: Spark
  Issue Type: Question
  Components: PySpark
 Environment: unix
python 2.7
Reporter: Pruthveej Reddy Kasarla


Hi, I am trying to set up my SparkContext using the code below:

import sys
sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
from pyspark import SparkConf, SparkContext
sconf = SparkConf()
sc = SparkContext(conf=sconf)
print sc


I got the error below:


ImportError                               Traceback (most recent call last)
<ipython-input-...> in <module>()
      2 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
      3 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
----> 4 from pyspark import SparkConf, SparkContext
      5 sconf = SparkConf()
      6 sc = SparkContext(conf=sconf)

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/__init__.py in <module>()
     39 
     40 from pyspark.conf import SparkConf
---> 41 from pyspark.context import SparkContext
     42 from pyspark.rdd import RDD
     43 from pyspark.files import SparkFiles

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in <module>()
     26 from tempfile import NamedTemporaryFile
     27 
---> 28 from pyspark import accumulators
     29 from pyspark.accumulators import Accumulator
     30 from pyspark.broadcast import Broadcast

ImportError: cannot import name accumulators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534135#comment-15534135
 ] 

Apache Spark commented on SPARK-17549:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15304

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, defined by either having a really large number of columns, or a 
> really large number of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
>   ... 54 more
> 
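
The summarising-accumulator idea mentioned in the description can be sketched as follows. This is only an editorial illustration of the technique (one long accumulator instead of one stats row per partition on the driver), not the attached patch:

{code}
import org.apache.spark.sql.SparkSession

object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("acc-sketch").getOrCreate()

    // A single long accumulator holds the running total; the driver never keeps
    // a per-partition stats row in memory.
    val totalBytes = spark.sparkContext.longAccumulator("totalCachedBytes")

    val data = spark.sparkContext.parallelize(1 to 1000, numSlices = 10)
    data.foreachPartition { it =>
      // Pretend each element occupies 8 bytes; each partition adds only its subtotal.
      totalBytes.add(it.size.toLong * 8L)
    }

    println(s"Estimated cached size: ${totalBytes.value} bytes")
    spark.stop()
  }
}
{code}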

[jira] [Commented] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534127#comment-15534127
 ] 

Josh Rosen commented on SPARK-17733:


Here's an even simpler test case:

{code}
sql("""CREATE TEMPORARY VIEW foo(a) AS VALUES (CAST(-993 AS BIGINT))""")

sql("""
SELECT
*
FROM (
SELECT
COALESCE(t1.a, t2.a) AS int_col,
t1.a,
t2.a AS b
FROM foo t1
CROSS JOIN foo t2
) t1
INNER JOIN foo t2 ON (((t2.a) = (t1.a)) AND ((t2.a) = (t1.int_col))) 
AND ((t2.a) = (t1.b))
   """)
{code}

> InferFiltersFromConstraints rule never terminates for query
> ---
>
> Key: SPARK-17733
> URL: https://issues.apache.org/jira/browse/SPARK-17733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
> Attachments: 
> SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png,
>  constraints.png
>
>
> The following (complicated) example becomes stuck in the 
> {{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
> with a stack overflow and doesn't hit the limit on optimization passes, so I 
> think there's some sort of non-obvious infinite loop within the rule itself.
> {code:title=Table Creation|borderStyle=solid}
>  -- Query #0
> CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, decimal2610_col_3, 
> boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, timestamp_col_8) 
> AS VALUES
>   (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
> TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
> TIMESTAMP('2007-01-13 00:00:00.0')),
>   (CAST(722.4906 AS FLOAT), true, 497.54BD, true, 
> TIMESTAMP('2015-12-14 00:00:00.0'), false, CAST(268 AS BIGINT), 
> TIMESTAMP('2021-04-19 00:00:00.0')),
>   (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
> TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
> TIMESTAMP('2019-10-16 00:00:00.0')),
>   (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
> TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
> TIMESTAMP)),
>   (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
> TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
> TIMESTAMP('2024-09-01 00:00:00.0')),
>   (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
> TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
> TIMESTAMP('2028-12-03 00:00:00.0')),
>   (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
> TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
> TIMESTAMP('1994-04-04 00:00:00.0')),
>   (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
> TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
> TIMESTAMP('2023-07-08 00:00:00.0')),
>   (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
> TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
> TIMESTAMP('2013-09-29 00:00:00.0')),
>   (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
> TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
> TIMESTAMP('1990-08-10 00:00:00.0'))
>  ;
>  -- Query #1
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS VALUES
>   (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
>   (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 00:00:00.0'), 
> CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
>   (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 00:00:00.0'), 
> CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 00:00:00.0'), 
> '211', -959, CAST(NULL AS STRING)),
>   (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
>   (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 00:00:00.0'), 
> CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, TIMESTAMP('2028-06-27 
> 00:00:00.0'), '-657', 948, '18'),
>   (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 00:00:00.0'), 
> CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 00:00:00.0'), 
> '-345', 566, '-574'),
>   (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 00:00:00.0'), 
> CAST(972 AS SMALLINT), true, CAST(NULL AS INT), 

[jira] [Created] (SPARK-17736) Update R README for rmarkdown, pandoc

2016-09-29 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-17736:
-

 Summary: Update R README for rmarkdown, pandoc
 Key: SPARK-17736
 URL: https://issues.apache.org/jira/browse/SPARK-17736
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SparkR
Reporter: Joseph K. Bradley


To build R docs (which are built when R tests are run), users need to install 
pandoc and rmarkdown.  This was done for Jenkins in [SPARK-17420].

We should document these dependencies here:
[https://github.com/apache/spark/blob/master/docs/README.md#prerequisites]




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-29 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17653.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.1.0

Resolved per Viirya's PR

> Optimizer should remove unnecessary distincts (in multiple unions)
> --
>
> Key: SPARK-17653
> URL: https://issues.apache.org/jira/browse/SPARK-17653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Reynold Xin
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>+- *HashAggregate(keys=[a#13], functions=[])
>   +- Union
>  :- *HashAggregate(keys=[a#13], functions=[])
>  :  +- Exchange hashpartitioning(a#13, 200)
>  : +- *HashAggregate(keys=[a#13], functions=[])
>  :+- Union
>  :   :- *Project [1 AS a#13]
>  :   :  +- Scan OneRowRelation[]
>  :   +- *Project [2 AS b#14]
>  :  +- Scan OneRowRelation[]
>  +- *Project [3 AS c#15]
> +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower 
> than a bunch of union alls followed by a distinct.
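
The manual workaround mentioned in the last sentence looks roughly like this (a sketch assuming a Spark 2.x spark-shell session where {{spark}} is in scope):

{code}
// A single distinct over UNION ALLs, which is what the optimizer should ideally
// produce from the nested UNIONs above.
val workaround = spark.sql(
  "SELECT DISTINCT a FROM (SELECT 1 a UNION ALL SELECT 2 b UNION ALL SELECT 3 c) t")
workaround.explain()  // the plan should contain a single distinct aggregation over the Union
{code}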



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534082#comment-15534082
 ] 

Josh Rosen edited comment on SPARK-17733 at 9/29/16 9:30 PM:
-

I managed to shrink to a smaller case which freezes {{explain}}:

{code}
sql("""
CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, 
decimal2610_col_3, boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, 
timestamp_col_8) AS VALUES
  (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
TIMESTAMP('2007-01-13 00:00:00.0'))
""")

sql("""
CREATE TEMPORARY VIEW table_2(bigint_col_1, boolean_col_2, double_col_3, 
double_col_4, double_col_5, varchar0164_col_6) AS VALUES
  (CAST(-374 AS BIGINT), CAST(NULL AS BOOLEAN), CAST(939.626553676 AS DOUBLE), 
CAST(-777.275379746 AS DOUBLE), CAST(235.613760023 AS DOUBLE), '86')
  """)

sql("""
SELECT
*
FROM (
SELECT
COALESCE(t1.bigint_col_7, t2.bigint_col_7) AS int_col,
t1.bigint_col_7,
t2.bigint_col_7 AS int_col_1
FROM table_4 t1
CROSS JOIN table_4 t2
) t1
INNER JOIN table_2 t2 ON (((t2.bigint_col_1) = (t1.bigint_col_7)) AND 
((t2.bigint_col_1) = (t1.int_col))) AND ((t2.bigint_col_1) = (t1.int_col_1))
   """)
{code}


was (Author: joshrosen):
I managed to shrink to a smaller case which freezes {{explain}}:

{code}
sql("""
CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, 
decimal2610_col_3, boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, 
timestamp_col_8) AS VALUES
  (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
TIMESTAMP('2007-01-13 00:00:00.0')),
  (CAST(722.4906 AS FLOAT), true, 497.54BD, true, TIMESTAMP('2015-12-14 
00:00:00.0'), false, CAST(268 AS BIGINT), TIMESTAMP('2021-04-19 00:00:00.0')),
  (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
TIMESTAMP('2019-10-16 00:00:00.0')),
  (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
TIMESTAMP)),
  (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
TIMESTAMP('2024-09-01 00:00:00.0')),
  (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
TIMESTAMP('2028-12-03 00:00:00.0')),
  (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
TIMESTAMP('1994-04-04 00:00:00.0')),
  (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
TIMESTAMP('2023-07-08 00:00:00.0')),
  (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
TIMESTAMP('2013-09-29 00:00:00.0')),
  (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
TIMESTAMP('1990-08-10 00:00:00.0'))
""")

sql("""
CREATE TEMPORARY VIEW table_2(bigint_col_1, boolean_col_2, double_col_3, 
double_col_4, double_col_5, varchar0164_col_6) AS VALUES
  (CAST(-374 AS BIGINT), CAST(NULL AS BOOLEAN), CAST(939.626553676 AS DOUBLE), 
CAST(-777.275379746 AS DOUBLE), CAST(235.613760023 AS DOUBLE), '86'),
  (CAST(324 AS BIGINT), true, CAST(-507.23760783 AS DOUBLE), CAST(NULL AS 
DOUBLE), CAST(966.753434439 AS DOUBLE), '304'),
  (CAST(882 AS BIGINT), false, CAST(-366.529706229 AS DOUBLE), 
CAST(787.000491043 AS DOUBLE), CAST(-331.333188698 AS DOUBLE), '158'),
  (CAST(-510 AS BIGINT), CAST(NULL AS BOOLEAN), CAST(-855.344932257 AS DOUBLE), 
CAST(-858.167264921 AS DOUBLE), CAST(NULL AS DOUBLE), '-419'),
  (CAST(-13 AS BIGINT), false, CAST(589.966987492 AS DOUBLE), CAST(NULL AS 
DOUBLE), CAST(-653.515783257 AS DOUBLE), '970'),
  (CAST(-361 AS BIGINT), true, CAST(-413.021011259 AS DOUBLE), 
CAST(-716.638705947 AS DOUBLE), CAST(-936.480108205 AS DOUBLE), '807'),
  (CAST(815 AS BIGINT), true, CAST(-643.690268711 AS DOUBLE), 
CAST(-684.206112496 AS DOUBLE), CAST(335.557479371 AS DOUBLE), '-872'),
  (CAST(617 AS BIGINT), true, CAST(-93.3806447556 AS DOUBLE), 
CAST(-322.66171021 AS DOUBLE), CAST(-951.18299435 AS DOUBLE), '-167'),
  (CAST(-876 AS BIGINT), false, CAST(-481.774062168 AS DOUBLE), 
CAST(-204.40537387 AS DOUBLE), CAST(224.889845986 AS DOUBLE), '-986'),
  (CAST(2 AS BIGINT), false, CAST(462.843898322 AS DOUBLE), CAST(-9.85549856798 
AS DOUBLE), CAST(-549.875829922 AS DOUBLE), '121')
  """)

sql("""
SELECT
*
FROM (
SELECT
COALESCE(t1.bigint_col_7, t2.bigint_col_7) AS int_col,

[jira] [Comment Edited] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534082#comment-15534082
 ] 

Josh Rosen edited comment on SPARK-17733 at 9/29/16 9:28 PM:
-

I managed to shrink to a smaller case which freezes {{explain}}:

{code}
sql("""
CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, 
decimal2610_col_3, boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, 
timestamp_col_8) AS VALUES
  (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
TIMESTAMP('2007-01-13 00:00:00.0')),
  (CAST(722.4906 AS FLOAT), true, 497.54BD, true, TIMESTAMP('2015-12-14 
00:00:00.0'), false, CAST(268 AS BIGINT), TIMESTAMP('2021-04-19 00:00:00.0')),
  (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
TIMESTAMP('2019-10-16 00:00:00.0')),
  (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
TIMESTAMP)),
  (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
TIMESTAMP('2024-09-01 00:00:00.0')),
  (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
TIMESTAMP('2028-12-03 00:00:00.0')),
  (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
TIMESTAMP('1994-04-04 00:00:00.0')),
  (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
TIMESTAMP('2023-07-08 00:00:00.0')),
  (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
TIMESTAMP('2013-09-29 00:00:00.0')),
  (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
TIMESTAMP('1990-08-10 00:00:00.0'))
""")

sql("""
CREATE TEMPORARY VIEW table_2(bigint_col_1, boolean_col_2, double_col_3, 
double_col_4, double_col_5, varchar0164_col_6) AS VALUES
  (CAST(-374 AS BIGINT), CAST(NULL AS BOOLEAN), CAST(939.626553676 AS DOUBLE), 
CAST(-777.275379746 AS DOUBLE), CAST(235.613760023 AS DOUBLE), '86'),
  (CAST(324 AS BIGINT), true, CAST(-507.23760783 AS DOUBLE), CAST(NULL AS 
DOUBLE), CAST(966.753434439 AS DOUBLE), '304'),
  (CAST(882 AS BIGINT), false, CAST(-366.529706229 AS DOUBLE), 
CAST(787.000491043 AS DOUBLE), CAST(-331.333188698 AS DOUBLE), '158'),
  (CAST(-510 AS BIGINT), CAST(NULL AS BOOLEAN), CAST(-855.344932257 AS DOUBLE), 
CAST(-858.167264921 AS DOUBLE), CAST(NULL AS DOUBLE), '-419'),
  (CAST(-13 AS BIGINT), false, CAST(589.966987492 AS DOUBLE), CAST(NULL AS 
DOUBLE), CAST(-653.515783257 AS DOUBLE), '970'),
  (CAST(-361 AS BIGINT), true, CAST(-413.021011259 AS DOUBLE), 
CAST(-716.638705947 AS DOUBLE), CAST(-936.480108205 AS DOUBLE), '807'),
  (CAST(815 AS BIGINT), true, CAST(-643.690268711 AS DOUBLE), 
CAST(-684.206112496 AS DOUBLE), CAST(335.557479371 AS DOUBLE), '-872'),
  (CAST(617 AS BIGINT), true, CAST(-93.3806447556 AS DOUBLE), 
CAST(-322.66171021 AS DOUBLE), CAST(-951.18299435 AS DOUBLE), '-167'),
  (CAST(-876 AS BIGINT), false, CAST(-481.774062168 AS DOUBLE), 
CAST(-204.40537387 AS DOUBLE), CAST(224.889845986 AS DOUBLE), '-986'),
  (CAST(2 AS BIGINT), false, CAST(462.843898322 AS DOUBLE), CAST(-9.85549856798 
AS DOUBLE), CAST(-549.875829922 AS DOUBLE), '121')
  """)

sql("""
SELECT
*
FROM (
SELECT
COALESCE(t1.bigint_col_7, t2.bigint_col_7) AS int_col,
t1.bigint_col_7,
t2.bigint_col_7 AS int_col_1
FROM table_4 t1
CROSS JOIN table_4 t2
) t1
INNER JOIN table_2 t2 ON (((t2.bigint_col_1) = (t1.bigint_col_7)) AND 
((t2.bigint_col_1) = (t1.int_col))) AND ((t2.bigint_col_1) = (t1.int_col_1))
   """)
{code}


was (Author: joshrosen):
I managed to shrink to a smaller case which freezes {{explain}}:

{code}
sql("""
CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, 
decimal2610_col_3, boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, 
timestamp_col_8) AS VALUES
  (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
TIMESTAMP('2007-01-13 00:00:00.0')),
  (CAST(722.4906 AS FLOAT), true, 497.54BD, true, TIMESTAMP('2015-12-14 
00:00:00.0'), false, CAST(268 AS BIGINT), TIMESTAMP('2021-04-19 00:00:00.0')),
  (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
TIMESTAMP('2019-10-16 00:00:00.0')),
  (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 

[jira] [Commented] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534082#comment-15534082
 ] 

Josh Rosen commented on SPARK-17733:


I managed to shrink to a smaller case which freezes {{explain}}:

{code}
sql("""
CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, 
decimal2610_col_3, boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, 
timestamp_col_8) AS VALUES
  (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
TIMESTAMP('2007-01-13 00:00:00.0')),
  (CAST(722.4906 AS FLOAT), true, 497.54BD, true, TIMESTAMP('2015-12-14 
00:00:00.0'), false, CAST(268 AS BIGINT), TIMESTAMP('2021-04-19 00:00:00.0')),
  (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
TIMESTAMP('2019-10-16 00:00:00.0')),
  (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
TIMESTAMP)),
  (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
TIMESTAMP('2024-09-01 00:00:00.0')),
  (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
TIMESTAMP('2028-12-03 00:00:00.0')),
  (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
TIMESTAMP('1994-04-04 00:00:00.0')),
  (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
TIMESTAMP('2023-07-08 00:00:00.0')),
  (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
TIMESTAMP('2013-09-29 00:00:00.0')),
  (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
TIMESTAMP('1990-08-10 00:00:00.0'))
""")

sql("""
CREATE TEMPORARY VIEW table_2(bigint_col_1, boolean_col_2, double_col_3, 
double_col_4, double_col_5, varchar0164_col_6) AS VALUES
  (CAST(-374 AS BIGINT), CAST(NULL AS BOOLEAN), CAST(939.626553676 AS DOUBLE), 
CAST(-777.275379746 AS DOUBLE), CAST(235.613760023 AS DOUBLE), '86'),
  (CAST(324 AS BIGINT), true, CAST(-507.23760783 AS DOUBLE), CAST(NULL AS 
DOUBLE), CAST(966.753434439 AS DOUBLE), '304'),
  (CAST(882 AS BIGINT), false, CAST(-366.529706229 AS DOUBLE), 
CAST(787.000491043 AS DOUBLE), CAST(-331.333188698 AS DOUBLE), '158'),
  (CAST(-510 AS BIGINT), CAST(NULL AS BOOLEAN), CAST(-855.344932257 AS DOUBLE), 
CAST(-858.167264921 AS DOUBLE), CAST(NULL AS DOUBLE), '-419'),
  (CAST(-13 AS BIGINT), false, CAST(589.966987492 AS DOUBLE), CAST(NULL AS 
DOUBLE), CAST(-653.515783257 AS DOUBLE), '970'),
  (CAST(-361 AS BIGINT), true, CAST(-413.021011259 AS DOUBLE), 
CAST(-716.638705947 AS DOUBLE), CAST(-936.480108205 AS DOUBLE), '807'),
  (CAST(815 AS BIGINT), true, CAST(-643.690268711 AS DOUBLE), 
CAST(-684.206112496 AS DOUBLE), CAST(335.557479371 AS DOUBLE), '-872'),
  (CAST(617 AS BIGINT), true, CAST(-93.3806447556 AS DOUBLE), 
CAST(-322.66171021 AS DOUBLE), CAST(-951.18299435 AS DOUBLE), '-167'),
  (CAST(-876 AS BIGINT), false, CAST(-481.774062168 AS DOUBLE), 
CAST(-204.40537387 AS DOUBLE), CAST(224.889845986 AS DOUBLE), '-986'),
  (CAST(2 AS BIGINT), false, CAST(462.843898322 AS DOUBLE), CAST(-9.85549856798 
AS DOUBLE), CAST(-549.875829922 AS DOUBLE), '121')
  """)

sql("""
SELECT
*
FROM (
SELECT
COALESCE(t1.bigint_col_7, t2.bigint_col_7) AS int_col,
t1.bigint_col_7,
t2.bigint_col_7 AS int_col_1
FROM table_4 t1
INNER JOIN table_4 t2 ON true
) t1
INNER JOIN table_2 t2 ON (((t2.bigint_col_1) = (t1.bigint_col_7)) AND 
((t2.bigint_col_1) = (t1.int_col))) AND ((t2.bigint_col_1) = (t1.int_col_1))
   """)
{code}

> InferFiltersFromConstraints rule never terminates for query
> ---
>
> Key: SPARK-17733
> URL: https://issues.apache.org/jira/browse/SPARK-17733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
> Attachments: 
> SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png,
>  constraints.png
>
>
> The following (complicated) example becomes stuck in the 
> {{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
> with a stack overflow and doesn't hit the limit on optimization passes, so I 
> think there's some sort of non-obvious infinite loop within the rule itself.
> {code:title=Table 

[jira] [Commented] (SPARK-17735) Cannot call sqlContext inside udf

2016-09-29 Thread Saif Addin Ellafi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534073#comment-15534073
 ] 

Saif Addin Ellafi commented on SPARK-17735:
---

Appreciate.

That's exactly what I ended up doing. But I've been stuck for a couple of days 
since I was getting misleading error messages.

Saif

> Cannot call sqlContext inside udf
> -
>
> Key: SPARK-17735
> URL: https://issues.apache.org/jira/browse/SPARK-17735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Saif Addin Ellafi
>Priority: Minor
>
> Hello, I know it is a strange use case but I just wanted to append is cache 
> to hive tables and turned out into this
> {noformat}
> def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
> sqlContext.isCached(name)}
> val ts = sqlContext.tables
> ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show
> {noformat}
> .. will throw a Null Pointer Exception.
> I am not sure, since in my program I was getting:
> A master url must be set in your configuration
> or NullPointer at getConf as well.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17735) Cannot call sqlContext inside udf

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534069#comment-15534069
 ] 

Herman van Hovell commented on SPARK-17735:
---

It is better to do this completely on the driver side. UDFs are executed on 
executors, which do not know anything about Hive. You could create a Seq and 
turn that into a table once you have determined the isTemp & isCached flags.
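
A minimal driver-side sketch of that suggestion (illustrative only, written against the Spark 1.6 API and assuming a spark-shell session where {{sqlContext}} is available):

{code}
import sqlContext.implicits._

// Determine the flags on the driver, then turn the Seq into a DataFrame.
val rows = sqlContext.tables().collect().map { r =>
  val name   = r.getString(r.fieldIndex("tableName"))
  val isTemp = r.getBoolean(r.fieldIndex("isTemporary"))
  (name, isTemp, sqlContext.isCached(name))  // isCached is evaluated on the driver
}.toSeq

rows.toDF("tableName", "isTemporary", "isCached").show()
{code}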



> Cannot call sqlContext inside udf
> -
>
> Key: SPARK-17735
> URL: https://issues.apache.org/jira/browse/SPARK-17735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Saif Addin Ellafi
>Priority: Minor
>
> Hello, I know it is a strange use case but I just wanted to append is cache 
> to hive tables and turned out into this
> {noformat}
> def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
> sqlContext.isCached(name)}
> val ts = sqlContext.tables
> ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show
> {noformat}
> .. will throw a Null Pointer Exception.
> I am not sure, since in my program I was getting:
> A master url must be set in your configuration
> or NullPointer at getConf as well.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17728) UDFs are run too many times

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534059#comment-15534059
 ] 

Herman van Hovell commented on SPARK-17728:
---

You really should not try to use any external state in a UDF (it should be a 
pure function).

It might be an idea to use a generator in this case. These are guaranteed to 
only execute once for an input tuple.
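
For what it's worth, the caching workaround from the attachment can be sketched like this. It is a self-contained, illustrative example with made-up column names and a stand-in UDF, not the reporter's code:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("udf-once").getOrCreate()
import spark.implicits._

val df = Seq("spark", "jira").toDF("value")
val enrich = udf((s: String) => (s.length, s.toUpperCase))  // stand-in for an expensive UDF

// Materialize the UDF result once into a struct column, cache it, and only then
// fan the struct out into plain columns.
val enriched = df.withColumn("attrs", enrich($"value")).cache()
val result = enriched.select($"value", $"attrs._1".as("length"), $"attrs._2".as("upper"))
result.show()
{code}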


> UDFs are run too many times
> ---
>
> Key: SPARK-17728
> URL: https://issues.apache.org/jira/browse/SPARK-17728
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Databricks Cloud / Spark 2.0.0
>Reporter: Jacob Eisinger
>Priority: Minor
> Attachments: over_optimized_udf.html
>
>
> h3. Background
> Llonger running processes that might run analytics or contact external 
> services from UDFs. The response might not just be a field, but instead a 
> structure of information. When attempting to break out this information, it 
> is critical that query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed *multiple times* _per row._
> h3. Expected Results
> The UDF should only be executed *once* _per row._
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> For code and more details, see [^over_optimized_udf.html]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17718) Update MLib Classification Documentation

2016-09-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17718:
--
Issue Type: Documentation  (was: Improvement)

> Update MLib Classification Documentation 
> -
>
> Key: SPARK-17718
> URL: https://issues.apache.org/jira/browse/SPARK-17718
> Project: Spark
>  Issue Type: Documentation
>Reporter: Tobi Bosede
>Priority: Minor
>
> https://spark.apache.org/docs/1.6.0/mllib-linear-methods.html#mjx-eqn-eqregPrimal
> The loss function shown here for logistic regression is confusing. It seems to 
> imply that Spark uses only -1 and +1 class labels; however, it actually uses 0 and 1. The note 
> below needs to make this point more visible to avoid confusion.
> "Note that, in the mathematical formulation in this guide, a binary label
> y is denoted as either +1 (positive) or −1 (negative), which is convenient
> for the formulation. However, the negative label is represented by 0 in
> spark.mllib instead of −1, to be consistent with multiclass labeling."
> Better yet, the loss function should be replaced with the one for 0/1 labels, despite the 
> mathematical inconvenience, since that is what is actually implemented. 
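
For reference, the two equivalent formulations being contrasted are (standard results, stated here for context rather than quoted from the Spark guide):

{noformat}
With labels y in {-1, +1}:
    L(w; x, y) = \log\left(1 + e^{-y\, w^{T} x}\right)

With labels y in {0, 1}, writing \sigma(z) = 1 / (1 + e^{-z}):
    L(w; x, y) = -\left[\, y \log \sigma(w^{T} x) + (1 - y) \log\bigl(1 - \sigma(w^{T} x)\bigr) \right]
{noformat}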



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17735) Cannot call sqlContext inside udf

2016-09-29 Thread Saif Addin Ellafi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534046#comment-15534046
 ] 

Saif Addin Ellafi commented on SPARK-17735:
---

Hello, thanks

I needed to know not only whether the Hive tables were temporary, but also whether they were 
cached. The final format is a DataFrame, just to show the results as a table 
later on.
I can think of alternative ways that do not change any state inside the 
UDF.

> Cannot call sqlContext inside udf
> -
>
> Key: SPARK-17735
> URL: https://issues.apache.org/jira/browse/SPARK-17735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Saif Addin Ellafi
>Priority: Minor
>
> Hello, I know it is a strange use case but I just wanted to append is cache 
> to hive tables and turned out into this
> {noformat}
> def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
> sqlContext.isCached(name)}
> val ts = sqlContext.tables
> ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show
> {noformat}
> .. will throw a Null Pointer Exception.
> I am not sure, since in my program I was getting:
> A master url must be set in your configuration
> or NullPointer at getConf as well.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17735) Cannot call sqlContext inside udf

2016-09-29 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-17735.
-
Resolution: Not A Problem

> Cannot call sqlContext inside udf
> -
>
> Key: SPARK-17735
> URL: https://issues.apache.org/jira/browse/SPARK-17735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Saif Addin Ellafi
>Priority: Minor
>
> Hello, I know it is a strange use case but I just wanted to append is cache 
> to hive tables and turned out into this
> {noformat}
> def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
> sqlContext.isCached(name)}
> val ts = sqlContext.tables
> ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show
> {noformat}
> .. will throw a Null Pointer Exception.
> I am not sure, since in my program I was getting:
> A master url must be set in your configuration
> or NullPointer at getConf as well.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17735) Cannot call sqlContext inside udf

2016-09-29 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17735:
--
Description: 
Hello, I know it is a strange use case but I just wanted to append is cache to 
hive tables and turned out into this
{noformat}
def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
sqlContext.isCached(name)}

val ts = sqlContext.tables

ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show
{noformat}
.. will throw a Null Pointer Exception.

I am not sure, since in my program I was getting:

A master url must be set in your configuration

or NullPointer at getConf as well.

Thoughts?

  was:
Hello, I know it is a strange use case but I just wanted to append is cache to 
hive tables and turned out into this

def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
sqlContext.isCached(name)}

val ts = sqlContext.tables

ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show

.. will throw a Null Pointer Exception.

I am not sure, since in my program I was getting:

A master url must be set in your configuration

or NullPointer at getConf as well.

Thoughts?


> Cannot call sqlContext inside udf
> -
>
> Key: SPARK-17735
> URL: https://issues.apache.org/jira/browse/SPARK-17735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Saif Addin Ellafi
>Priority: Minor
>
> Hello, I know it is a strange use case but I just wanted to append is cache 
> to hive tables and turned out into this
> {noformat}
> def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
> sqlContext.isCached(name)}
> val ts = sqlContext.tables
> ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show
> {noformat}
> .. will throw a Null Pointer Exception.
> I am not sure, since in my program I was getting:
> A master url must be set in your configuration
> or NullPointer at getConf as well.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17735) Cannot call sqlContext inside udf

2016-09-29 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534040#comment-15534040
 ] 

Herman van Hovell commented on SPARK-17735:
---

You really should not use {{sqlContext}} inside a UDF. As a general rule: you 
should never modify state outside the UDF in a UDF (it must be a pure function).

What are you trying to do?

> Cannot call sqlContext inside udf
> -
>
> Key: SPARK-17735
> URL: https://issues.apache.org/jira/browse/SPARK-17735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Saif Addin Ellafi
>Priority: Minor
>
> Hello, I know it is a strange use case but I just wanted to append is cache 
> to hive tables and turned out into this
> def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
> sqlContext.isCached(name)}
> val ts = sqlContext.tables
> ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show
> .. will throw a Null Pointer Exception.
> I am not sure, since in my program I was getting:
> A master url must be set in your configuration
> or NullPointer at getConf as well.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17721) Erroneous computation in multiplication of transposed SparseMatrix with SparseVector

2016-09-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534041#comment-15534041
 ] 

Joseph K. Bradley commented on SPARK-17721:
---

Noting here: We should audit MLlib for uses of this multiply to see what 
algorithms it might have affected.  I'm hoping the effects will have been 
minimal.

> Erroneous computation in multiplication of transposed SparseMatrix with 
> SparseVector
> 
>
> Key: SPARK-17721
> URL: https://issues.apache.org/jira/browse/SPARK-17721
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.4.0, 1.6.1, 2.0.0
> Environment: Verified on OS X with Spark 1.6.1 and on Databricks 
> running Spark 1.6.1
>Reporter: Bjarne Fruergaard
>Priority: Critical
>  Labels: correctness
>
> There is a bug in how a transposed SparseMatrix (isTransposed=true) does 
> multiplication with a SparseVector. The bug is present (for v. > 2.0.0) in 
> both org.apache.spark.mllib.linalg.BLAS (mllib) and 
> org.apache.spark.ml.linalg.BLAS (mllib-local) in the private gemv method with 
> signature:
> bq. gemv(alpha: Double, A: SparseMatrix, x: SparseVector, beta: Double, y: 
> DenseVector).
> This bug can be verified by running the following snippet in a Spark shell 
> (here using v1.6.1):
> {code:java}
> import com.holdenkarau.spark.testing.SharedSparkContext
> import org.apache.spark.mllib.linalg._
> val A = Matrices.dense(3, 2, Array[Double](0, 2, 1, 1, 2, 
> 0)).asInstanceOf[DenseMatrix].toSparse.transpose
> val b = Vectors.sparse(3, Seq[(Int, Double)]((1, 2), (2, 
> 1))).asInstanceOf[SparseVector]
> A.multiply(b)
> A.multiply(b.toDense)
> {code}
> The first multiply with the SparseMatrix returns the incorrect result:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,0.0]
> {code}
> whereas the correct result is returned by the second multiply:
> {code:java}
> org.apache.spark.mllib.linalg.DenseVector = [5.0,4.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17671) Spark 2.0 history server summary page is slow even set spark.history.ui.maxApplications

2016-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534034#comment-15534034
 ] 

Apache Spark commented on SPARK-17671:
--

User 'wgtmac' has created a pull request for this issue:
https://github.com/apache/spark/pull/15303

> Spark 2.0 history server summary page is slow even set 
> spark.history.ui.maxApplications
> ---
>
> Key: SPARK-17671
> URL: https://issues.apache.org/jira/browse/SPARK-17671
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>
> This is a subsequent task of 
> [SPARK-17243|https://issues.apache.org/jira/browse/SPARK-17243]. After the 
> fix of SPARK-17243 (limit the number of applications in the JSON string 
> transferred from history server backend to web UI frontend), the history 
> server does display the target number of history summaries. 
> However, when there are more than 10k application histories, it still gets 
> slower and slower. The problem is in the following code:
> {code:title=ApplicationListResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class ApplicationListResource(uiRoot: UIRoot) {
>   @GET
>   def appList(
>   @QueryParam("status") status: JList[ApplicationStatus],
>   @DefaultValue("2010-01-01") @QueryParam("minDate") minDate: 
> SimpleDateParam,
>   @DefaultValue("3000-01-01") @QueryParam("maxDate") maxDate: 
> SimpleDateParam,
>   @QueryParam("limit") limit: Integer)
>   : Iterator[ApplicationInfo] = {
> // although there is a limit operation in the end
> // the following line still does a transformation for all history 
> // in the list
> val allApps = uiRoot.getApplicationInfoList
> 
> // ...
> // irrelevant code is omitted 
> // ...
> if (limit != null) {
>   appList.take(limit)
> } else {
>   appList
> }
>   }
> }
> {code}
> What the code **uiRoot.getApplicationInfoList** does is to transform every 
> application history from class ApplicationHistoryInfo to class 
> ApplicationInfo. So if there are 10k applications, 10k transformations will 
> be done even though we have limited the result to 5000 entries here.
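
The cost pattern being described can be illustrated generically. This is not the history-server code itself, just a sketch of strict map-then-take versus a lazy view:

{code}
// Placeholder for the ApplicationHistoryInfo -> ApplicationInfo conversion.
def expensiveTransform(i: Int): Int = i * 2

val histories = (1 to 10000).toList
val eager  = histories.map(expensiveTransform).take(5000)               // 10,000 conversions
val lazily = histories.view.map(expensiveTransform).take(5000).toList   //  5,000 conversions
{code}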



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17735) Cannot call sqlContext inside udf

2016-09-29 Thread Saif Addin Ellafi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saif Addin Ellafi updated SPARK-17735:
--
Description: 
Hello, I know it is a strange use case but I just wanted to append is cache to 
hive tables and turned out into this

def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
sqlContext.isCached(name)}

val ts = sqlContext.tables

ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show

.. will throw a Null Pointer Exception.

I am not sure, since in my program I was getting:

A master url must be set in your configuration

or NullPointer at getConf as well.

Thoughts?

  was:
Hello, I know it is a strange use case but I just wanted to append is cache to 
hive tables and turned out into this

def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
sqlContext.isCached(name)}

val ts = sqlContext.tables

ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show

.. will throw a Null Pointer Exception.

I am not sure, since in my program I was getting:

A master url must be set in your configuration

or NullPointer a getConf as well.

Thoughts?


> Cannot call sqlContext inside udf
> -
>
> Key: SPARK-17735
> URL: https://issues.apache.org/jira/browse/SPARK-17735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Saif Addin Ellafi
>Priority: Minor
>
> Hello, I know it is a strange use case but I just wanted to append is cache 
> to hive tables and turned out into this
> def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
> sqlContext.isCached(name)}
> val ts = sqlContext.tables
> ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show
> .. will throw a Null Pointer Exception.
> I am not sure, since in my program I was getting:
> A master url must be set in your configuration
> or NullPointer at getConf as well.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17735) Cannot call sqlContext inside udf

2016-09-29 Thread Saif Addin Ellafi (JIRA)
Saif Addin Ellafi created SPARK-17735:
-

 Summary: Cannot call sqlContext inside udf
 Key: SPARK-17735
 URL: https://issues.apache.org/jira/browse/SPARK-17735
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.6.0
 Environment: Linux
Reporter: Saif Addin Ellafi
Priority: Minor


Hello, I know it is a strange use case but I just wanted to append is cache to 
hive tables and turned out into this

def inCacheUdf = org.apache.spark.sql.functions.udf {name: String => 
sqlContext.isCached(name)}

val ts = sqlContext.tables

ts.withColumn("foo", inCacheUdf(ts.col("tableName"))).show

.. will throw a Null Pointer Exception.

I am not sure, since in my program I was getting:

A master url must be set in your configuration

or NullPointer a getConf as well.

Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533996#comment-15533996
 ] 

Josh Rosen commented on SPARK-17733:


Actually, the above log segment wasn't super useful, so let me post some of the 
inferred filters instead, since that might make the pattern easier to spot:

{code}
Filter ((isnotnull(col8#7) && (col7#6L IN (coalesce(col7#6L, col7#6L, 
col7#6L),coalesce(col7#6L, col7#6L, col7#6L)) && (coalesce(coalesce(col7#6L, 
col7#6L, col7#6L), col7#6L, col7#6L) <=> coalesce(col7#6L, col7#6L, col7#6L))) 
&& (col7#6L = coalesce(col7#6L, col7#6L, col7#6L))) && (col7#6L <=> 
coalesce(col7#6L, col7#6L, col7#6L))) && coalesce(col7#6L, col7#6L, col7#6L) IN 
(col7#6L,col7#6L)) && col7#6L IN (col7#6L,col7#6L))) && col7#6L 
= coalesce(col7#6L, coalesce(col7#6L, col7#6L, col7#6L), coalesce(col7#6L, 
col7#6L, col7#6L))) && (coalesce(coalesce(col7#6L, col7#6L, col7#6L), col7#6L, 
col7#6L) = col7#6L)) && (coalesce(coalesce(col7#6L, col7#6L, col7#6L), col7#6L, 
col7#6L) <=> col7#6L)) && (coalesce(col7#6L, coalesce(col7#6L, col7#6L, 
col7#6L), coalesce(col7#6L, col7#6L, col7#6L)) <=> col7#6L)) && 
(coalesce(coalesce(coalesce(col7#6L, col7#6L, col7#6L), col7#6L, col7#6L), 
col7#6L, col7#6L) <=> coalesce(coalesce(col7#6L, col7#6L, col7#6L), col7#6L, 
col7#6L))) && (coalesce(col7#6L, col7#6L, col7#6L) = coalesce(coalesce(col7#6L, 
col7#6L, col7#6L), col7#6L, col7#6L))) && (coalesce(coalesce(col7#6L, col7#6L, 
col7#6L), coalesce(col7#6L, col7#6L, col7#6L), coalesce(col7#6L, col7#6L, 
col7#6L)) <=> col7#6L)) && (coalesce(coalesce(col7#6L, col7#6L, col7#6L), 
coalesce(col7#6L, col7#6L, col7#6L), coalesce(col7#6L, col7#6L, col7#6L)) <=> 
coalesce(col7#6L, coalesce(col7#6L, col7#6L, col7#6L), coalesce(col7#6L, 
col7#6L, col7#6L && coalesce(coalesce(col7#6L, col7#6L, col7#6L), col7#6L, 
col7#6L) IN (col7#6L,col7#6L)) && col7#6L IN (coalesce(col7#6L, 
coalesce(col7#6L, col7#6L, col7#6L), coalesce(col7#6L, col7#6L, 
col7#6L)),coalesce(col7#6L, coalesce(col7#6L, col7#6L, col7#6L), 
coalesce(col7#6L, col7#6L, col7#6L && coalesce(col7#6L, coalesce(col7#6L, 
col7#6L, col7#6L), coalesce(col7#6L, col7#6L, col7#6L)) IN (coalesce(col7#6L, 
col7#6L, col7#6L),coalesce(col7#6L, col7#6L, col7#6L))) && 
(coalesce(coalesce(col7#6L, coalesce(col7#6L, col7#6L, col7#6L), 
coalesce(col7#6L, col7#6L, col7#6L)), coalesce(col7#6L, col7#6L, col7#6L), 
coalesce(col7#6L, col7#6L, col7#6L)) <=> coalesce(col7#6L, coalesce(col7#6L, 
col7#6L, col7#6L), coalesce(col7#6L, col7#6L, col7#6L && (coalesce(col7#6L, 
col7#6L, col7#6L) <=> coalesce(col7#6L, coalesce(col7#6L, col7#6L, col7#6L), 
coalesce(col7#6L, col7#6L, col7#6L && coalesce(col7#6L, col7#6L, col7#6L) 
IN (coalesce(coalesce(col7#6L, col7#6L, col7#6L), col7#6L, 
col7#6L),coalesce(coalesce(col7#6L, col7#6L, col7#6L), col7#6L, col7#6L))) && 
(coalesce(col7#6L, coalesce(col7#6L, col7#6L, col7#6L), coalesce(col7#6L, 
col7#6L, col7#6L)) = coalesce(col7#6L, col7#6L, col7#6L))) && true))
{code}

and 

{code}
 Filter isnotnull(int_col_1#87) && isnotnull(int_col#85)) && 
(((cast(int_col#85 as bigint) = cast(int_col_1#87 as bigint)) && null) && 
null)) && null) && cast(int_col_1#87 as bigint) IN 
(cast(int_col_1#87 as bigint),cast(int_col_1#87 as bigint)) && 
(coalesce(coalesce(cast(int_col_1#87 as bigint), cast(int_col_1#87 as bigint), 
cast(int_col_1#87 as bigint)), cast(int_col_1#87 as bigint), cast(int_col_1#87 
as bigint)) <=> coalesce(cast(int_col_1#87 as bigint), cast(int_col_1#87 as 
bigint), cast(int_col_1#87 as bigint && (cast(int_col_1#87 as bigint) <=> 
coalesce(cast(int_col_1#87 as bigint), cast(int_col_1#87 as bigint), 
cast(int_col_1#87 as bigint && (cast(int_col_1#87 as bigint) = 
coalesce(cast(int_col_1#87 as bigint), cast(int_col_1#87 as bigint), 
cast(int_col_1#87 as bigint && cast(int_col_1#87 as bigint) IN 
(coalesce(cast(int_col_1#87 as bigint), cast(int_col_1#87 as bigint), 
cast(int_col_1#87 as bigint)),coalesce(cast(int_col_1#87 as bigint), 
cast(int_col_1#87 as bigint), cast(int_col_1#87 as bigint && 
coalesce(cast(int_col_1#87 as bigint), cast(int_col_1#87 as bigint), 
cast(int_col_1#87 as bigint)) IN (cast(int_col_1#87 as 
bigint),cast(int_col_1#87 as bigint))) || null) && ((null && isnull(null)) 
&& isnull(null)) && null) && null) && null) || null)) && null) && 
((cast(int_col#85 as bigint) IN (coalesce(cast(int_col#85 as bigint), 
cast(int_col#85 as bigint), cast(int_col#85 as 
bigint)),coalesce(cast(int_col#85 as bigint), cast(int_col#85 as bigint), 
cast(int_col#85 as bigint))) && (coalesce(coalesce(cast(int_col#85 as bigint), 
cast(int_col#85 as bigint), cast(int_col#85 as bigint)), cast(int_col#85 as 
bigint), cast(int_col#85 as bigint)) <=> coalesce(cast(int_col#85 as bigint), 
cast(int_col#85 as bigint), cast(int_col#85 as bigint && (cast(int_col#85 
as bigint) <=> 

[jira] [Commented] (SPARK-17097) Pregel does not keep vertex state properly; fails to terminate

2016-09-29 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533976#comment-15533976
 ] 

ding commented on SPARK-17097:
--

Because diff over a case class behaves differently than over a regular class: a case class 
implements the equals method (value equality), while a plain class does not, so comparing 
two objects of a plain class actually compares their memory addresses. 
In the code above, if we remove "case", then after the transform the original vertices are 
still different from the newly generated vertices even though they hold the same 
values. That way the EdgeTriplet can be updated, since there is a difference, 
and after one iteration there are no active messages and the application 
terminates. 
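
A minimal illustration of that difference (nothing GraphX-specific, just Scala equality semantics):

{code}
case class CaseVertex(id: Long, value: String)
class PlainVertex(val id: Long, val value: String)

val byValue     = CaseVertex(1L, "a") == CaseVertex(1L, "a")            // true: structural equality
val byReference = new PlainVertex(1L, "a") == new PlainVertex(1L, "a")  // false: reference equality
println(s"case class: $byValue, plain class: $byReference")
{code}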

> Pregel does not keep vertex state properly; fails to terminate 
> ---
>
> Key: SPARK-17097
> URL: https://issues.apache.org/jira/browse/SPARK-17097
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
> Environment: Scala 2.10.5, Spark 1.6.0 with GraphX and Pregel
>Reporter: Seth Bromberger
>
> Consider the following minimum example:
> {code:title=PregelBug.scala|borderStyle=solid}
> package testGraph
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, _}
> object PregelBug {
>   def main(args: Array[String]) = {
> //FIXME breaks if TestVertex is a case class; works if not case class
> case class TestVertex(inId: VertexId,
>  inData: String,
>  inLabels: collection.mutable.HashSet[String]) extends 
> Serializable {
>   val id = inId
>   val value = inData
>   val labels = inLabels
> }
> class TestLink(inSrc: VertexId, inDst: VertexId, inData: String) extends 
> Serializable  {
>   val src = inSrc
>   val dst = inDst
>   val data = inData
> }
> val startString = "XXXSTARTXXX"
> val conf = new SparkConf().setAppName("pregeltest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val vertexes = Vector(
>   new TestVertex(0, "label0", collection.mutable.HashSet[String]()),
>   new TestVertex(1, "label1", collection.mutable.HashSet[String]())
> )
> val links = Vector(
>   new TestLink(0, 1, "linkData01")
> )
> val vertexes_packaged = vertexes.map(v => (v.id, v))
> val links_packaged = links.map(e => Edge(e.src, e.dst, e))
> val graph = Graph[TestVertex, 
> TestLink](sc.parallelize(vertexes_packaged), sc.parallelize(links_packaged))
> def vertexProgram (vertexId: VertexId, vdata: TestVertex, message: 
> Vector[String]): TestVertex = {
>   message.foreach {
> case `startString` =>
>   if (vdata.id == 0L)
> vdata.labels.add(vdata.value)
> case m =>
>   if (!vdata.labels.contains(m))
> vdata.labels.add(m)
>   }
>   new TestVertex(vdata.id, vdata.value, vdata.labels)
> }
> def sendMessage (triplet: EdgeTriplet[TestVertex, TestLink]): 
> Iterator[(VertexId, Vector[String])] = {
>   val srcLabels = triplet.srcAttr.labels
>   val dstLabels = triplet.dstAttr.labels
>   val msgsSrcDst = srcLabels.diff(dstLabels)
> .map(label => (triplet.dstAttr.id, Vector[String](label)))
>   val msgsDstSrc = dstLabels.diff(dstLabels)
> .map(label => (triplet.srcAttr.id, Vector[String](label)))
>   msgsSrcDst.toIterator ++ msgsDstSrc.toIterator
> }
> def mergeMessage (m1: Vector[String], m2: Vector[String]): Vector[String] 
> = m1.union(m2).distinct
> val g = graph.pregel(Vector[String](startString))(vertexProgram, 
> sendMessage, mergeMessage)
> println("---pregel done---")
> println("vertex info:")
> g.vertices.foreach(
>   v => {
> val labels = v._2.labels
> println(
>   "vertex " + v._1 +
> ": name = " + v._2.id +
> ", labels = " + labels)
>   }
> )
>   }
> }
> {code}
> This code never terminates even though we expect it to. To fix, we simply 
> remove the "case" designation for the TestVertex class (see FIXME comment), 
> and then it behaves as expected.
> (Apologies if this has been fixed in later versions; we're unfortunately 
> pegged to 2.10.5 / 1.6.0 for now.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17734) inner equi-join shorthand that returns Datasets, like DataFrame already has

2016-09-29 Thread Leif Warner (JIRA)
Leif Warner created SPARK-17734:
---

 Summary: inner equi-join shorthand that returns Datasets, like 
DataFrame already has
 Key: SPARK-17734
 URL: https://issues.apache.org/jira/browse/SPARK-17734
 Project: Spark
  Issue Type: Wish
Reporter: Leif Warner
Priority: Minor


There's an existing ".join(right: Dataset[_], usingColumn: String): DataFrame" 
method on Dataset.

It would be appreciated if a variant that returns typed Datasets were also 
available.

If you write a join condition that references the common column name directly, an 
AnalysisException is thrown because the reference is ambiguous, e.g.:
$"foo" === $"foo"
So I wrote table1.toDF()("foo") === table2.toDF()("foo"), but that's a little 
error prone, and coworkers considered it a hack and didn't want to use it 
because it "mixes the DataFrame and Dataset APIs".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17733:
---
Attachment: constraints.png

> InferFiltersFromConstraints rule never terminates for query
> ---
>
> Key: SPARK-17733
> URL: https://issues.apache.org/jira/browse/SPARK-17733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
> Attachments: 
> SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png,
>  constraints.png
>
>
> The following (complicated) example becomes stuck in the 
> {{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
> with a stack overflow and doesn't hit the limit on optimization passes, so I 
> think there's some sort of non-obvious infinite loop within the rule itself.
> {code:title=Table Creation|borderStyle=solid}
>  -- Query #0
> CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, decimal2610_col_3, 
> boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, timestamp_col_8) 
> AS VALUES
>   (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
> TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
> TIMESTAMP('2007-01-13 00:00:00.0')),
>   (CAST(722.4906 AS FLOAT), true, 497.54BD, true, 
> TIMESTAMP('2015-12-14 00:00:00.0'), false, CAST(268 AS BIGINT), 
> TIMESTAMP('2021-04-19 00:00:00.0')),
>   (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
> TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
> TIMESTAMP('2019-10-16 00:00:00.0')),
>   (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
> TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
> TIMESTAMP)),
>   (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
> TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
> TIMESTAMP('2024-09-01 00:00:00.0')),
>   (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
> TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
> TIMESTAMP('2028-12-03 00:00:00.0')),
>   (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
> TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
> TIMESTAMP('1994-04-04 00:00:00.0')),
>   (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
> TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
> TIMESTAMP('2023-07-08 00:00:00.0')),
>   (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
> TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
> TIMESTAMP('2013-09-29 00:00:00.0')),
>   (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
> TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
> TIMESTAMP('1990-08-10 00:00:00.0'))
>  ;
>  -- Query #1
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS VALUES
>   (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
>   (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 00:00:00.0'), 
> CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
>   (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 00:00:00.0'), 
> CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 00:00:00.0'), 
> '211', -959, CAST(NULL AS STRING)),
>   (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
>   (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 00:00:00.0'), 
> CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, TIMESTAMP('2028-06-27 
> 00:00:00.0'), '-657', 948, '18'),
>   (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 00:00:00.0'), 
> CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 00:00:00.0'), 
> '-345', 566, '-574'),
>   (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 00:00:00.0'), 
> CAST(972 AS SMALLINT), true, CAST(NULL AS INT), TIMESTAMP('2026-06-10 
> 00:00:00.0'), '518', 683, '-320'),
>   (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142'),
>   (CAST(-836.513475295 AS DOUBLE), true, TIMESTAMP('2027-01-02 00:00:00.0'), 
> CAST(-446 AS SMALLINT), true, CAST(NULL AS INT), TIMESTAMP('1993-09-01 
> 

[jira] [Updated] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17733:
---
Attachment: 
SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png

> InferFiltersFromConstraints rule never terminates for query
> ---
>
> Key: SPARK-17733
> URL: https://issues.apache.org/jira/browse/SPARK-17733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
> Attachments: 
> SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png
>
>
> The following (complicated) example becomes stuck in the 
> {{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
> with a stack overflow and doesn't hit the limit on optimization passes, so I 
> think there's some sort of non-obvious infinite loop within the rule itself.
> {code:title=Table Creation|borderStyle=solid}
>  -- Query #0
> CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, decimal2610_col_3, 
> boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, timestamp_col_8) 
> AS VALUES
>   (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
> TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
> TIMESTAMP('2007-01-13 00:00:00.0')),
>   (CAST(722.4906 AS FLOAT), true, 497.54BD, true, 
> TIMESTAMP('2015-12-14 00:00:00.0'), false, CAST(268 AS BIGINT), 
> TIMESTAMP('2021-04-19 00:00:00.0')),
>   (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
> TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
> TIMESTAMP('2019-10-16 00:00:00.0')),
>   (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
> TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
> TIMESTAMP)),
>   (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
> TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
> TIMESTAMP('2024-09-01 00:00:00.0')),
>   (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
> TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
> TIMESTAMP('2028-12-03 00:00:00.0')),
>   (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
> TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
> TIMESTAMP('1994-04-04 00:00:00.0')),
>   (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
> TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
> TIMESTAMP('2023-07-08 00:00:00.0')),
>   (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
> TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
> TIMESTAMP('2013-09-29 00:00:00.0')),
>   (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
> TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
> TIMESTAMP('1990-08-10 00:00:00.0'))
>  ;
>  -- Query #1
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS VALUES
>   (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
>   (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 00:00:00.0'), 
> CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
>   (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 00:00:00.0'), 
> CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 00:00:00.0'), 
> '211', -959, CAST(NULL AS STRING)),
>   (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
>   (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 00:00:00.0'), 
> CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, TIMESTAMP('2028-06-27 
> 00:00:00.0'), '-657', 948, '18'),
>   (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 00:00:00.0'), 
> CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 00:00:00.0'), 
> '-345', 566, '-574'),
>   (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 00:00:00.0'), 
> CAST(972 AS SMALLINT), true, CAST(NULL AS INT), TIMESTAMP('2026-06-10 
> 00:00:00.0'), '518', 683, '-320'),
>   (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142'),
>   (CAST(-836.513475295 AS DOUBLE), true, TIMESTAMP('2027-01-02 

[jira] [Created] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-29 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17733:
--

 Summary: InferFiltersFromConstraints rule never terminates for 
query
 Key: SPARK-17733
 URL: https://issues.apache.org/jira/browse/SPARK-17733
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Josh Rosen
Priority: Critical
 Attachments: 
SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png

The following (complicated) example becomes stuck in the 
{{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
with a stack overflow and doesn't hit the limit on optimization passes, so I 
think there's some sort of non-obvious infinite loop within the rule itself.


{code:title=Table Creation|borderStyle=solid}
 -- Query #0

CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, decimal2610_col_3, 
boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, timestamp_col_8) 
AS VALUES
  (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
TIMESTAMP('2007-01-13 00:00:00.0')),
  (CAST(722.4906 AS FLOAT), true, 497.54BD, true, TIMESTAMP('2015-12-14 
00:00:00.0'), false, CAST(268 AS BIGINT), TIMESTAMP('2021-04-19 00:00:00.0')),
  (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
TIMESTAMP('2019-10-16 00:00:00.0')),
  (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
TIMESTAMP)),
  (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
TIMESTAMP('2024-09-01 00:00:00.0')),
  (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
TIMESTAMP('2028-12-03 00:00:00.0')),
  (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
TIMESTAMP('1994-04-04 00:00:00.0')),
  (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
TIMESTAMP('2023-07-08 00:00:00.0')),
  (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
TIMESTAMP('2013-09-29 00:00:00.0')),
  (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
TIMESTAMP('1990-08-10 00:00:00.0'))
 ;


 -- Query #1

CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
int_col_9, string_col_10) AS VALUES
  (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), TIMESTAMP('2012-10-19 
00:00:00.0'), CAST(9 AS SMALLINT), false, 77, TIMESTAMP('2014-07-01 
00:00:00.0'), '-945', -646, '722'),
  (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 00:00:00.0'), 
CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
  (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 00:00:00.0'), 
CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 00:00:00.0'), '211', 
-959, CAST(NULL AS STRING)),
  (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
CAST(NULL AS INT), '936'),
  (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 00:00:00.0'), 
CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, TIMESTAMP('2028-06-27 
00:00:00.0'), '-657', 948, '18'),
  (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 00:00:00.0'), 
CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 00:00:00.0'), '-345', 
566, '-574'),
  (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 00:00:00.0'), 
CAST(972 AS SMALLINT), true, CAST(NULL AS INT), TIMESTAMP('2026-06-10 
00:00:00.0'), '518', 683, '-320'),
  (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142'),
  (CAST(-836.513475295 AS DOUBLE), true, TIMESTAMP('2027-01-02 00:00:00.0'), 
CAST(-446 AS SMALLINT), true, CAST(NULL AS INT), TIMESTAMP('1993-09-01 
00:00:00.0'), '771', CAST(NULL AS INT), '977'),
  (CAST(-768.883638815 AS DOUBLE), false, TIMESTAMP('1994-02-11 00:00:00.0'), 
CAST(-244 AS SMALLINT), true, -493, TIMESTAMP('1994-01-02 00:00:00.0'), '-921', 
CAST(NULL AS INT), '-409')
 ;


 -- Query #2

CREATE TEMPORARY VIEW table_5(float_col_1, varchar0138_col_2, string_col_3, 
decimal2211_col_4, float_col_5, 

[jira] [Commented] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533935#comment-15533935
 ] 

Apache Spark commented on SPARK-17732:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15302

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17732:


Assignee: (was: Apache Spark)

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17732:


Assignee: Apache Spark

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-09-29 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-17732:
-

 Summary: ALTER TABLE DROP PARTITION should support comparators
 Key: SPARK-17732
 URL: https://issues.apache.org/jira/browse/SPARK-17732
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Dongjoon Hyun


This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
Apache Spark 2.0 for backward compatibility.

*Spark 1.6.2*
{code}
scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, quarter 
STRING)")
res0: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
res1: org.apache.spark.sql.DataFrame = [result: string]
{code}

*Spark 2.0*
{code}
scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, quarter 
STRING)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '<' expecting {')', ','}(line 1, pos 42)
{code}
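For context, a rough spark-shell sketch of the intended semantics once comparators 
are supported again (assumes a SparkSession `spark` with Hive support and the 
`sales` table created above; the partition values are made up):

{code}
spark.sql("ALTER TABLE sales ADD PARTITION (country='IN', quarter='1')")
spark.sql("ALTER TABLE sales ADD PARTITION (country='KR', quarter='1')")
spark.sql("ALTER TABLE sales ADD PARTITION (country='US', quarter='1')")

// With comparator support this should drop only the partitions whose `country`
// value sorts before 'KR' (here country='IN'), matching the Spark 1.6 behavior
// instead of failing with a ParseException.
spark.sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
{code}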



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17699) from_json function for parsing json Strings into Structs

2016-09-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17699.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15274
[https://github.com/apache/spark/pull/15274]

> from_json function for parsing json Strings into Structs
> 
>
> Key: SPARK-17699
> URL: https://issues.apache.org/jira/browse/SPARK-17699
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 2.1.0
>
>
> Today, we have good support for reading standalone JSON data.  However, 
> sometimes (especially when reading from streaming sources such as Kafka) the 
> JSON is embedded in an envelope that has other information we'd like to 
> preserve.  It would be nice if we could also parse JSON string columns, while 
> preserving the original JSON schema. 
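For illustration, a rough sketch of the kind of usage this enables in a Spark 2.1 
spark-shell (the envelope columns, payload schema, and sample data here are made 
up):

{code}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import spark.implicits._   // `spark` is the SparkSession provided by spark-shell

// Envelope rows as they might arrive from Kafka: metadata plus a JSON payload string.
val envelope = Seq(("topic-a", """{"id": 1, "name": "alice"}""")).toDF("source", "payload")

val payloadSchema = new StructType().add("id", IntegerType).add("name", StringType)

// Parse the JSON string column into a struct while keeping the envelope columns.
val parsed = envelope.select($"source", from_json($"payload", payloadSchema).as("payload"))
parsed.printSchema()
{code}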



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17715) Log INFO per task launch creates a large driver log

2016-09-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-17715.
---
  Resolution: Fixed
Assignee: Brian Cho
   Fix Version/s: 2.1.0
Target Version/s: 2.1.0

> Log INFO per task launch creates a large driver log
> ---
>
> Key: SPARK-17715
> URL: https://issues.apache.org/jira/browse/SPARK-17715
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Brian Cho
>Assignee: Brian Cho
>Priority: Trivial
> Fix For: 2.1.0
>
>
> CoarseGrainedSchedulerBackend logs an INFO message every time a task is 
> launched. Previously, only executor-related events were logged at INFO.
> I propose dropping the task logging to a lower level, as task launches can 
> happen orders of magnitude more often than executor registrations. (I've found 
> the executor logs very useful, so I don't want to block out both via log4j 
> configuration.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17672) Spark 2.0 history server web Ui takes too long for a single application

2016-09-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-17672:
--
Assignee: Gang Wu

> Spark 2.0 history server web Ui takes too long for a single application
> ---
>
> Key: SPARK-17672
> URL: https://issues.apache.org/jira/browse/SPARK-17672
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Gang Wu
>Assignee: Gang Wu
> Fix For: 2.0.1
>
>
> When there are 10K application histories in the history server back end, it can 
> take a very long time to load even a single application history page. After 
> some investigation, I found the root cause was the following piece of code: 
> {code:title=OneApplicationResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class OneApplicationResource(uiRoot: UIRoot) {
>   @GET
>   def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
> val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
> apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
>   }
> }
> {code}
> Although all application history infos are stored in a LinkedHashMap, the code 
> here transforms the map into an iterator and then uses the find() api, which is 
> O(n) instead of the O(1) of a map.get() operation.
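A minimal standalone Scala sketch of the lookup pattern being pointed at (not the 
actual fix; the record type and data are simplified stand-ins):

{code}
// Simplified stand-in for the history server's application info records.
case class AppInfo(id: String, name: String)

val apps: Map[String, AppInfo] =
  (1 to 10000).map(i => s"app-$i" -> AppInfo(s"app-$i", s"job $i")).toMap

// O(n): materialize the values and scan them for a matching id
// (the shape of the current find()-based lookup).
def findByScan(appId: String): Option[AppInfo] = apps.values.find(_.id == appId)

// O(1) expected: look the id up directly in the keyed map.
def findByKey(appId: String): Option[AppInfo] = apps.get(appId)
{code}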



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17672) Spark 2.0 history server web Ui takes too long for a single application

2016-09-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-17672.
---
  Resolution: Fixed
   Fix Version/s: 2.0.1
Target Version/s: 2.0.1

> Spark 2.0 history server web Ui takes too long for a single application
> ---
>
> Key: SPARK-17672
> URL: https://issues.apache.org/jira/browse/SPARK-17672
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Gang Wu
> Fix For: 2.0.1
>
>
> When there are 10K application histories in the history server back end, it can 
> take a very long time to load even a single application history page. After 
> some investigation, I found the root cause was the following piece of code: 
> {code:title=OneApplicationResource.scala|borderStyle=solid}
> @Produces(Array(MediaType.APPLICATION_JSON))
> private[v1] class OneApplicationResource(uiRoot: UIRoot) {
>   @GET
>   def getApp(@PathParam("appId") appId: String): ApplicationInfo = {
> val apps = uiRoot.getApplicationInfoList.find { _.id == appId }
> apps.getOrElse(throw new NotFoundException("unknown app: " + appId))
>   }
> }
> {code}
> Although all application history infos are stored in a LinkedHashMap, the code 
> here transforms the map into an iterator and then uses the find() api, which is 
> O(n) instead of the O(1) of a map.get() operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17731) Metrics for Structured Streaming

2016-09-29 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-17731:
-

Assignee: Tathagata Das

> Metrics for Structured Streaming
> 
>
> Key: SPARK-17731
> URL: https://issues.apache.org/jira/browse/SPARK-17731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Metrics are needed for monitoring structured streaming apps. Here is the 
> design doc for implementing the necessary metrics.
> https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17731) Metrics for Structured Streaming

2016-09-29 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-17731:
-

 Summary: Metrics for Structured Streaming
 Key: SPARK-17731
 URL: https://issues.apache.org/jira/browse/SPARK-17731
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Tathagata Das


Metrics are needed for monitoring structured streaming apps. Here is the design 
doc for implementing the necessary metrics.
https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17648) TaskSchedulerImpl.resourceOffers should take an IndexedSeq, not a Seq

2016-09-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-17648.
---
  Resolution: Fixed
   Fix Version/s: 2.1.0
Target Version/s: 2.1.0

> TaskSchedulerImpl.resourceOffers should take an IndexedSeq, not a Seq
> -
>
> Key: SPARK-17648
> URL: https://issues.apache.org/jira/browse/SPARK-17648
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.1.0
>
>
> {{TaskSchedulerImpl.resourceOffers}} takes in a {{Seq[WorkerOffer]}}.  
> However, later on it indexes into this by position.  If you don't pass in an 
> {{IndexedSeq}}, this turns an O(n) operation into an O(n^2) operation.
> In practice this isn't an issue, since, just by chance, the data structures at 
> the important call sites already happen to be {{IndexedSeq}}s.  
> But we ought to tighten up the types to make this clearer.  I ran into 
> this while doing some performance tests on the scheduler, and performance was 
> terrible when I passed in a plain {{Seq}}: even a few hundred offers were 
> scheduled very slowly.
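A standalone Scala sketch of the cost difference behind the change (illustrative 
only, not the scheduler code itself):

{code}
// Positional access is effectively O(1) on an IndexedSeq such as Vector, but
// O(i) on a linear Seq such as List, so an index-driven loop over a List is
// O(n^2) overall.
def sumByIndex(xs: Seq[Int]): Long = {
  var total = 0L
  var i = 0
  while (i < xs.length) {
    total += xs(i)   // cheap for Vector, walks from the head for List
    i += 1
  }
  total
}

val asList: Seq[Int]          = List.fill(100000)(1)   // xs(i) costs O(i)
val asVector: IndexedSeq[Int] = Vector.fill(100000)(1) // xs(i) is effectively O(1)

sumByIndex(asVector)  // fast
sumByIndex(asList)    // noticeably slower for large n

// Tightening the parameter type makes the cheap-indexing requirement explicit:
def sumIndexed(xs: IndexedSeq[Int]): Long = sumByIndex(xs)
{code}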



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17666:
---
Target Version/s: 2.0.1, 2.1.0  (was: 2.0.2, 2.1.0)

> take() or isEmpty() on dataset leaks s3a connections
> 
>
> Key: SPARK-17666
> URL: https://issues.apache.org/jira/browse/SPARK-17666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: ubuntu/centos, java 7, java 8, spark 2.0, java api
>Reporter: Igor Berman
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 2.0.1, 2.1.0
>
>
> I'm experiencing problems with s3a when working with parquet via the dataset api.
> The symptom of the problem is tasks failing with 
> {code}
> Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout 
> waiting for connection from pool
>   at 
> org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:232)
>   at 
> org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:199)
>   at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> {code}
> Checking CoarseGrainedExecutorBackend with lsof gave me many sockets in 
> CLOSE_WAIT state
> reproduction of problem:
> {code}
> package com.test;
> import java.text.ParseException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> public class ConnectionLeakTest {
>   public static void main(String[] args) throws ParseException {
>   SparkConf sparkConf = new SparkConf();
>   sparkConf.setMaster("local[*]");
>   sparkConf.setAppName("Test");
>   sparkConf.set("spark.local.dir", "/tmp/spark");
>   sparkConf.set("spark.sql.shuffle.partitions", "2");
>   SparkSession session = 
> SparkSession.builder().config(sparkConf).getOrCreate();
> //set your credentials to your bucket
>   for (int i = 0; i < 100; i++) {
>   Dataset<Row> df = session
>   .sqlContext()
>   .read()
>   .parquet("s3a://test/*");//contains 
> multiple snappy compressed parquet files
>   if (df.rdd().isEmpty()) {//same problem with 
> takeAsList().isEmpty()
>   System.out.println("Yes");
>   } else {
>   System.out.println("No");
>   }
>   }
>   System.out.println("Done");
>   }
> }
> {code}
> So when the program runs, you can get the pid with jps and run lsof -p <pid> | grep https,
> and you'll see a constant growth of CLOSE_WAITs.
> Our way to bypass the problem is to use count() == 0.
> In addition, we've seen that df.dropDuplicates("a","b").rdd().isEmpty() 
> doesn't trigger the problem either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17666) take() or isEmpty() on dataset leaks s3a connections

2016-09-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-17666:
--

Assignee: Josh Rosen

> take() or isEmpty() on dataset leaks s3a connections
> 
>
> Key: SPARK-17666
> URL: https://issues.apache.org/jira/browse/SPARK-17666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: ubuntu/centos, java 7, java 8, spark 2.0, java api
>Reporter: Igor Berman
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 2.0.1, 2.1.0
>
>
> I'm experiencing problems with s3a when working with parquet via the dataset api.
> The symptom of the problem is tasks failing with 
> {code}
> Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout 
> waiting for connection from pool
>   at 
> org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:232)
>   at 
> org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:199)
>   at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> {code}
> Checking CoarseGrainedExecutorBackend with lsof gave me many sockets in 
> CLOSE_WAIT state
> reproduction of problem:
> {code}
> package com.test;
> import java.text.ParseException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> public class ConnectionLeakTest {
>   public static void main(String[] args) throws ParseException {
>   SparkConf sparkConf = new SparkConf();
>   sparkConf.setMaster("local[*]");
>   sparkConf.setAppName("Test");
>   sparkConf.set("spark.local.dir", "/tmp/spark");
>   sparkConf.set("spark.sql.shuffle.partitions", "2");
>   SparkSession session = 
> SparkSession.builder().config(sparkConf).getOrCreate();
> //set your credentials to your bucket
>   for (int i = 0; i < 100; i++) {
>   Dataset<Row> df = session
>   .sqlContext()
>   .read()
>   .parquet("s3a://test/*");//contains 
> multiple snappy compressed parquet files
>   if (df.rdd().isEmpty()) {//same problem with 
> takeAsList().isEmpty()
>   System.out.println("Yes");
>   } else {
>   System.out.println("No");
>   }
>   }
>   System.out.println("Done");
>   }
> }
> {code}
> So when the program runs, you can get the pid with jps and run lsof -p <pid> | grep https,
> and you'll see a constant growth of CLOSE_WAITs.
> Our way to bypass the problem is to use count() == 0.
> In addition, we've seen that df.dropDuplicates("a","b").rdd().isEmpty() 
> doesn't trigger the problem either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17712) Incorrect result due to invalid pushdown of data-independent filter beneath aggregate

2016-09-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17712:
---
Fix Version/s: 2.0.2

> Incorrect result due to invalid pushdown of data-independent filter beneath 
> aggregate
> -
>
> Key: SPARK-17712
> URL: https://issues.apache.org/jira/browse/SPARK-17712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.0.2
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>  Labels: correctness
> Fix For: 2.0.2, 2.1.0
>
>
> Let {{diamonds}} be a non-empty table. The following two queries should both 
> return no rows, but the first returns a single row:
> {code}
> SELECT
> 1
> FROM (
> SELECT
> count(*)
> FROM diamonds
> ) t1
> WHERE
> false
> {code}
> {code}
> SELECT
> 1
> FROM (
> SELECT
> *
> FROM diamonds
> ) t1
> WHERE
> false
> {code}
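For intuition, a rough spark-shell sketch (assuming the SparkSession `spark` and a 
registered, non-empty `diamonds` table) of why pushing the data-independent filter 
beneath the aggregate changes the result: a global aggregate over an empty input 
still produces one row.

{code}
// Original query: the filter applies to the single row produced by the
// aggregate, so no rows should survive.
val original = spark.sql(
  "SELECT 1 FROM (SELECT count(*) FROM diamonds) t1 WHERE false")

// What the invalid pushdown effectively evaluates: count(*) over zero input
// rows still yields one row (with value 0), so the outer query returns one row.
val pushedDown = spark.sql(
  "SELECT 1 FROM (SELECT count(*) FROM diamonds WHERE false) t1")

original.count()    // expected 0; the buggy optimizer returned 1 here
pushedDown.count()  // 1, illustrating where the incorrect row comes from
{code}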



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


