[jira] [Assigned] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job

2017-02-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19650:
---

Assignee: Herman van Hovell  (was: Sameer Agarwal)

> Metastore-only operations shouldn't trigger a spark job
> ---
>
> Key: SPARK-19650
> URL: https://issues.apache.org/jira/browse/SPARK-19650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sameer Agarwal
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> We currently trigger a Spark job even for simple metastore operations ({{SHOW 
> TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}}, etc.). Even though these 
> commands are otherwise executed on the driver, triggering a job prevents a 
> user from running them on a driver-only cluster.
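To make the scope concrete, here is a minimal spark-shell sketch (the table name is invented) of the kind of metastore-only commands in question:
{code}
// Metastore-only commands issued from the driver; ideally none of these
// should need executors, so none of them should launch a Spark job.
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES").show()
spark.sql("CREATE TABLE demo_t (id INT) USING parquet")
{code}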



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job

2017-02-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19650.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17027
[https://github.com/apache/spark/pull/17027]

> Metastore-only operations shouldn't trigger a spark job
> ---
>
> Key: SPARK-19650
> URL: https://issues.apache.org/jira/browse/SPARK-19650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> We currently trigger a Spark job even for simple metastore operations ({{SHOW 
> TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}}, etc.). Even though these 
> commands are otherwise executed on the driver, triggering a job prevents a 
> user from running them on a driver-only cluster.






[jira] [Resolved] (SPARK-19735) Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19735.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17063
[https://github.com/apache/spark/pull/17063]

> Remove HOLD_DDLTIME from Catalog APIs
> -
>
> Key: SPARK-19735
> URL: https://issues.apache.org/jira/browse/SPARK-19735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> As explained in Hive JIRA https://issues.apache.org/jira/browse/HIVE-12224,  
> HOLD_DDLTIME was broken as soon as it landed. Hive 2.0 removes HOLD_DDLTIME 
> from the API. In Spark SQL, we always set it to FALSE. Like Hive, we should 
> also remove it from our Catalog APIs. 






[jira] [Commented] (SPARK-17075) Cardinality Estimation of Predicate Expressions

2017-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884084#comment-15884084
 ] 

Apache Spark commented on SPARK-17075:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17065

> Cardinality Estimation of Predicate Expressions
> ---
>
> Key: SPARK-17075
> URL: https://issues.apache.org/jira/browse/SPARK-17075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Ron Hu
> Fix For: 2.2.0
>
>
> A filter condition is the predicate expression specified in the WHERE clause 
> of a SQL SELECT statement. A predicate can be a compound logical expression 
> that combines multiple single conditions with the logical AND, OR, and NOT 
> operators. A single condition usually uses comparison operators such as =, <, 
> <=, >, >=, LIKE, etc.
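For illustration, a hedged example (the table and column names are invented) of the kind of compound predicate whose selectivity such an estimator has to reason about:
{code}
// Illustrative only: a compound predicate mixing AND, OR, NOT and several
// single-condition comparison operators.
spark.sql("""
  SELECT *
  FROM orders
  WHERE (amount >= 100 AND amount < 500)
     OR (status = 'open' AND NOT (customer LIKE 'test%'))
""")
{code}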






[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-02-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, it may take the analyzer a long time to realize 
that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocations
# Look up the function names from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, it may take the analyzer a long time to realize 
that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an invalid 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time to 
> realize that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.




[jira] [Updated] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-02-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19737:
---
Description: 
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, it may take the analyzer a long time to realize 
that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't actually try to resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly and make sure it gets executed before {{ResolveRelations}}.

  was:
Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, it may take the analyzer a long time to realize 
that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't try to actually resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly.


> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that references an invalid 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time to 
> realize that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocation
> # Look up the function name from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.




[jira] [Assigned] (SPARK-19736) refreshByPath should clear all cached plans with the specified path

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19736:


Assignee: (was: Apache Spark)

> refreshByPath should clear all cached plans with the specified path
> ---
>
> Key: SPARK-19736
> URL: https://issues.apache.org/jira/browse/SPARK-19736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>
> Catalog.refreshByPath can refresh the cache entry and the associated metadata 
> for all DataFrames (if any) that contain the given data source path. 
> However, CacheManager.invalidateCachedPath doesn't clear all cached plans 
> with the specified path. This causes some of the strange behaviors reported 
> in SPARK-15678.






[jira] [Commented] (SPARK-19736) refreshByPath should clear all cached plans with the specified path

2017-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884065#comment-15884065
 ] 

Apache Spark commented on SPARK-19736:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/17064

> refreshByPath should clear all cached plans with the specified path
> ---
>
> Key: SPARK-19736
> URL: https://issues.apache.org/jira/browse/SPARK-19736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>
> Catalog.refreshByPath can refresh the cache entry and the associated metadata 
> for all DataFrames (if any) that contain the given data source path. 
> However, CacheManager.invalidateCachedPath doesn't clear all cached plans 
> with the specified path. This causes some of the strange behaviors reported 
> in SPARK-15678.






[jira] [Assigned] (SPARK-19736) refreshByPath should clear all cached plans with the specified path

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19736:


Assignee: Apache Spark

> refreshByPath should clear all cached plans with the specified path
> ---
>
> Key: SPARK-19736
> URL: https://issues.apache.org/jira/browse/SPARK-19736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Catalog.refreshByPath can refresh the cache entry and the associated metadata 
> for all DataFrames (if any) that contain the given data source path. 
> However, CacheManager.invalidateCachedPath doesn't clear all cached plans 
> with the specified path. This causes some of the strange behaviors reported 
> in SPARK-15678.






[jira] [Created] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2017-02-24 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19737:
--

 Summary: New analysis rule for reporting unregistered functions 
without relying on relation resolution
 Key: SPARK-19737
 URL: https://issues.apache.org/jira/browse/SPARK-19737
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
 Fix For: 2.2.0


Let's consider the following simple SQL query that references an invalid 
function {{foo}} that is never registered in the function registry:
{code:sql}
SELECT foo(a) FROM t
{code}
Assuming table {{t}} is a partitioned temporary view consisting of a large 
number of files stored on S3, it may take the analyzer a long time to realize 
that {{foo}} is not registered yet.

The reason is that the existing analysis rule {{ResolveFunctions}} requires all 
child expressions to be resolved first. Therefore, {{ResolveRelations}} has to 
be executed first to resolve all columns referenced by the unresolved function 
invocation. This further leads to partition discovery for {{t}}, which may take 
a long time.

To address this case, we propose a new lightweight analysis rule 
{{LookupFunctions}} that
# Matches all unresolved function invocation
# Look up the function name from the function registry
# Report analysis error for any unregistered functions

Since this rule doesn't try to actually resolve the unresolved functions, it 
doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
discovery.

We may put this analysis rule in a separate {{Once}} rule batch that sits 
between the "Substitution" batch and the "Resolution" batch to avoid running it 
repeatedly.
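To make "registered in the function registry" concrete, a minimal sketch (the table and function names are invented) of the difference the proposed {{LookupFunctions}} rule would care about:
{code}
// Minimal sketch; `t` and the function names are invented for illustration.
spark.udf.register("bar", (x: Int) => x + 1)  // `bar` is now in the function registry
spark.sql("SELECT bar(a) FROM t")             // resolvable once t's columns are known
spark.sql("SELECT foo(a) FROM t")             // `foo` was never registered; the proposed
                                              // rule would reject this before relation
                                              // resolution (and partition discovery) runs
{code}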






[jira] [Commented] (SPARK-15678) Not use cache on appends and overwrites

2017-02-24 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884061#comment-15884061
 ] 

Liang-Chi Hsieh commented on SPARK-15678:
-

[~kiszk][~gen] I created SPARK-19736 for the reported issue and have also 
submitted a PR.

> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> Spark SQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- we 
> are still using the cached dataset
> {code}






[jira] [Created] (SPARK-19736) refreshByPath should clear all cached plans with the specified path

2017-02-24 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-19736:
---

 Summary: refreshByPath should clear all cached plans with the 
specified path
 Key: SPARK-19736
 URL: https://issues.apache.org/jira/browse/SPARK-19736
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Liang-Chi Hsieh


Catalog.refreshByPath can refresh the cache entry and the associated metadata 
for all DataFrames (if any) that contain the given data source path. 

However, CacheManager.invalidateCachedPath doesn't clear all cached plans with 
the specified path. This causes some of the strange behaviors reported in 
SPARK-15678.
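For reference, a minimal usage sketch of the public entry point involved (the path and table layout are invented):
{code}
// Usage sketch: refreshByPath should invalidate every cached plan whose data
// source lives under the given path, not just some of them.
val path = "/data/events"
val df = spark.read.parquet(path).cache()
df.count()                          // materializes the cache
spark.catalog.refreshByPath(path)   // expected: all cached plans under `path` are cleared
{code}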






[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets

2017-02-24 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884032#comment-15884032
 ] 

Wenchen Fan commented on SPARK-19352:
-

I'm going to mark this as `Not a Problem`. Spark doesn't guarantee any sort 
order when writing data out. Although the data currently ends up sorted as you 
expected in your example, that behavior depends on implementation details and 
may change in the future.
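That said, a hedged sketch of an arrangement that is sometimes suggested for this pattern (reusing the reporter's {{inpth}} variable from the example quoted below): include the partition column itself in the per-partition sort, so the writer has less reason to reorder rows when grouping them by partition value. This is still implementation-dependent behavior, not a guarantee.
{code}
// Sketch only -- not a guaranteed contract: sort by the partition column
// first, then by timestamp, before writing with partitionBy.
spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(inpth)
  .repartition($"userId")
  .sortWithinPartitions("userId", "timestamp")
  .write
  .partitionBy("userId")
  .option("header", "true")
  .csv(inpth + "_sorted")
{code}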

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen with a smaller dataset (354MB, same number of columns) 
> produced by making the time span smaller.






[jira] [Resolved] (SPARK-19352) Sorting issues on relatively big datasets

2017-02-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19352.
-
Resolution: Not A Problem

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen with a smaller dataset (354MB, same number of columns) 
> produced by making the time span smaller.






[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets

2017-02-24 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884016#comment-15884016
 ] 

Liang-Chi Hsieh commented on SPARK-19352:
-

I think this is in fact solved by SPARK-19563. [~cloud_fan]

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen with a smaller dataset (354MB, same number of columns) 
> produced by making the time span smaller.






[jira] [Commented] (SPARK-19735) Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884012#comment-15884012
 ] 

Apache Spark commented on SPARK-19735:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17063

> Remove HOLD_DDLTIME from Catalog APIs
> -
>
> Key: SPARK-19735
> URL: https://issues.apache.org/jira/browse/SPARK-19735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> As explained in Hive JIRA https://issues.apache.org/jira/browse/HIVE-12224,  
> HOLD_DDLTIME was broken as soon as it landed. Hive 2.0 removes HOLD_DDLTIME 
> from the API. In Spark SQL, we always set it to FALSE. Like Hive, we should 
> also remove it from our Catalog APIs. 






[jira] [Assigned] (SPARK-19735) Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19735:


Assignee: Xiao Li  (was: Apache Spark)

> Remove HOLD_DDLTIME from Catalog APIs
> -
>
> Key: SPARK-19735
> URL: https://issues.apache.org/jira/browse/SPARK-19735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> As explained in Hive JIRA https://issues.apache.org/jira/browse/HIVE-12224,  
> HOLD_DDLTIME was broken as soon as it landed. Hive 2.0 removes HOLD_DDLTIME 
> from the API. In Spark SQL, we always set it to FALSE. Like Hive, we should 
> also remove it from our Catalog APIs. 






[jira] [Assigned] (SPARK-19735) Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19735:


Assignee: Apache Spark  (was: Xiao Li)

> Remove HOLD_DDLTIME from Catalog APIs
> -
>
> Key: SPARK-19735
> URL: https://issues.apache.org/jira/browse/SPARK-19735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> As explained in Hive JIRA https://issues.apache.org/jira/browse/HIVE-12224,  
> HOLD_DDLTIME was broken as soon as it landed. Hive 2.0 removes HOLD_DDLTIME 
> from the API. In Spark SQL, we always set it to FALSE. Like Hive, we should 
> also remove it from our Catalog APIs. 






[jira] [Created] (SPARK-19735) Remove HOLD_DDLTIME from Catalog APIs

2017-02-24 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19735:
---

 Summary: Remove HOLD_DDLTIME from Catalog APIs
 Key: SPARK-19735
 URL: https://issues.apache.org/jira/browse/SPARK-19735
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


As explained in Hive JIRA https://issues.apache.org/jira/browse/HIVE-12224,  
HOLD_DDLTIME was broken as soon as it landed. Hive 2.0 removes HOLD_DDLTIME 
from the API. In Spark SQL, we always set it to FALSE. Like Hive, we should 
also remove it from our Catalog APIs. 
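As a rough illustration of the shape of the change (the method and parameter names below are assumptions for the sketch, not Spark's actual signatures), the cleanup amounts to dropping an always-false flag from the catalog's load methods:
{code}
// Illustrative sketch only -- signatures are assumed, not Spark's actual API.
// Before: callers always passed holdDDLTime = false.
def loadTable(db: String, table: String, loadPath: String,
              isOverwrite: Boolean, holdDDLTime: Boolean): Unit = ???

// After: the dead parameter is removed from the Catalog API.
def loadTable(db: String, table: String, loadPath: String,
              isOverwrite: Boolean): Unit = ???
{code}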






[jira] [Comment Edited] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2017-02-24 Thread Lin Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884007#comment-15884007
 ] 

Lin Ma edited comment on SPARK-18281 at 2/25/17 4:13 AM:
-

Is this bug really resolved? I am using the latest 2.1.0 release and seeing the 
same timeout behaviour as [~mwdus...@us.ibm.com] described. The iterator works 
fine in local mode, but as soon as I move to a cluster it times out. I also 
tested with Mike's example and was able to confirm the issue. Can someone point 
me to a fix or an alternative with similar functionality?

Thanks!


was (Author: realno):
Is this bug really resolved? I am using the latest 2.1.0 release and having the 
same timeout behaviour as [~mwdus...@us.ibm.com] described. When using the 
iterator in local mode it works fine but as soon as moving to the cluster it 
will timeout. I also tested with Mike's example and was able to validate it. 
Can someone point me to a fix or an alternative with similar functionality?

Thanks!

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.3, 2.1.1
>
>
> I ran the example straight out of the API docs for toLocalIterator and it 
> gives a timeout exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}




[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2017-02-24 Thread Lin Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884007#comment-15884007
 ] 

Lin Ma commented on SPARK-18281:


Is this bug really resolved? I am using the latest 2.1.0 release and seeing the 
same timeout behaviour as [~mwdus...@us.ibm.com] described. The iterator works 
fine in local mode, but as soon as I move to a cluster it times out. I also 
tested with Mike's example and was able to confirm the issue. Can someone point 
me to a fix or an alternative with similar functionality?

Thanks!

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.3, 2.1.1
>
>
> I ran the example straight out of the API docs for toLocalIterator and it 
> gives a timeout exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}






[jira] [Resolved] (SPARK-14079) Limit the number of queries on SQL UI

2017-02-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14079.
---
Resolution: Not A Problem

> Limit the number of queries on SQL UI
> -
>
> Key: SPARK-14079
> URL: https://issues.apache.org/jira/browse/SPARK-14079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The SQL UI becomes very, very slow if there are hundreds of SQL queries on it.
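For anyone hitting this slowdown, a hedged pointer: the number of completed executions the SQL UI retains is capped by a configuration, shown below with an illustrative value (it must be set before the SparkSession/SparkContext is created):
{code}
// Sketch: keep the SQL tab small by retaining fewer finished executions.
// The value 200 is illustrative.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("sql-ui-demo")
  .config("spark.sql.ui.retainedExecutions", "200")
  .getOrCreate()
{code}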






[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI

2017-02-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883968#comment-15883968
 ] 

Hyukjin Kwon commented on SPARK-14079:
--

I am adding a link. Please correct this if wrong.

> Limit the number of queries on SQL UI
> -
>
> Key: SPARK-14079
> URL: https://issues.apache.org/jira/browse/SPARK-14079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The SQL UI becomes very, very slow if there are hundreds of SQL queries on it.






[jira] [Comment Edited] (SPARK-14079) Limit the number of queries on SQL UI

2017-02-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883966#comment-15883966
 ] 

Hyukjin Kwon edited comment on SPARK-14079 at 2/25/17 2:47 AM:
---

[~zsxwing], I am just curious if this JIRA is resolvable then.


was (Author: hyukjin.kwon):
[~shixi...@databricks.com], I am just curious if this JIRA is resolvable then.

> Limit the number of queries on SQL UI
> -
>
> Key: SPARK-14079
> URL: https://issues.apache.org/jira/browse/SPARK-14079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The SQL UI becomes very, very slow if there are hundreds of SQL queries on it.






[jira] [Commented] (SPARK-14079) Limit the number of queries on SQL UI

2017-02-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883966#comment-15883966
 ] 

Hyukjin Kwon commented on SPARK-14079:
--

[~shixi...@databricks.com], I am just curious if this JIRA is resolvable then.

> Limit the number of queries on SQL UI
> -
>
> Key: SPARK-14079
> URL: https://issues.apache.org/jira/browse/SPARK-14079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The SQL UI becomes very, very slow if there are hundreds of SQL queries on it.






[jira] [Commented] (SPARK-19734) OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

2017-02-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883953#comment-15883953
 ] 

Sean Owen commented on SPARK-19734:
---

Agreed, feel free to open a PR to fix it.

> OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst
> -
>
> Key: SPARK-19734
> URL: https://issues.apache.org/jira/browse/SPARK-19734
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.0
>Reporter: Corey
>Priority: Minor
>  Labels: documentation, easyfix
>
> The {{OneHotEncoder.__init__}} doc string in PySpark has an input keyword 
> listed as {{includeFirst}}, whereas the code actually uses {{dropLast}}.
> This is especially confusing because the {{__init__}} function accepts only 
> keyword arguments, so following the documentation on the web 
> (https://spark.apache.org/docs/2.0.1/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
>  or the output of {{help}} in Python will result in the error:
> {quote}
> TypeError: __init__() got an unexpected keyword argument 'includeFirst'
> {quote}
> The error is immediately viewable in the source code:
> {code}
> @keyword_only
> def __init__(self, dropLast=True, inputCol=None, outputCol=None):
> """
> __init__(self, includeFirst=True, inputCol=None, outputCol=None)
> """
> {code}






[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883951#comment-15883951
 ] 

Apache Spark commented on SPARK-17495:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17062

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the hash used by Hive. For queries which use bucketing, this leads to 
> different results if one runs the same query on both engines. We want users 
> to have backward compatibility so that one can switch parts of an application 
> across the engines without observing regressions.
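For context, Spark already exposes its Murmur3-based hash as a SQL function; a small sketch of where the divergence shows up (the corresponding Hive values are not reproduced here, only the point that they can differ for bucketing):
{code}
// Spark's built-in hash() is Murmur3-based; Hive computes bucket hashes
// differently, so the same values can land in different buckets on each engine.
spark.sql("SELECT hash(42), hash('spark')").show()
{code}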






[jira] [Assigned] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13446:


Assignee: Apache Spark

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Apache Spark
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly. However, it only supports Hive 1.2.1 and older. Since Hive 2.0.0 
> has been released, it would be better to add support for Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}
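For reference, a hedged sketch of how a newer metastore is typically targeted once it is supported; the configuration keys are the standard Spark SQL Hive options, while the version value and jar path below are illustrative:
{code}
// Sketch (values are illustrative): point Spark at a Hive 2.0.0 metastore once
// that version is accepted by spark.sql.hive.metastore.version.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("hive2-metastore-demo")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "2.0.0")
  .config("spark.sql.hive.metastore.jars", "/opt/hive-2.0.0/lib/*")
  .getOrCreate()
{code}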






[jira] [Assigned] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13446:


Assignee: (was: Apache Spark)

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly. However, it only supports Hive 1.2.1 and older. Since Hive 2.0.0 
> has been released, it would be better to add support for Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}






[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2017-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883943#comment-15883943
 ] 

Apache Spark commented on SPARK-13446:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17061

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly. However, it only supports Hive 1.2.1 and older. Since Hive 2.0.0 
> has been released, it would be better to add support for Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}






[jira] [Resolved] (SPARK-19725) different parquet dependency in spark2.x and Hive2.x cause failure of HoS when using parquet file format

2017-02-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19725.
---
Resolution: Not A Problem

Hive 2 isn't supported, is it?
Spark is already on Parquet 1.8.

> different parquet dependency in spark2.x and Hive2.x cause failure of HoS 
> when using parquet file format
> 
>
> Key: SPARK-19725
> URL: https://issues.apache.org/jira/browse/SPARK-19725
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hive2.2
> hadoop2.7.1
>Reporter: KaiXu
>
> The Parquet version in Hive 2.x is 1.8.1 while in Spark 2.x it is 1.7.0, so 
> running HoS queries that use the Parquet file format can encounter jar 
> conflict problems:
> Starting Spark Job = d1f6825c-48ea-45b8-9614-4266f2d1f0bd
> Job failed with java.lang.NoSuchMethodError: 
> org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
> FAILED: Execution Error, return code 3 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask. 
> java.util.concurrent.ExecutionException: Exception thrown by job
> at 
> org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
> at 
> org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
> at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:362)
> at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in 
> stage 1.0 (TID 9, hsx-node7): java.lang.RuntimeException: Error processing 
> row: java.lang.NoSuchMethodError: 
> org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
> at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
> at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
> at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
> at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976)
> at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:100)
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:56)
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertTypes(HiveSchemaConverter.java:50)
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convert(HiveSchemaConverter.java:39)
> at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:115)
> at 
> 

[jira] [Created] (SPARK-19734) OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

2017-02-24 Thread Corey (JIRA)
Corey created SPARK-19734:
-

 Summary: OneHotEncoder __init__ uses dropLast but doc strings all 
say includeFirst
 Key: SPARK-19734
 URL: https://issues.apache.org/jira/browse/SPARK-19734
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 2.1.0, 2.0.2, 1.6.3, 1.5.2
Reporter: Corey
Priority: Minor


The {{OneHotEncoder.__init__}} doc string in PySpark has an input keyword 
listed as {{includeFirst}}, whereas the code actually uses {{dropLast}}.

This is especially confusing because the {{__init__}} function accepts only 
keywords, and following the documentation on the web 
(https://spark.apache.org/docs/2.0.1/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
 or the output of {{help}} in Python will result in the error:
{quote}
TypeError: __init__() got an unexpected keyword argument 'includeFirst'
{quote}

The error is immediately viewable in the source code:
{code}
@keyword_only
def __init__(self, dropLast=True, inputCol=None, outputCol=None):
    """
    __init__(self, includeFirst=True, inputCol=None, outputCol=None)
    """
{code}
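
For reference, a minimal PySpark usage sketch consistent with the actual constructor signature above (the column names are made up for illustration, and an active {{spark}} session is assumed):
{code}
from pyspark.ml.feature import OneHotEncoder

# The constructor keyword is dropLast, not includeFirst as the doc string claims.
# "categoryIndex"/"categoryVec" are hypothetical names; the input column just has
# to hold numeric category indices (e.g. the output of StringIndexer).
indexed = spark.createDataFrame([(0.0,), (1.0,), (2.0,)], ["categoryIndex"])
encoder = OneHotEncoder(dropLast=True, inputCol="categoryIndex", outputCol="categoryVec")
encoder.transform(indexed).show()
{code}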





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-02-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883800#comment-15883800
 ] 

Apache Spark commented on SPARK-19733:
--

User 'datumbox' has created a pull request for this issue:
https://github.com/apache/spark/pull/17059

> ALS performs unnecessary casting on item and user ids
> -
>
> Key: SPARK-19733
> URL: https://issues.apache.org/jira/browse/SPARK-19733
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.1.0
>Reporter: Vasilis Vryniotis
>
> ALS performs unnecessary casting of the user and item ids (to 
> double). I believe this is because the protected checkedCast() method 
> requires a double input. This can be avoided by refactoring the code of 
> the checkedCast method.
> Issue resolved by pull-request 17059:
> https://github.com/apache/spark/pull/17059



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19733:


Assignee: (was: Apache Spark)

> ALS performs unnecessary casting on item and user ids
> -
>
> Key: SPARK-19733
> URL: https://issues.apache.org/jira/browse/SPARK-19733
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.1.0
>Reporter: Vasilis Vryniotis
>
> ALS performs unnecessary casting of the user and item ids (to 
> double). I believe this is because the protected checkedCast() method 
> requires a double input. This can be avoided by refactoring the code of 
> the checkedCast method.
> Issue resolved by pull-request 17059:
> https://github.com/apache/spark/pull/17059



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19733:


Assignee: Apache Spark

> ALS performs unnecessary casting on item and user ids
> -
>
> Key: SPARK-19733
> URL: https://issues.apache.org/jira/browse/SPARK-19733
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.1.0
>Reporter: Vasilis Vryniotis
>Assignee: Apache Spark
>
> ALS performs unnecessary casting of the user and item ids (to 
> double). I believe this is because the protected checkedCast() method 
> requires a double input. This can be avoided by refactoring the code of 
> the checkedCast method.
> Issue resolved by pull-request 17059:
> https://github.com/apache/spark/pull/17059



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-02-24 Thread Vasilis Vryniotis (JIRA)
Vasilis Vryniotis created SPARK-19733:
-

 Summary: ALS performs unnecessary casting on item and user ids
 Key: SPARK-19733
 URL: https://issues.apache.org/jira/browse/SPARK-19733
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.1.0, 2.0.1, 2.0.0, 1.6.3
Reporter: Vasilis Vryniotis


ALS performs unnecessary casting of the user and item ids (to double). I believe 
this is because the protected checkedCast() method requires a double input. This 
can be avoided by refactoring the code of the checkedCast method.

Issue resolved by pull-request 17059:
https://github.com/apache/spark/pull/17059



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15355) Pro-active block replenishment in case of node/executor failures

2017-02-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-15355:
---

Assignee: Shubham Chopra

> Pro-active block replenishment in case of node/executor failures
> 
>
> Key: SPARK-15355
> URL: https://issues.apache.org/jira/browse/SPARK-15355
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Shubham Chopra
>Assignee: Shubham Chopra
> Fix For: 2.2.0
>
>
> Spark currently does not replenish lost replicas. For resiliency and high 
> availability, BlockManagerMasterEndpoint can proactively verify whether all 
> cached RDDs have enough replicas, and replenish them, in case they don’t.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15355) Pro-active block replenishment in case of node/executor failures

2017-02-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15355.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 14412
[https://github.com/apache/spark/pull/14412]

> Pro-active block replenishment in case of node/executor failures
> 
>
> Key: SPARK-15355
> URL: https://issues.apache.org/jira/browse/SPARK-15355
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Shubham Chopra
> Fix For: 2.2.0
>
>
> Spark currently does not replenish lost replicas. For resiliency and high 
> availability, BlockManagerMasterEndpoint can proactively verify whether all 
> cached RDDs have enough replicas, and replenish them, in case they don’t.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13330) PYTHONHASHSEED is not propagated to python worker

2017-02-24 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-13330:
---

Assignee: Jeff Zhang

> PYTHONHASHSEED is not propagated to python worker
> 
>
> Key: SPARK-13330
> URL: https://issues.apache.org/jira/browse/SPARK-13330
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 2.2.0
>
>
> when using Python 3.3, PYTHONHASHSEED is only set on the driver but not 
> propagated to the executors, causing the following error.
> {noformat}
>   File "/Users/jzhang/github/spark/python/pyspark/rdd.py", line 74, in 
> portable_hash
> raise Exception("Randomness of hash of string should be disabled via 
> PYTHONHASHSEED")
> Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> {noformat}
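
Until the fix is in, a commonly suggested workaround is to forward the variable to executors explicitly via {{spark.executorEnv}} (a sketch only; the seed value 0 is arbitrary, it just has to match the driver):
{code}
from pyspark import SparkConf, SparkContext

# Sketch of a workaround: propagate a fixed PYTHONHASHSEED to the executors'
# python workers ourselves so string hashing is consistent across the cluster.
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "0")
sc = SparkContext(conf=conf)
{code}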



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13330) PYTHONHASHSEED is not propagated to python worker

2017-02-24 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-13330.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 11211
[https://github.com/apache/spark/pull/11211]

> PYTHONHASHSEED is not propagated to python worker
> 
>
> Key: SPARK-13330
> URL: https://issues.apache.org/jira/browse/SPARK-13330
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
> Fix For: 2.2.0
>
>
> when using Python 3.3, PYTHONHASHSEED is only set on the driver but not 
> propagated to the executors, causing the following error.
> {noformat}
>   File "/Users/jzhang/github/spark/python/pyspark/rdd.py", line 74, in 
> portable_hash
> raise Exception("Randomness of hash of string should be disabled via 
> PYTHONHASHSEED")
> Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13947) PySpark DataFrames: The error message from using an invalid table reference is not clear

2017-02-24 Thread Ruben Berenguel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883688#comment-15883688
 ] 

Ruben Berenguel commented on SPARK-13947:
-

I'll give a shot to this one as a first dive into the Spark codebase. Wish me 
luck :)

> PySpark DataFrames: The error message from using an invalid table reference 
> is not clear
> 
>
> Key: SPARK-13947
> URL: https://issues.apache.org/jira/browse/SPARK-13947
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Wes McKinney
>
> {code}
> import numpy as np
> import pandas as pd
> df = pd.DataFrame({'foo': np.random.randn(1000),
>                    'bar': np.random.randn(1000)})
> df2 = pd.DataFrame({'foo': np.random.randn(1000),
>                     'bar': np.random.randn(1000)})
> sdf = sqlContext.createDataFrame(df)
> sdf2 = sqlContext.createDataFrame(df2)
> sdf[sdf2.foo > 0]
> {code}
> Produces this error message:
> {code}
> AnalysisException: u'resolved attribute(s) foo#91 missing from bar#87,foo#88 
> in operator !Filter (foo#91 > cast(0 as double));'
> {code}
> It may be possible to make it more clear what the user did wrong. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14561) History Server does not see new logs in S3

2017-02-24 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-14561:
---
Component/s: Spark Core

> History Server does not see new logs in S3
> --
>
> Key: SPARK-14561
> URL: https://issues.apache.org/jira/browse/SPARK-14561
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Miles Crawford
>
> If you set the Spark history server to use a log directory with an s3a:// 
> url, everything appears to work fine at first, but new log files written by 
> applications are not picked up by the server.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19715) Option to Strip Paths in FileSource

2017-02-24 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883642#comment-15883642
 ] 

Michael Armbrust commented on SPARK-19715:
--

This isn't a hypothetical.  A user of structured streaming upgraded to {{s3a}} 
and was surprised to see duplicate computation in the results.  Their files are 
named with a combination of upload time and a GUID, so I don't think there is 
any risk for this use case.  That said, I would not make this option the 
default.

> Option to Strip Paths in FileSource
> ---
>
> Key: SPARK-19715
> URL: https://issues.apache.org/jira/browse/SPARK-19715
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>
> Today, we compare the whole path when deciding if a file is new in the 
> FileSource for structured streaming.  However, this can cause false 
> negatives in the case where the path has changed in a cosmetic way (i.e. 
> changing s3n to s3a).  We should add an option {{fileNameOnly}} that causes 
> the new file check to be based only on the filename (but still store the 
> whole path in the log).
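
If the option lands as proposed, usage could look roughly like this (a sketch that assumes the option is exposed on the file source under the proposed name {{fileNameOnly}}; the path is made up and an active {{spark}} session is assumed):
{code}
# Sketch only: assumes the proposed fileNameOnly option exists on the file source.
stream = (spark.readStream
          .format("text")
          .option("fileNameOnly", "true")   # compare files by name, not by full path
          .load("s3a://bucket/events/"))
{code}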



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14501) spark.ml parity for fpm - frequent items

2017-02-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883542#comment-15883542
 ] 

Joseph K. Bradley commented on SPARK-14501:
---

I set the target for Scala to 2.2.  Not sure if Python will make it.

> spark.ml parity for fpm - frequent items
> 
>
> Key: SPARK-14501
> URL: https://issues.apache.org/jira/browse/SPARK-14501
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for porting the spark.mllib.fpm subpackage to spark.ml.
> I am initially creating a single subtask, which will require a brief design 
> doc for the DataFrame-based API.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14501) spark.ml parity for fpm - frequent items

2017-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14501:
--
Target Version/s:   (was: 2.2.0)

> spark.ml parity for fpm - frequent items
> 
>
> Key: SPARK-14501
> URL: https://issues.apache.org/jira/browse/SPARK-14501
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for porting the spark.mllib.fpm subpackage to spark.ml.
> I am initially creating a single subtask, which will require a brief design 
> doc for the DataFrame-based API.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-14503:
-

Assignee: yuhao yang

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14503:
--
Target Version/s: 2.2.0

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-02-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14503:
--
Shepherd: Joseph K. Bradley  (was: Nick Pentreath)

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark

2017-02-24 Thread Len Frodgers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883522#comment-15883522
 ] 

Len Frodgers edited comment on SPARK-19732 at 2/24/17 9:12 PM:
---

Actually there's another anomaly:
Spark (and pyspark) supports filling of bools if you specify the args as a map: 
{code}
fillna({"a": False})
{code}
, but not if you specify it as
{code}
fillna(False)
{code}

This is because (scala-)Spark has no
{code}
def fill(value: Boolean): DataFrame = fill(value, df.columns)
{code}
 method. I find that strange/buggy


was (Author: lenfrodge):
Actually there's another anomaly:
Spark (and pyspark) supports filling of bools if you specify the args as a map: 
fillna({"a": False}) , but not if you specify it as fillna(False)

This is because (scala-)Spark has no `def fill(value: Booloean): DataFrame = 
fill(value, df.columns)` method. I find that strange/buggy

> DataFrame.fillna() does not work for bools in PySpark
> -
>
> Key: SPARK-19732
> URL: https://issues.apache.org/jira/browse/SPARK-19732
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Len Frodgers
>
> In PySpark, the fillna function of DataFrame inadvertently casts bools to 
> ints, so fillna cannot be used to fill True/False.
> e.g. 
> `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` 
> yields
> `[Row(a=True), Row(a=None)]`
> It should be a=True for the second Row
> The cause is this bit of code: 
> {code}
> if isinstance(value, (int, long)):
>     value = float(value)
> {code}
> There needs to be a separate check for isinstance(bool), since in python, 
> bools are ints too
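
To make the anomaly concrete, a small PySpark sketch mirroring the report (the behaviors in the comments are as described above, not independently verified here; assumes an active {{spark}} session):
{code}
from pyspark.sql import Row

df = spark.createDataFrame([Row(a=True), Row(a=None)])

# Per the comment above, the dict form fills the boolean column...
df.fillna({"a": False}).collect()   # -> [Row(a=True), Row(a=False)]

# ...while the scalar form falls into the int/long branch and fills nothing.
df.fillna(False).collect()          # -> [Row(a=True), Row(a=None)]
{code}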



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark

2017-02-24 Thread Len Frodgers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883522#comment-15883522
 ] 

Len Frodgers commented on SPARK-19732:
--

Actually there's another anomaly:
Spark (and pyspark) supports filling of bools if you specify the args as a map: 
fillna({"a": False}) , but not if you specify it as fillna(False)

This is because (scala-)Spark has no `def fill(value: Boolean): DataFrame = 
fill(value, df.columns)` method. I find that strange/buggy.

> DataFrame.fillna() does not work for bools in PySpark
> -
>
> Key: SPARK-19732
> URL: https://issues.apache.org/jira/browse/SPARK-19732
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Len Frodgers
>
> In PySpark, the fillna function of DataFrame inadvertently casts bools to 
> ints, so fillna cannot be used to fill True/False.
> e.g. 
> `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` 
> yields
> `[Row(a=True), Row(a=None)]`
> It should be a=True for the second Row
> The cause is this bit of code: 
> {code}
> if isinstance(value, (int, long)):
>     value = float(value)
> {code}
> There needs to be a separate check for isinstance(bool), since in python, 
> bools are ints too



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19597) ExecutorSuite should have test for tasks that are not deserializable

2017-02-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19597.

   Resolution: Fixed
Fix Version/s: 2.2.0

> ExecutorSuite should have test for tasks that are not deserializable
> -
>
> Key: SPARK-19597
> URL: https://issues.apache.org/jira/browse/SPARK-19597
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.2.0
>
>
> We should have a test case that ensures that Executors gracefully handle a 
> task that fails to deserialize, by sending back a reasonable failure message.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark

2017-02-24 Thread Len Frodgers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Len Frodgers updated SPARK-19732:
-
Description: 
In PySpark, the fillna function of DataFrame inadvertently casts bools to ints, 
so fillna cannot be used to fill True/False.

e.g. `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` 
yields
`[Row(a=True), Row(a=None)]`
It should be a=True for the second Row

The cause is this bit of code: 
{code}
if isinstance(value, (int, long)):
    value = float(value)
{code}

There needs to be a separate check for isinstance(bool), since in python, bools 
are ints too

  was:
In PySpark, the fillna function of DataFrame inadvertently casts bools to ints, 
so fillna cannot be used to fill True/False.

e.g. `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` 
yields
`[Row(a=True), Row(a=None)]`
It should be a=True for the second Row

The cause is this bit of code: 
if isinstance(value, (int, long)):
value = float(value)

There needs to be a separate check for isinstance(bool), since in python, bools 
are ints too


> DataFrame.fillna() does not work for bools in PySpark
> -
>
> Key: SPARK-19732
> URL: https://issues.apache.org/jira/browse/SPARK-19732
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Len Frodgers
>
> In PySpark, the fillna function of DataFrame inadvertently casts bools to 
> ints, so fillna cannot be used to fill True/False.
> e.g. 
> `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` 
> yields
> `[Row(a=True), Row(a=None)]`
> It should be a=True for the second Row
> The cause is this bit of code: 
> {code}
> if isinstance(value, (int, long)):
>     value = float(value)
> {code}
> There needs to be a separate check for isinstance(bool), since in python, 
> bools are ints too



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19732) DataFrame.fillna() does not work for bools in PySpark

2017-02-24 Thread Len Frodgers (JIRA)
Len Frodgers created SPARK-19732:


 Summary: DataFrame.fillna() does not work for bools in PySpark
 Key: SPARK-19732
 URL: https://issues.apache.org/jira/browse/SPARK-19732
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Len Frodgers


In PySpark, the fillna function of DataFrame inadvertently casts bools to ints, 
so fillna cannot be used to fill True/False.

e.g. `spark.createDataFrame([Row(a=True),Row(a=None)]).fillna(True).collect()` 
yields
`[Row(a=True), Row(a=None)]`
It should be a=True for the second Row

The cause is this bit of code: 
if isinstance(value, (int, long)):
    value = float(value)

There needs to be a separate check for isinstance(bool), since in python, bools 
are ints too
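
A minimal sketch of the kind of check the report is asking for (an illustrative, hypothetical helper, not the actual Spark patch):
{code}
def _normalize_fill_value(value):
    # Illustrative sketch only: bool must be checked before int, because
    # isinstance(True, int) is True in Python, so booleans would otherwise
    # fall into the numeric branch and be silently cast to float.
    if isinstance(value, bool):
        return value            # keep True/False so a boolean fill can be applied
    if isinstance(value, int):  # `long` would also be handled here on Python 2
        return float(value)
    return value
{code}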



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19731) IN Operator should support arrays

2017-02-24 Thread Shawn Lavelle (JIRA)
Shawn Lavelle created SPARK-19731:
-

 Summary: IN Operator should support arrays
 Key: SPARK-19731
 URL: https://issues.apache.org/jira/browse/SPARK-19731
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0, 2.0.0, 1.6.2
Reporter: Shawn Lavelle
Priority: Minor


When the column type and array member type match, the IN operator should still 
operate on the array. This is useful for UDFs and Predicate SubQueries that 
return arrays.  

(This isn't necessarily extensible to all collections, but certainly applies to 
arrays.)

Example:
select 5 in array(1,2,3) should return false instead of a ParseException, since 
the element type of the array and the type of the column match.

create table test (val int);
insert into test values (1);
select * from test;
+--+--+
| val  |
+--+--+
| 1|
+--+--+
*select val from test where array_contains(array(1,2,3), val);*
+--+--+
| val  |
+--+--+
| 1|
+--+--+

{panel}
*select val from test where val in (array(1,2,3));*
Error: org.apache.spark.sql.AnalysisException: cannot resolve '(test.`val` IN 
(array(1, 2, 3)))' due to data type mismatch: Arguments must be same type; line 
1 pos 31;
'Project ['val]
+- 'Filter val#433 IN (array(1, 2, 3))
   +- MetastoreRelation test (state=,code=0)
{panel}

{panel}
*select val from test where val in (select array(1,2,3));*
Error: org.apache.spark.sql.AnalysisException: cannot resolve '(test.`val` = 
`array(1, 2, 3)`)' due to data type mismatch: differing types in '(test.`val` = 
`array(1, 2, 3)`)' (int and array).;;
'Project ['val]
+- 'Filter predicate-subquery#434 [(val#435 = array(1, 2, 3)#436)]
   :  +- Project [array(1, 2, 3) AS array(1, 2, 3)#436]
   : +- OneRowRelation$
   +- MetastoreRelation test (state=,code=0)
{panel}
{panel}
*select val from test where val in (select explode(array(1,2,3)));*
+--+--+
| val  |
+--+--+
| 1|
+--+--+

Note: See [SPARK-19730|https://issues.apache.org/jira/browse/SPARK-19730] for 
how a predicate subquery breaks when applied to the DataSourceAPI
{panel}
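
The same workaround can also be expressed through the DataFrame API (a sketch; assumes the {{test}} table from the example above and an active {{spark}} session):
{code}
from pyspark.sql.functions import expr

# Equivalent of the array_contains workaround shown in SQL above.
spark.table("test").where(expr("array_contains(array(1, 2, 3), val)")).show()
{code}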



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-24 Thread Roberto Mirizzi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883431#comment-15883431
 ] 

Roberto Mirizzi commented on SPARK-14409:
-

[~mlnick] my implementation was conceptually close to what we already have for 
the existing mllib. If you look at the example in 
http://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#ranking-systems
 they do exactly what I do with the goodThreshold parameter.
As you can see in my approach, I am using collect_list and windowing, and I 
simply pass the Dataset to the evaluator, similar to what we have for other 
evaluators in ml.
IMO, that's the approach that has continuity with other existing evaluators. 
However, if you think we should also support array columns, we can add that too.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-24 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883417#comment-15883417
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the reminder. I will take a look and update the PR as 
needed. (I am on the road until next Wednesday. Will try to get it by the end 
of next week.)

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4681) Turn on executor level blacklisting by default

2017-02-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-4681.
-
Resolution: Duplicate

This was for the old blacklisting mechanism.  The linked JIRAs introduce a new 
blacklisting mechanism that should eventually be enabled by default, but it is 
currently considered experimental.

> Turn on executor level blacklisting by default
> --
>
> Key: SPARK-4681
> URL: https://issues.apache.org/jira/browse/SPARK-4681
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Patrick Wendell
>Assignee: Kay Ousterhout
>
> Per discussion in https://github.com/apache/spark/pull/3541.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source

2017-02-24 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle updated SPARK-19730:
--
Description: 
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the DataSourceAPI for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data Source.  One for the subquery, and another 
for the outer query. The subquery works correctly returning the points in the 
group, however, outer query does not push a filter for point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points.  The data table has several billion rows keyed by point and time.  
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The subquery results should be pushed down to the outer query as an IN Filter 
with the results of the subquery.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- \*Scan @4b455128 DBNAME.data[ point#263, time#264Lvalue#270] 
*PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)*, ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}

  was:
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a fileter to the DataSourceAPI for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data Source.  One for the subquery, and another 
for the outer query. The subquery works correctly returning the points in the 
group, however, outer query does not push a filter for point column.

Affect:
The "group" table has a few hundred rows to group a few hundred thousand 
points.  The data table has several billion rows keyed by point and time.  
Without the ability to push down the filters for the columns of outer the 
query, the data source cannot properly conduct its pruned scan.

The subquery results should be pushed down to the outer query as an IN Filter 
with the results of the subquery.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- \*Scan @4b455128 DBNAME.data[ point#263, time#264Lvalue#270] 
*PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)*, ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}
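
Until such pushdown exists, one user-side workaround is to run the subquery eagerly and feed its result into an explicit IN filter, which can then be pushed to the data source as an {{In}} filter (a sketch; table and column names follow the example above, and an active {{spark}} session is assumed):
{code}
from pyspark.sql.functions import col

# Sketch of a manual workaround: materialize the subquery first, then use isin()
# so the outer scan receives an In() filter through the DataSource API.
points = [r[0] for r in spark.sql("SELECT point FROM groups WHERE group_id = 5").collect()]
filtered = spark.table("data").where(col("point").isin(points))
{code}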


> Predicate Subqueries do not push results of subqueries to data source
> -
>
> Key: SPARK-19730
> URL: https://issues.apache.org/jira/browse/SPARK-19730
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Shawn Lavelle
>
> When a SparkSQL query contains a subquery in the where clause, such as a 
> predicate query using the IN operator, the results of that subquery are not 
> pushed down as a filter to the DataSourceAPI for the outer query. 
> Example: 
> Select point, time, value from data where time between now()-86400 and now() 
> and point in (select point from groups where group_id=5);
> Two queries will be sent to the data Source.  One for the subquery, and 
> another for the outer query. The 

[jira] [Closed] (SPARK-19560) Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor

2017-02-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-19560.
--
  Resolution: Fixed
Target Version/s: 2.2.0

> Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask 
> from a failed executor
> 
>
> Key: SPARK-19560
> URL: https://issues.apache.org/jira/browse/SPARK-19560
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> There's some tricky code around the case when the DAGScheduler learns of a 
> ShuffleMapTask that completed successfully, but ran on an executor that 
> failed sometime after the task was launched.  This case is tricky because the 
> TaskSetManager (i.e., the lower level scheduler) thinks the task completed 
> successfully, but the DAGScheduler considers the output it generated to be no 
> longer valid (because it was probably lost when the executor was lost).  As a 
> result, the DAGScheduler needs to re-submit the stage, so that the task can 
> be re-run.  This is tested in some of the tests but not clearly documented, 
> so we should improve this to prevent future bugs (this was encountered by 
> [~markhamstra] in attempting to find a better fix for SPARK-19263).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source

2017-02-24 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle updated SPARK-19730:
--
Description: 
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the DataSourceAPI for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data Source.  One for the subquery, and another 
for the outer query. The subquery works correctly returning the points in the 
group, however, outer query does not push a filter for point column.

Affect:
The "group" table has a few hundred rows to group a few hundred thousand 
points.  The data table has several billion rows keyed by point and time.  
Without the ability to push down the filters for the columns of outer the 
query, the data source cannot properly conduct its pruned scan.

The subquery results should be pushed down to the outer query as an IN Filter 
with the results of the subquery.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- *Scan @4b455128 DBNAME.data[ point#263, time#264Lvalue#270] 
*PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)*, ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}

  was:
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a fileter to the DataSourceAPI for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data Source.  One for the subquery, and another 
for the outer query. The subquery works correctly returning the points in the 
group, however, outer query does not push a filter for point column.

Affect:
The "group" table has a few hundred rows to group a few hundred thousand 
points.  The data table has several billion rows keyed by point and time.  
Without the ability to push down the filters for the columns of outer the 
query, the data source cannot properly conduct its pruned scan.

The subquery results should be pushed down to the outer query as an IN Filter 
with the results of the subquery.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- *Scan @4b455128 .data[point#263,time#264Lvalue#270] 
*PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)*, ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}


> Predicate Subqueries do not push results of subqueries to data source
> -
>
> Key: SPARK-19730
> URL: https://issues.apache.org/jira/browse/SPARK-19730
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Shawn Lavelle
>
> When a SparkSQL query contains a subquery in the where clause, such as a 
> predicate query using the IN operator, the results of that subquery are not 
> pushed down as a filter to the DataSourceAPI for the outer query. 
> Example: 
> Select point, time, value from data where time between now()-86400 and now() 
> and point in (select point from groups where group_id=5);
> Two queries will be sent to the data Source.  One for the subquery, and 
> another for the outer query. The subquery works 

[jira] [Updated] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source

2017-02-24 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle updated SPARK-19730:
--
Description: 
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the DataSourceAPI for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data Source.  One for the subquery, and another 
for the outer query. The subquery works correctly returning the points in the 
group, however, outer query does not push a filter for point column.

Affect:
The "group" table has a few hundred rows to group a few hundred thousand 
points.  The data table has several billion rows keyed by point and time.  
Without the ability to push down the filters for the columns of outer the 
query, the data source cannot properly conduct its pruned scan.

The subquery results should be pushed down to the outer query as an IN Filter 
with the results of the subquery.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- *Scan @4b455128 DBNAME.data[ point#263, time#264Lvalue#270] 
**PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)**, ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}

  was:
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a fileter to the DataSourceAPI for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data Source.  One for the subquery, and another 
for the outer query. The subquery works correctly returning the points in the 
group, however, outer query does not push a filter for point column.

Affect:
The "group" table has a few hundred rows to group a few hundred thousand 
points.  The data table has several billion rows keyed by point and time.  
Without the ability to push down the filters for the columns of outer the 
query, the data source cannot properly conduct its pruned scan.

The subquery results should be pushed down to the outer query as an IN Filter 
with the results of the subquery.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- *Scan @4b455128 DBNAME.data[ point#263, time#264Lvalue#270] 
*PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)*, ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}


> Predicate Subqueries do not push results of subqueries to data source
> -
>
> Key: SPARK-19730
> URL: https://issues.apache.org/jira/browse/SPARK-19730
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Shawn Lavelle
>
> When a SparkSQL query contains a subquery in the where clause, such as a 
> predicate query using the IN operator, the results of that subquery are not 
> pushed down as a filter to the DataSourceAPI for the outer query. 
> Example: 
> Select point, time, value from data where time between now()-86400 and now() 
> and point in (select point from groups where group_id=5);
> Two queries will be sent to the data Source.  One for the subquery, and 
> another for the outer query. The 

[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883394#comment-15883394
 ] 

Reynold Xin commented on SPARK-17495:
-

Let me put some thoughts here.  Please let me know if I missed anything:

1. On the read side we shouldn't care which hash function to use. All we need 
to know is that the data is hash partitioned by some hash function, and that 
should be sufficient to remove the shuffle needed in aggregation or join.

2. On the write side it does matter. In this case if we are writing to a Hive 
bucketed table, the Hive hash function should be used. Otherwise a Spark hash 
function should be used. This can perhaps be an option in the writer interface, 
and automatically populated for catalog tables based on what kind of table it 
is.

3. In general it'd be useful to allow users to configure which actual hash 
function "hash" maps to. This can be a dynamic config.





> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source

2017-02-24 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle updated SPARK-19730:
--
Description: 
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the DataSourceAPI for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data Source.  One for the subquery, and another 
for the outer query. The subquery works correctly returning the points in the 
group, however, outer query does not push a filter for point column.

Affect:
The "group" table has a few hundred rows to group a few hundred thousand 
points.  The data table has several billion rows keyed by point and time.  
Without the ability to push down the filters for the columns of outer the 
query, the data source cannot properly conduct its pruned scan.

The subquery results should be pushed down to the outer query as an IN Filter 
with the results of the subquery.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- *Scan @4b455128 DBNAME.data[point#263,time#264L,value#270] 
PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)], ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}

  was:
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query.

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data source: one for the subquery and another 
for the outer query. The subquery works correctly, returning the points in the 
group; however, the outer query does not push a filter for the point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points. The data table has several billion rows keyed by point and time. 
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The results of the subquery should be pushed down to the outer query as an IN 
filter.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- *Scan @4b455128 DBNAME.data[point#263,time#264L,value#270] 
PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)], ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}


> Predicate Subqueries do not push results of subqueries to data source
> -
>
> Key: SPARK-19730
> URL: https://issues.apache.org/jira/browse/SPARK-19730
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Shawn Lavelle
>
> When a SparkSQL query contains a subquery in the where clause, such as a 
> predicate query using the IN operator, the results of that subquery are not 
> pushed down as a filter to the Data Source API for the outer query. 
> Example: 
> Select point, time, value from data where time between now()-86400 and now() 
> and point in (select point from groups where group_id=5);
> Two queries will be sent to the data source: one for the subquery, and 
> another for the outer query. The 

[jira] [Updated] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source

2017-02-24 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle updated SPARK-19730:
--
Description: 
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query.

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data source: one for the subquery and another 
for the outer query. The subquery works correctly, returning the points in the 
group; however, the outer query does not push a filter for the point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points. The data table has several billion rows keyed by point and time. 
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The results of the subquery should be pushed down to the outer query as an IN 
filter.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- *Scan @4b455128 DBNAME.data[point#263,time#264L,value#270] 
PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)], ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}

  was:
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query.

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data source: one for the subquery and another 
for the outer query. The subquery works correctly, returning the points in the 
group; however, the outer query does not push a filter for the point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points. The data table has several billion rows keyed by point and time. 
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The results of the subquery should be pushed down to the outer query as an IN 
filter.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- *Scan @4b455128 DBNAME.data[point#263,time#264L,value#270] 
PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)], ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}


> Predicate Subqueries do not push results of subqueries to data source
> -
>
> Key: SPARK-19730
> URL: https://issues.apache.org/jira/browse/SPARK-19730
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Shawn Lavelle
>
> When a SparkSQL query contains a subquery in the where clause, such as a 
> predicate query using the IN operator, the results of that subquery are not 
> pushed down as a filter to the Data Source API for the outer query. 
> Example: 
> Select point, time, value from data where time between now()-86400 and now() 
> and point in (select point from groups where group_id=5);
> Two queries will be sent to the data source: one for the subquery, and 
> another for the outer query. The 

[jira] [Updated] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source

2017-02-24 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle updated SPARK-19730:
--
Description: 
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query.

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data source: one for the subquery and another 
for the outer query. The subquery works correctly, returning the points in the 
group; however, the outer query does not push a filter for the point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points. The data table has several billion rows keyed by point and time. 
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The results of the subquery should be pushed down to the outer query as an IN 
filter.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   : +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
1487964696000))
   :+- *Scan @4b455128 .data[point#263,time#264L,value#270] 
PushedFilters: [GreaterThanOrEqual(time,1487964691000), 
LessThanOrEqual(time,1487964691000)], ReadSchema: 
struct...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col#284, 20)
 +- Generate explode(points#273), false, false, [col#284]
+- *Project [points#273]
   +- *Filter (group_id#272 = 1)
  +- *Scan @12fb3c1a .groups[points#273,group_id#272] 
PushedFilters: [EqualTo(group_id,1)], ReadSchema: struct  |
{panel}

  was:
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query.

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data source: one for the subquery and another 
for the outer query. The subquery works correctly, returning the points in the 
group; however, the outer query does not push a filter for the point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points. The data table has several billion rows keyed by point and time. 
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The results of the subquery should be pushed down to the outer query as an IN 
filter.


> Predicate Subqueries do not push results of subqueries to data source
> -
>
> Key: SPARK-19730
> URL: https://issues.apache.org/jira/browse/SPARK-19730
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0
>Reporter: Shawn Lavelle
>
> When a SparkSQL query contains a subquery in the where clause, such as a 
> predicate query using the IN operator, the results of that subquery are not 
> pushed down as a filter to the Data Source API for the outer query. 
> Example: 
> Select point, time, value from data where time between now()-86400 and now() 
> and point in (select point from groups where group_id=5);
> Two queries will be sent to the data source: one for the subquery, and 
> another for the outer query. The subquery works correctly, returning the 
> points in the group; however, the outer query does not push a filter for the 
> point column.
> Impact:
> The "group" table has a few hundred rows grouping a few hundred thousand 
> points. The data table has several billion rows keyed by point and time. 
> Without the ability to push down the filters for the columns of the outer 
> query, the data source cannot properly conduct its pruned scan.
> The results of the subquery should be pushed down to the outer query as an IN 
> filter.
> {panel:title=Physical Plan}
> *Project [point#263, value#270]
> +- SortMergeJoin [point#263], [col#284], LeftSemi
>:- *Sort [point#263 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(point#263, 20)
>: +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 
> 1487964696000))
>:+- *Scan @4b455128 .data[point#263,time#264Lvalue#270] 
> 

[jira] [Updated] (SPARK-19572) Allow to disable hive in sparkR shell

2017-02-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19572:
-
Target Version/s:   (was: 2.1.1)

> Allow to disable hive in sparkR shell
> -
>
> Key: SPARK-19572
> URL: https://issues.apache.org/jira/browse/SPARK-19572
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> SPARK-15236 does this for the Scala shell; this ticket is for the SparkR shell. This 
> is not only for SparkR itself, but can also benefit downstream projects like 
> Livy, which uses shell.R for its interactive session. For now, Livy has no 
> control over whether Hive is enabled or not.
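
For context, a minimal Scala sketch of the session-building choice the ticket wants exposed to the R shell; the {{enableHive}} flag is illustrative, not an existing option.

{code}
import org.apache.spark.sql.SparkSession

// Sketch only: the enableHive flag stands in for whatever launch-time setting
// the R shell (or Livy) would read; it is not an existing Spark option.
def buildShellSession(enableHive: Boolean): SparkSession = {
  val builder = SparkSession.builder().appName("shell")
  if (enableHive) builder.enableHiveSupport().getOrCreate()
  else builder.getOrCreate()
}
{code}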



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19711) Bug in gapply function

2017-02-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883337#comment-15883337
 ] 

Felix Cheung commented on SPARK-19711:
--

Thanks, I'll look into this shortly.

> Bug in gapply function
> --
>
> Key: SPARK-19711
> URL: https://issues.apache.org/jira/browse/SPARK-19711
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
> Environment: Using Databricks plataform.
>Reporter: Luis Felipe Sant Ana
> Attachments: mv_demand_20170221.csv, resume.R
>
>
> I have a dataframe in SparkR like 
>   CNPJPID   DATA N
> 1 10140281000131 100021 2015-04-23 1
> 2 10140281000131 100021 2015-04-27 1
> 3 10140281000131 100021 2015-04-02 1
> 4 10140281000131 100021 2015-11-10 1
> 5 10140281000131 100021 2016-11-14 1
> 6 10140281000131 100021 2015-04-03 1
> And I want to group by columns CNPJ and PID using the gapply() function, filling 
> in the column DATA with dates. Then I fill in the missing dates with zeros.
> The code:
> schema <- structType(structField("CNPJ", "string"), 
>  structField("PID", "string"),
>  structField("DATA", "date"),
>  structField("N", "double"))
> result <- gapply(
>   ds_filtered,
>   c("CNPJ", "PID"),
>   function(key, x) {
> dts <- data.frame(key, DATA = seq(min(as.Date(x$DATA)), as.Date(e_date), 
> "days"))
> colnames(dts)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- data.frame(key, DATA = as.Date(x$DATA), N = x$N)
> colnames(y)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- dplyr::left_join(dts, 
>  y,
>  by = c("CNPJ", "PID", "DATA"))
> 
> y[is.na(y$N), 4] <- 0
> 
> data.frame(CNPJ = as.character(y$CNPJ),
>PID = as.character(y$PID),
>DATA = y$DATA,
>N = y$N)
>   }, 
>   schema)
> Error:
> Error in handleErrors(returnStatus, conn) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 92.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 92.0 (TID 7032, 10.93.243.111, executor 0): org.apache.spark.SparkException: 
> R computation failed with
>  Error in writeType(con, serdeType) : 
>   Unsupported type for serialization factor
> Calls: outputResult ... serializeRow -> writeList -> writeObject -> writeType
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:404)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:386)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   

[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-24 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883331#comment-15883331
 ] 

Charles Allen commented on SPARK-19698:
---

[~mridulm80] is there documentation somewhere that describes output commit 
best practices? I see a bunch of things that seem to use either the Hadoop MR 
output committer or some Spark-specific output committing code, but it is not 
clear when each should be used.

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shut down the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2336) Approximate k-NN Models for MLLib

2017-02-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2336.
--
Resolution: Won't Fix

> Approximate k-NN Models for MLLib
> -
>
> Key: SPARK-2336
> URL: https://issues.apache.org/jira/browse/SPARK-2336
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: clustering, features
>
> After tackling the general k-Nearest Neighbor model as per 
> https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to 
> also offer approximate k-Nearest Neighbor. A promising approach would involve 
> building a kd-tree variant within from each partition, a la
> http://www.autonlab.org/autonweb/14714.html?branch=1=2
> This could offer a simple non-linear ML model that can label new data with 
> much lower latency than the plain-vanilla kNN versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17078) show estimated stats when doing explain

2017-02-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-17078:
---

Assignee: Zhenhua Wang

> show estimated stats when doing explain
> ---
>
> Key: SPARK-17078
> URL: https://issues.apache.org/jira/browse/SPARK-17078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17078) show estimated stats when doing explain

2017-02-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17078.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16594
[https://github.com/apache/spark/pull/16594]

> show estimated stats when doing explain
> ---
>
> Key: SPARK-17078
> URL: https://issues.apache.org/jira/browse/SPARK-17078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source

2017-02-24 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle updated SPARK-19730:
--
Description: 
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query.

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data source: one for the subquery and another 
for the outer query. The subquery works correctly, returning the points in the 
group; however, the outer query does not push a filter for the point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points. The data table has several billion rows keyed by point and time. 
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The results of the subquery should be pushed down to the outer query as an IN 
filter.

  was:
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query.

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data source: one for the subquery and another 
for the outer query. The subquery works correctly, returning the points in the 
group; however, the outer query does not push a filter for the point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points. The data table has several billion rows keyed by point and time. 
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The results of the subquery should be pushed down to the outer query as an IN 
filter.

{panel:title=Parsed Logical Plan}
Project ['cpid, 'value]
+- 'Filter ('cpid IN (list#280) && (('system_timestamp_ms >= ((cast('now() as 
bigint) * 1000) - 5000)) && ('system_timestamp_ms <= (cast('now() as bigint) * 
1000
   :  +- 'Project [unresolvedalias('explode('cpids), None)]
   : +- 'Filter ('station_id = 1)
   :+- 'UnresolvedRelation `stations`
   +- 'UnresolvedRelation `timeseries`
{panel}

== Analyzed Logical Plan ==
cpid: int, value: double
Project [cpid#265, value#272]
+- Filter (predicate-subquery#280 [(cpid#265 = col#283)] && 
((system_timestamp_ms#266L >= ((cast(current_timestamp() as bigint) * cast(1000 
as bigint)) - cast(5000 as bigint))) && (system_timestamp_ms#266L <= 
(cast(current_timestamp() as bigint) * cast(1000 as bigint)
   :  +- Project [col#283]
   : +- Generate explode(cpids#275), false, false, [col#283]
   :+- Filter (station_id#274 = 1)
   :   +- SubqueryAlias stations
   :  +- Relation[station#273,station_id#274,cpids#275] 
com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@69a156b
   +- SubqueryAlias timeseries
  +- 
Relation[bucket#263L,split_factor#264,cpid#265,system_timestamp_ms#266L,source_timestamp_ms#267L,chronus_quality#268,tag#269,limits#270,quality#271,value#272]
 com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@1bb602a0

== Optimized Logical Plan ==
Project [cpid#265, value#272]
+- Join LeftSemi, (cpid#265 = col#283)
   :- Filter ((system_timestamp_ms#266L >= 1487959796000) && 
(system_timestamp_ms#266L <= 1487959801000))
   :  +- 
Relation[bucket#263L,split_factor#264,cpid#265,system_timestamp_ms#266L,source_timestamp_ms#267L,chronus_quality#268,tag#269,limits#270,quality#271,value#272]
 com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@1bb602a0
   +- Generate explode(cpids#275), false, false, [col#283]
  +- Project [cpids#275]
 +- Filter (station_id#274 = 1)
+- Relation[station#273,station_id#274,cpids#275] 
com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@69a156b

== Physical Plan ==
*Project [cpid#265, value#272]
+- SortMergeJoin [cpid#265], [col#283], LeftSemi
   :- *Sort [cpid#265 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(cpid#265, 20)
   : +- *Filter ((system_timestamp_ms#266L >= 1487959796000) && 
(system_timestamp_ms#266L <= 1487959801000))
   :+- *Scan 
com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@1bb602a0 
chronus_hsh_20xx.timeseries[bucket#263L,split_factor#264,cpid#265,system_timestamp_ms#266L,source_timestamp_ms#267L,chronus_quality#268,tag#269,limits#270,quality#271,value#272]
 PushedFilters: [GreaterThanOrEqual(system_timestamp_ms,1487959796000), 

[jira] [Updated] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source

2017-02-24 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle updated SPARK-19730:
--
Description: 
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query.

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data source: one for the subquery and another 
for the outer query. The subquery works correctly, returning the points in the 
group; however, the outer query does not push a filter for the point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points. The data table has several billion rows keyed by point and time. 
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The results of the subquery should be pushed down to the outer query as an IN 
filter.

{panel:title=Parsed Logical Plan}
Project ['cpid, 'value]
+- 'Filter ('cpid IN (list#280) && (('system_timestamp_ms >= ((cast('now() as 
bigint) * 1000) - 5000)) && ('system_timestamp_ms <= (cast('now() as bigint) * 
1000
   :  +- 'Project [unresolvedalias('explode('cpids), None)]
   : +- 'Filter ('station_id = 1)
   :+- 'UnresolvedRelation `stations`
   +- 'UnresolvedRelation `timeseries`
{panel}

== Analyzed Logical Plan ==
cpid: int, value: double
Project [cpid#265, value#272]
+- Filter (predicate-subquery#280 [(cpid#265 = col#283)] && 
((system_timestamp_ms#266L >= ((cast(current_timestamp() as bigint) * cast(1000 
as bigint)) - cast(5000 as bigint))) && (system_timestamp_ms#266L <= 
(cast(current_timestamp() as bigint) * cast(1000 as bigint)
   :  +- Project [col#283]
   : +- Generate explode(cpids#275), false, false, [col#283]
   :+- Filter (station_id#274 = 1)
   :   +- SubqueryAlias stations
   :  +- Relation[station#273,station_id#274,cpids#275] 
com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@69a156b
   +- SubqueryAlias timeseries
  +- 
Relation[bucket#263L,split_factor#264,cpid#265,system_timestamp_ms#266L,source_timestamp_ms#267L,chronus_quality#268,tag#269,limits#270,quality#271,value#272]
 com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@1bb602a0

== Optimized Logical Plan ==
Project [cpid#265, value#272]
+- Join LeftSemi, (cpid#265 = col#283)
   :- Filter ((system_timestamp_ms#266L >= 1487959796000) && 
(system_timestamp_ms#266L <= 1487959801000))
   :  +- 
Relation[bucket#263L,split_factor#264,cpid#265,system_timestamp_ms#266L,source_timestamp_ms#267L,chronus_quality#268,tag#269,limits#270,quality#271,value#272]
 com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@1bb602a0
   +- Generate explode(cpids#275), false, false, [col#283]
  +- Project [cpids#275]
 +- Filter (station_id#274 = 1)
+- Relation[station#273,station_id#274,cpids#275] 
com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@69a156b

== Physical Plan ==
*Project [cpid#265, value#272]
+- SortMergeJoin [cpid#265], [col#283], LeftSemi
   :- *Sort [cpid#265 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(cpid#265, 20)
   : +- *Filter ((system_timestamp_ms#266L >= 1487959796000) && 
(system_timestamp_ms#266L <= 1487959801000))
   :+- *Scan 
com.osii.hsh.analytics.spark.cassandra.HSHCassandraConnector@1bb602a0 
chronus_hsh_20xx.timeseries[bucket#263L,split_factor#264,cpid#265,system_timestamp_ms#266L,source_timestamp_ms#267L,chronus_quality#268,tag#269,limits#270,quality#271,value#272]
 PushedFilters: [GreaterThanOrEqual(system_timestamp_ms,1487959796000), 
LessThanOrEqual(system_timestamp_ms,14879..., ReadSchema: 
struct  |

  was:
When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data 

[jira] [Created] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source

2017-02-24 Thread Shawn Lavelle (JIRA)
Shawn Lavelle created SPARK-19730:
-

 Summary: Predicate Subqueries do not push results of subqueries to 
data source
 Key: SPARK-19730
 URL: https://issues.apache.org/jira/browse/SPARK-19730
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Affects Versions: 2.1.0
Reporter: Shawn Lavelle


When a SparkSQL query contains a subquery in the where clause, such as a 
predicate query using the IN operator, the results of that subquery are not 
pushed down as a filter to the Data Source API for the outer query.

Example: 
Select point, time, value from data where time between now()-86400 and now() 
and point in (select point from groups where group_id=5);

Two queries will be sent to the data source: one for the subquery and another 
for the outer query. The subquery works correctly, returning the points in the 
group; however, the outer query does not push a filter for the point column.

Impact:
The "group" table has a few hundred rows grouping a few hundred thousand 
points. The data table has several billion rows keyed by point and time. 
Without the ability to push down the filters for the columns of the outer 
query, the data source cannot properly conduct its pruned scan.

The results of the subquery should be pushed down to the outer query as an IN 
filter.
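
For illustration, a hedged sketch of what the requested behaviour looks like from a data source's point of view: if the subquery result were pushed down, {{buildScan}} would receive an {{In("point", ...)}} filter next to the time-range filters that are already pushed today. {{PrunedFilteredScan}} and the {{org.apache.spark.sql.sources}} filter classes are real Data Source API types; the relation and the literal values are toy examples, not the reporter's source.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Toy relation used only to show the desired filter set.
class ToyDataRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("point", IntegerType),
    StructField("time", LongType),
    StructField("value", DoubleType)))

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Desired: filters would contain something like
    //   In("point", Array(101, 102, 103))           // from the subquery result
    //   GreaterThanOrEqual("time", 1487964691000L)  // already pushed today
    //   LessThanOrEqual("time", 1487964696000L)     // already pushed today
    // so the source could prune by point as well as by time.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}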



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17495) Hive hash implementation

2017-02-24 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883203#comment-15883203
 ] 

Tejas Patil edited comment on SPARK-17495 at 2/24/17 5:57 PM:
--

[~rxin] : No probs. Any opinion about my comment from yesterday ?

i.e. 
https://issues.apache.org/jira/browse/SPARK-17495?focusedCommentId=15882161=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15882161


was (Author: tejasp):
[~rxin] : No probs. Any opinion about my comment from yesterday ?

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-24 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883203#comment-15883203
 ] 

Tejas Patil commented on SPARK-17495:
-

[~rxin] : No probs. Any opinion about my comment from yesterday ?

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19729) Strange behaviour with reading csv with schema into dataframe

2017-02-24 Thread Mazen Melouk (JIRA)
Mazen Melouk created SPARK-19729:


 Summary: Strange behaviour with reading csv with schema into 
dataframe
 Key: SPARK-19729
 URL: https://issues.apache.org/jira/browse/SPARK-19729
 Project: Spark
  Issue Type: Bug
  Components: Java API, SQL
Affects Versions: 2.0.1
Reporter: Mazen Melouk


I have the following schema
[{first,string_type,false}
,{second,string_type,false}
,{third,string_type,false}
,{fourth,string_type,false}]

Example lines:
var1,var2,,

When accessing the row I get the following:
row.size = 4
row.fieldIndex(third_string) = 2
row.getAs(third_string) = var2
row.get(2) = var2
print(row) = var1,var2

Any idea why the null values are missing?
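
A hypothetical repro sketch of the report above; the {{input.csv}} path and the SparkSession setup are assumptions, the field names mirror the schema in the description, and the commented results are what the reporter describes, not verified behaviour.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-schema-repro").getOrCreate()

// Schema from the description: four non-nullable string fields.
val schema = StructType(Seq(
  StructField("first",  StringType, nullable = false),
  StructField("second", StringType, nullable = false),
  StructField("third",  StringType, nullable = false),
  StructField("fourth", StringType, nullable = false)))

// input.csv is assumed to contain lines like: var1,var2,,
val df = spark.read.schema(schema).csv("input.csv")
val row = df.head()
println(row.size)                    // 4
println(row.getAs[String]("third"))  // reportedly "var2" instead of null
{code}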





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17495) Hive hash implementation

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17495:


Assignee: Apache Spark  (was: Tejas Patil)

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17495) Hive hash implementation

2017-02-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17495:


Assignee: Tejas Patil  (was: Apache Spark)

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883189#comment-15883189
 ] 

Reynold Xin commented on SPARK-17495:
-

Ah yes. I kept doing it ... :)


> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17495) Hive hash implementation

2017-02-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil reopened SPARK-17495:
-

Re-opening. This is not done yet, as there are a few datatypes that still need to 
be handled, and this hash still needs to be put to use in the codebase.

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17495) Hive hash implementation

2017-02-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17495.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19351) Support for obtaining file splits from underlying InputFormat

2017-02-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883177#comment-15883177
 ] 

Reynold Xin commented on SPARK-19351:
-

Approach 1 should be supported today. I actually think our data source API 
should support approach 2 as well in the future, so we can leave the ticket 
open for that.


> Support for obtaining file splits from underlying InputFormat
> -
>
> Key: SPARK-19351
> URL: https://issues.apache.org/jira/browse/SPARK-19351
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Vinoth Chandar
>
> This is a request for a feature, that enables SparkSQL to obtain the files 
> for a Hive partition, by calling inputFormat.getSplits(), as opposed to 
> listing files directly, while still using Spark's optimized Parquet readers 
> for actual IO. (Note that the difference between this and falling back 
> entirely to Hive via spark.sql.hive.convertMetastoreParquet=false is that we 
> get to realize benefits such as new parquet reader, schema merging etc in 
> SparkSQL)
> Some background on the context, using our use case at Uber. We have Hive tables, 
> where each partition contains versioned files (whenever records in a file 
> change, we produce a new version, to speed up database ingestion) and such 
> tables are registered with a custom InputFormat that just filters out old 
> versions and returns the latest version of each file to the query. 
> We have this working for 5 months now across Hive/Spark/Presto as follows: 
> - Hive: Works out of the box, by calling inputFormat.getSplits, so we are 
> good there.
> - Presto: We made the fix in Presto, similar to what's proposed here. 
> - Spark: We set convertMetastoreParquet=false. Perf is actually comparable 
> for our use cases, but we run into schema merging issues now and then. 
> We have explored a few approaches here and would like to get more feedback 
> from you all before we go further. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19038) Can't find keytab file when using Hive catalog

2017-02-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19038.

   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> Can't find keytab file when using Hive catalog
> --
>
> Key: SPARK-19038
> URL: https://issues.apache.org/jira/browse/SPARK-19038
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.2
> Environment: Hadoop / YARN 2.6, pyspark, yarn-client mode
>Reporter: Peter Parente
>Assignee: Saisai Shao
>  Labels: hive, kerberos, pyspark
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> h2. Stack Trace
> {noformat}
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 sdf = sql.createDataFrame(df)
> /opt/spark2/python/pyspark/sql/context.py in createDataFrame(self, data, 
> schema, samplingRatio, verifySchema)
> 307 Py4JJavaError: ...
> 308 """
> --> 309 return self.sparkSession.createDataFrame(data, schema, 
> samplingRatio, verifySchema)
> 310 
> 311 @since(1.3)
> /opt/spark2/python/pyspark/sql/session.py in createDataFrame(self, data, 
> schema, samplingRatio, verifySchema)
> 524 rdd, schema = self._createFromLocal(map(prepare, data), 
> schema)
> 525 jrdd = 
> self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
> --> 526 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), 
> schema.json())
> 527 df = DataFrame(jdf, self._wrapped)
> 528 df._schema = schema
> /opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134 
>1135 for temp_arg in temp_args:
> /opt/spark2/python/pyspark/sql/utils.py in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /opt/spark2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o47.applySchemaToPythonRDD.
> : org.apache.spark.SparkException: Keytab file: 
> .keytab-f0b9b814-460e-4fa8-8e7d-029186b696c4 specified in spark.yarn.keytab 
> does not exist
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:113)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
>   at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
>   at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
>   at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
>   at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
>   at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
>   at 
> org.apache.spark.sql.hive.HiveSessionState$$anon$1.(HiveSessionState.scala:63)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at 
> org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:666)
>   at 
> 

[jira] [Resolved] (SPARK-19707) Improve the invalid path check for sc.addJar

2017-02-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19707.

   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.2.0
   2.1.1

> Improve the invalid path check for sc.addJar
> 
>
> Key: SPARK-19707
> URL: https://issues.apache.org/jira/browse/SPARK-19707
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.1.1, 2.2.0
>
>
> Currently in Spark there are two issues when we add jars with an invalid path:
> * If the jar path is an empty string {--jar ",dummy.jar"}, then Spark will 
> resolve it to the current directory path and add it to the classpath / file 
> server, which is unwanted.
> * If the jar path is invalid (the file doesn't exist), the file server doesn't 
> check this and will still add it; the exception is only thrown once the job is 
> running. This local path could be checked immediately, with no need to wait 
> until the task is running. We have a similar check in {{addFile}}, but lack a 
> similar one in {{addJar}}.
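
A sketch of the kind of eager local-path check the description asks for; {{validateJarPath}} is an illustrative helper, not an existing Spark method.

{code}
import java.io.File
import java.net.URI

// Illustrative helper only: reject empty paths up front and, for local paths,
// fail fast when the file does not exist instead of waiting until task time.
def validateJarPath(path: String): Unit = {
  require(path != null && path.trim.nonEmpty, "jar path must not be empty")
  val uri = new URI(path.trim)
  Option(uri.getScheme).getOrElse("file") match {
    case "file" | "local" =>
      val file = new File(Option(uri.getPath).getOrElse(path.trim))
      require(file.isFile, s"Jar does not exist or is not a file: $path")
    case _ => // remote schemes are left to the file server / cluster manager
  }
}
{code}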



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883125#comment-15883125
 ] 

Bill Chambers edited comment on SPARK-19714 at 2/24/17 5:15 PM:


The thing is QuantileDiscretizer and Bucketizer do fundamentally different 
things so there are different use cases there (quantiles vs actual values). 
It's more of a nuisance than anything and an unclear parameter that seems to 
imply things that are not actually the case.

Here's where it *really* falls apart, if I have a bucket and I provide one 
split, how many buckets do I have?

In Bucketizer I have none! That makes little sense. Splits is not the correct 
word here either because they aren't splits! They're bucket boundaries. I think 
this is more than a documentation issue, even though those aren't very clear 
themselves.

> Parameter for mapping continuous features into buckets. With n+1 splits, 
> there are n buckets. A bucket defined by splits x,y holds values in the range 
> [x,y) except the last bucket, which also includes y. Splits should be of 
> length greater than or equal to 3 and strictly increasing. Values at -inf, 
> inf must be explicitly provided to cover all Double values; otherwise, values 
> outside the splits specified will be treated as errors.

I also realize I'm being a pain here :) and that this stuff is always 
difficult. I empathize with that, it's just that this method doesn't seem to 
use correct terminology or a conceptually relevant implementation for what it 
aims to do.


was (Author: bill_chambers):
The thing is QuantileDiscretizer and Bucketizer do fundamentally different 
things so there are different use cases there (quantiles vs actual values). 
It's more of a nuisance than anything and an unclear parameter that seems to 
imply things that are not actually the case.

Here's where it *really* falls apart, if I have a bucket and I provide one 
split, how many buckets do I have?

In Bucketizer I have none! That makes no sense. Splits is not the correct word 
here either because they aren't splits! They're bounds or containers or buckets 
themselves. I think this is more than a documentation issue, even though those 
aren't very clear themselves.

> Parameter for mapping continuous features into buckets. With n+1 splits, 
> there are n buckets. A bucket defined by splits x,y holds values in the range 
> [x,y) except the last bucket, which also includes y. Splits should be of 
> length greater than or equal to 3 and strictly increasing. Values at -inf, 
> inf must be explicitly provided to cover all Double values; otherwise, values 
> outside the splits specified will be treated as errors.



> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?
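
A small worked sketch of the splits-vs-buckets counting discussed in the comment above, using the splits from the snippet; plain Scala arithmetic, no Bucketizer call.

{code}
// n+1 split points define n buckets: 4 splits -> 3 buckets.
val splits = Array(5.0, 10.0, 250.0, 500.0)
val buckets = splits.sliding(2).collect { case Array(lo, hi) => (lo, hi) }.toList
println(s"${splits.length} splits -> ${buckets.length} buckets: $buckets")
// 4 splits -> 3 buckets: List((5.0,10.0), (10.0,250.0), (250.0,500.0))

// The edge case from the comment: a single split point defines zero buckets.
val oneSplit = Array(5.0)
println(oneSplit.sliding(2).count(_.length == 2))   // 0
{code}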



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-24 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883125#comment-15883125
 ] 

Bill Chambers commented on SPARK-19714:
---

The thing is, QuantileDiscretizer and Bucketizer do fundamentally different 
things, so there are different use cases for each (quantiles vs. actual values). 
It's more of a nuisance than anything: an unclear parameter that seems to 
imply things that are not actually the case.

Here's where it *really* falls apart: if I have a Bucketizer and I provide one 
split, how many buckets do I have?

In Bucketizer I have none! That makes no sense. "Splits" is not the correct word 
here either, because they aren't splits! They're bounds, or containers, or the 
buckets themselves. I think this is more than a documentation issue, even though 
the docs aren't very clear themselves.

> Parameter for mapping continuous features into buckets. With n+1 splits, 
> there are n buckets. A bucket defined by splits x,y holds values in the range 
> [x,y) except the last bucket, which also includes y. Splits should be of 
> length greater than or equal to 3 and strictly increasing. Values at -inf, 
> inf must be explicitly provided to cover all Double values; otherwise, values 
> outside the splits specified will be treated as errors.



> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15678) Not use cache on appends and overwrites

2017-02-24 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883035#comment-15883035
 ] 

Kazuaki Ishizaki commented on SPARK-15678:
--

Sorry for the late reply.

According to the doc comment for {{refreshByPath()}}, I think it should work by 
calling {{refreshByPath()}} once:
{quote}
Invalidate and refresh all the cached data (and the associated metadata) for 
any dataframe that contains the given data source path. Path matching is by 
prefix, i.e. "/" would invalidate everything that is cached.
{quote}

I think it is better to create a new JIRA entry; this JIRA entry is not for 
reporting issues with {{refreshByPath()}}.
What do you think?



> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- we 
> are still using the cached dataset
> {code}
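
For concreteness, a minimal sketch of the commenter's suggestion (editor's 
illustration, not from the JIRA; it assumes a Spark 2.x spark-shell session 
where {{spark.catalog.refreshByPath}} is available, and the path is just a 
writable scratch directory):

{code}
// Sketch: after the files under `dir` are overwritten, one refreshByPath call on
// that path invalidates the stale cached DataFrame, so the next read sees new data.
val dir = "/tmp/refresh-by-path-demo"            // placeholder path

spark.range(1000).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir).cache()
df.count()                                       // 1000

spark.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshByPath(dir)                 // single call; matching is by path prefix

spark.read.parquet(dir).count()                  // expected: 10, not the stale 1000
{code}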



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19161) Improving UDF Docstrings

2017-02-24 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883013#comment-15883013
 ] 

holdenk commented on SPARK-19161:
-

Thanks for working on this, [~zero323]; having better docs for UDFs from 
libraries should be useful :)

> Improving UDF Docstrings
> 
>
> Key: SPARK-19161
> URL: https://issues.apache.org/jira/browse/SPARK-19161
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>
> Current state
> Right now `udf` returns a `UserDefinedFunction` object which doesn't provide 
> a meaningful docstring:
> {code}
> In [1]: from pyspark.sql.types import IntegerType
> In [2]: from pyspark.sql.functions import udf
> In [3]: def _add_one(x):
> """Adds one"""
> if x is not None:
> return x + 1
>...: 
> In [4]: add_one = udf(_add_one, IntegerType())
> In [5]: ?add_one
> Type:        UserDefinedFunction
> String form: <pyspark.sql.functions.UserDefinedFunction object at 0x7f281ed2d198>
> File:        ~/Spark/spark-2.0/python/pyspark/sql/functions.py
> Signature:   add_one(*cols)
> Docstring:
> User defined function in Python
> .. versionadded:: 1.3
> In [6]: help(add_one)
> Help on UserDefinedFunction in module pyspark.sql.functions object:
> class UserDefinedFunction(builtins.object)
>  |  User defined function in Python
>  |  
>  |  .. versionadded:: 1.3
>  |  
>  |  Methods defined here:
>  |  
>  |  __call__(self, *cols)
>  |  Call self as a function.
>  |  
>  |  __del__(self)
>  |  
>  |  __init__(self, func, returnType, name=None)
>  |  Initialize self.  See help(type(self)) for accurate signature.
>  |  
>  |  --
>  |  Data descriptors defined here:
>  |  
>  |  __dict__
>  |  dictionary for instance variables (if defined)
>  |  
>  |  __weakref__
>  |  list of weak references to the object (if defined)
> (END)
> {code}
> It is possible to extract the function:
> {code}
> In [7]: ?add_one.func
> Signature: add_one.func(x)
> Docstring: Adds one
> File:  ~/Spark/spark-2.0/
> Type:  function
> In [8]: help(add_one.func)
> Help on function _add_one in module __main__:
> _add_one(x)
> Adds one
> {code}
> but it assumes that the end user is aware of the distinction between UDFs 
> and built-in functions.
> Proposed
> Copy the input function's docstring to the UDF object or function wrapper.
> {code}
> In [1]: from pyspark.sql.types import IntegerType
> In [2]: from pyspark.sql.functions import udf
> In [3]: def _add_one(x):
> """Adds one"""
> if x is not None:
> return x + 1
>...:
> In [4]: add_one = udf(_add_one, IntegerType())
> In [5]: ?add_one
> Signature: add_one(x)
> Docstring:
> Adds one
> SQL Type: IntegerType
> File:  ~/Workspace/spark/
> Type:  function
> In [6]: help(add_one)
> Help on function _add_one in module __main__:
> _add_one(x)
> Adds one
> 
> SQL Type: IntegerType
> (END)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19161) Improving UDF Docstrings

2017-02-24 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-19161.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16534
[https://github.com/apache/spark/pull/16534]

> Improving UDF Docstrings
> 
>
> Key: SPARK-19161
> URL: https://issues.apache.org/jira/browse/SPARK-19161
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>
> Current state
> Right now `udf` returns a `UserDefinedFunction` object which doesn't provide 
> a meaningful docstring:
> {code}
> In [1]: from pyspark.sql.types import IntegerType
> In [2]: from pyspark.sql.functions import udf
> In [3]: def _add_one(x):
> """Adds one"""
> if x is not None:
> return x + 1
>...: 
> In [4]: add_one = udf(_add_one, IntegerType())
> In [5]: ?add_one
> Type:        UserDefinedFunction
> String form: <pyspark.sql.functions.UserDefinedFunction object at 0x7f281ed2d198>
> File:        ~/Spark/spark-2.0/python/pyspark/sql/functions.py
> Signature:   add_one(*cols)
> Docstring:
> User defined function in Python
> .. versionadded:: 1.3
> In [6]: help(add_one)
> Help on UserDefinedFunction in module pyspark.sql.functions object:
> class UserDefinedFunction(builtins.object)
>  |  User defined function in Python
>  |  
>  |  .. versionadded:: 1.3
>  |  
>  |  Methods defined here:
>  |  
>  |  __call__(self, *cols)
>  |  Call self as a function.
>  |  
>  |  __del__(self)
>  |  
>  |  __init__(self, func, returnType, name=None)
>  |  Initialize self.  See help(type(self)) for accurate signature.
>  |  
>  |  --
>  |  Data descriptors defined here:
>  |  
>  |  __dict__
>  |  dictionary for instance variables (if defined)
>  |  
>  |  __weakref__
>  |  list of weak references to the object (if defined)
> (END)
> {code}
> It is possible to extract the function:
> {code}
> In [7]: ?add_one.func
> Signature: add_one.func(x)
> Docstring: Adds one
> File:  ~/Spark/spark-2.0/
> Type:  function
> In [8]: help(add_one.func)
> Help on function _add_one in module __main__:
> _add_one(x)
> Adds one
> {code}
> but it assumes that the end user is aware of the distinction between UDFs 
> and built-in functions.
> Proposed
> Copy the input function's docstring to the UDF object or function wrapper.
> {code}
> In [1]: from pyspark.sql.types import IntegerType
> In [2]: from pyspark.sql.functions import udf
> In [3]: def _add_one(x):
> """Adds one"""
> if x is not None:
> return x + 1
>...:
> In [4]: add_one = udf(_add_one, IntegerType())
> In [5]: ?add_one
> Signature: add_one(x)
> Docstring:
> Adds one
> SQL Type: IntegerType
> File:  ~/Workspace/spark/
> Type:  function
> In [6]: help(add_one)
> Help on function _add_one in module __main__:
> _add_one(x)
> Adds one
> 
> SQL Type: IntegerType
> (END)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19161) Improving UDF Docstrings

2017-02-24 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-19161:
---

Assignee: Maciej Szymkiewicz

> Improving UDF Docstrings
> 
>
> Key: SPARK-19161
> URL: https://issues.apache.org/jira/browse/SPARK-19161
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>
> Current state
> Right now `udf` returns a `UserDefinedFunction` object which doesn't provide 
> a meaningful docstring:
> {code}
> In [1]: from pyspark.sql.types import IntegerType
> In [2]: from pyspark.sql.functions import udf
> In [3]: def _add_one(x):
> """Adds one"""
> if x is not None:
> return x + 1
>...: 
> In [4]: add_one = udf(_add_one, IntegerType())
> In [5]: ?add_one
> Type:        UserDefinedFunction
> String form: <pyspark.sql.functions.UserDefinedFunction object at 0x7f281ed2d198>
> File:        ~/Spark/spark-2.0/python/pyspark/sql/functions.py
> Signature:   add_one(*cols)
> Docstring:
> User defined function in Python
> .. versionadded:: 1.3
> In [6]: help(add_one)
> Help on UserDefinedFunction in module pyspark.sql.functions object:
> class UserDefinedFunction(builtins.object)
>  |  User defined function in Python
>  |  
>  |  .. versionadded:: 1.3
>  |  
>  |  Methods defined here:
>  |  
>  |  __call__(self, *cols)
>  |  Call self as a function.
>  |  
>  |  __del__(self)
>  |  
>  |  __init__(self, func, returnType, name=None)
>  |  Initialize self.  See help(type(self)) for accurate signature.
>  |  
>  |  --
>  |  Data descriptors defined here:
>  |  
>  |  __dict__
>  |  dictionary for instance variables (if defined)
>  |  
>  |  __weakref__
>  |  list of weak references to the object (if defined)
> (END)
> {code}
> It is possible to extract the function:
> {code}
> In [7]: ?add_one.func
> Signature: add_one.func(x)
> Docstring: Adds one
> File:  ~/Spark/spark-2.0/
> Type:  function
> In [8]: help(add_one.func)
> Help on function _add_one in module __main__:
> _add_one(x)
> Adds one
> {code}
> but it assumes that the end user is aware of the distinction between UDFs 
> and built-in functions.
> Proposed
> Copy the input function's docstring to the UDF object or function wrapper.
> {code}
> In [1]: from pyspark.sql.types import IntegerType
> In [2]: from pyspark.sql.functions import udf
> In [3]: def _add_one(x):
> """Adds one"""
> if x is not None:
> return x + 1
>...:
> In [4]: add_one = udf(_add_one, IntegerType())
> In [5]: ?add_one
> Signature: add_one(x)
> Docstring:
> Adds one
> SQL Type: IntegerType
> File:  ~/Workspace/spark/
> Type:  function
> In [6]: help(add_one)
> Help on function _add_one in module __main__:
> _add_one(x)
> Adds one
> 
> SQL Type: IntegerType
> (END)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19715) Option to Strip Paths in FileSource

2017-02-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882995#comment-15882995
 ] 

Steve Loughran commented on SPARK-19715:


This is a silly question, but has the situation "a filesystem scheme has 
changed" ever actually arisen? I ask because I can see the risk of that being 
lower than the risk that a file with the same name is added to more than one 
directory included in the same scan.

> Option to Strip Paths in FileSource
> ---
>
> Key: SPARK-19715
> URL: https://issues.apache.org/jira/browse/SPARK-19715
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>
> Today, we compare the whole path when deciding if a file is new in the 
> FileSource for structured streaming.  However, this can cause false 
> negatives in the case where the path has changed in a cosmetic way (i.e. 
> changing s3n to s3a).  We should add an option {{fileNameOnly}} that causes 
> the new file check to be based only on the filename (but still store the 
> whole path in the log).
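
To make the proposed comparison concrete, here is a small sketch (editor's 
illustration; {{fileNameOnly}} is only proposed in this ticket, so the helper 
below is hypothetical and not Spark API):

{code}
// Hypothetical sketch: two cosmetically different URIs for the same object compare
// equal once only the filename component is used as the "seen before" key.
import org.apache.hadoop.fs.Path

def seenKey(uri: String, fileNameOnly: Boolean): String =
  if (fileNameOnly) new Path(uri).getName else uri

val a = "s3n://bucket/data/part-00000.parquet"
val b = "s3a://bucket/data/part-00000.parquet"

seenKey(a, fileNameOnly = false) == seenKey(b, fileNameOnly = false)  // false: full paths differ
seenKey(a, fileNameOnly = true)  == seenKey(b, fileNameOnly = true)   // true: same filename
{code}

The same sketch also illustrates the trade-off raised in the comment above: 
distinct files that happen to share a filename in different directories would 
collapse to the same key.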



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19728) PythonUDF with multiple parents shouldn't be pushed down when used as a predicate

2017-02-24 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-19728:
---
Summary:  PythonUDF with multiple parents shouldn't be pushed down when 
used as a predicate  (was:  PythonUDF with multiple parents shouldn't be pushed 
down when used as a predicat)

>  PythonUDF with multiple parents shouldn't be pushed down when used as a 
> predicate
> --
>
> Key: SPARK-19728
> URL: https://issues.apache.org/jira/browse/SPARK-19728
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Prior to Spark 2.0 it was possible to use Python UDF output as a predicate:
> {code}
> from pyspark.sql.functions import udf
> from pyspark.sql.types import BooleanType
> df1 = sc.parallelize([(1, ), (2, )]).toDF(["col_a"])
> df2 = sc.parallelize([(2, ), (3, )]).toDF(["col_b"])
> pred = udf(lambda x, y: x == y, BooleanType())
> df1.join(df2).where(pred("col_a", "col_b")).show()
> {code}
> In Spark 2.0 this is no longer possible:
> {code}
> spark.conf.set("spark.sql.crossJoin.enabled", True)
> df1.join(df2).where(pred("col_a", "col_b")).show()
> ## ...
> ## Py4JJavaError: An error occurred while calling o731.showString.
> : java.lang.RuntimeException: Invalid PythonUDF (col_a#132L, 
> col_b#135L), requires attributes from more than one child.
> ## ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19728) PythonUDF with multiple parents shouldn't be pushed down when used as a predicat

2017-02-24 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-19728:
--

 Summary:  PythonUDF with multiple parents shouldn't be pushed down 
when used as a predicat
 Key: SPARK-19728
 URL: https://issues.apache.org/jira/browse/SPARK-19728
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.1.0, 2.0.0, 2.2.0
Reporter: Maciej Szymkiewicz


Prior to Spark 2.0 it was possible to use Python UDF output as a predicate:

{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

df1 = sc.parallelize([(1, ), (2, )]).toDF(["col_a"])
df2 = sc.parallelize([(2, ), (3, )]).toDF(["col_b"])
pred = udf(lambda x, y: x == y, BooleanType())

df1.join(df2).where(pred("col_a", "col_b")).show()
{code}

In Spark 2.0 this is no longer possible:

{code}
spark.conf.set("spark.sql.crossJoin.enabled", True)
df1.join(df2).where(pred("col_a", "col_b")).show()

## ...
## Py4JJavaError: An error occurred while calling o731.showString.
: java.lang.RuntimeException: Invalid PythonUDF (col_a#132L, 
col_b#135L), requires attributes from more than one child.
## ...
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2017-02-24 Thread Dean Wampler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882778#comment-15882778
 ] 

Dean Wampler commented on SPARK-17147:
--

We're interested in this enhancement. Does anyone know if and when it will be 
implemented in Spark?

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction), I am also happy to contribute a PR.
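
For readers unfamiliar with the gap issue, here is a self-contained sketch 
(editor's illustration against the plain kafka-clients 0.10 consumer API, not 
Spark's CachedKafkaConsumer; the broker address and topic name are placeholders) 
of why offset tracking on a compacted topic has to follow {{record.offset}}:

{code}
// Sketch: on a compacted topic, consecutive records can have non-consecutive offsets,
// so derive "next offset" from each record instead of assuming previousOffset + 1.
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
props.put("group.id", "compaction-check")          // placeholder group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("compacted-topic", 0)  // placeholder topic/partition
consumer.assign(Collections.singletonList(tp))
consumer.seekToBeginning(Collections.singletonList(tp))

var nextOffset = -1L
for (record <- consumer.poll(1000L).asScala) {
  if (nextOffset >= 0 && record.offset != nextOffset) {
    println(s"gap after compaction: expected $nextOffset, got ${record.offset}")
  }
  nextOffset = record.offset + 1                   // the invariant the workaround relies on
}
consumer.close()
{code}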



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19724) create a managed table with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19724:
-
Summary: create a managed table with an existed default location should 
throw an exception  (was: create managed table for hive tables with an existed 
default location should throw an exception)

> create a managed table with an existed default location should throw an 
> exception
> -
>
> Key: SPARK-19724
> URL: https://issues.apache.org/jira/browse/SPARK-19724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is a follow up work after 
> [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583)
> As we discussed in that [PR](https://github.com/apache/spark/pull/16938)
> The following DDL for a managed table with an existing default location should 
> throw an exception:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> CREATE TABLE ... (PARTITIONED BY ...)
> {code}
> Currently there are some situations which are not consistent with the above logic:
> 1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default 
> location
> situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog)
> 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> situation: a hive table succeeds with an existing default location



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19724) create managed table for hive tables with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19724:
-
Description: 
This JIRA is a follow up work after 
[SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583)

As we discussed in that [PR](https://github.com/apache/spark/pull/16938)

The following DDL for a managed table with an existing default location should 
throw an exception:
{code}
CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
CREATE TABLE ... (PARTITIONED BY ...)
{code}
Currently there are some situations which are not consistent with the above logic:

1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default 
location
situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog)

2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
situation: a hive table succeeds with an existing default location


  was:
This JIRA is a follow up work after SPARK-19583

As we discussed in that [PR|https://github.com/apache/spark/pull/16938] 

The following DDL for a hive table with an existing default location should throw 
an exception:
{code}
CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
{code}
Currently it succeeds in this situation.


> create managed table for hive tables with an existed default location should 
> throw an exception
> ---
>
> Key: SPARK-19724
> URL: https://issues.apache.org/jira/browse/SPARK-19724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is a follow up work after 
> [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583)
> As we discussed in that [PR](https://github.com/apache/spark/pull/16938)
> The following DDL for a managed table with an existing default location should 
> throw an exception:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> CREATE TABLE ... (PARTITIONED BY ...)
> {code}
> Currently there are some situations which are not consistent with the above logic:
> 1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default 
> location
> situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog)
> 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> situation: a hive table succeeds with an existing default location



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19711) Bug in gapply function

2017-02-24 Thread Luis Felipe Sant Ana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882475#comment-15882475
 ] 

Luis Felipe Sant Ana edited comment on SPARK-19711 at 2/24/17 11:24 AM:


The problem seems to be in using the string type in the schema. I converted the 
CNPJ field to double

ds <- withColumn (ds, "CNPJ", cast (ds$CNPJ, "double"))
printSchema(ds)

And I modified the schema to return a double:

#gapply function
schema <- structType (structField ("CNPJ", "double"))

result <- gapply(
   ds,
   c("CNPJ", "PID"),
   function(key, x) {

 data.frame(CNPJ = x$CNPJ)

   },
   schema)

This works.


was (Author: luisfsantana_20):
The problem seems to be in using the string type in the schema. I converted the 
CNPJ field to double

ds <- withColumn (ds, "CNPJ", cast (ds$CNPJ, "double"))
PrintSchema(ds)

And I modified the schema to return a double:

#gapply function
schema <- structType (structField ("CNPJ", "double"))

result <- gapply(
   ds,
   c("CNPJ", "PID"),
   function(key, x) {

 data.frame(CNPJ = x$CNPJ)

   },
   schema)

This works.

> Bug in gapply function
> --
>
> Key: SPARK-19711
> URL: https://issues.apache.org/jira/browse/SPARK-19711
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
> Environment: Using Databricks platform.
>Reporter: Luis Felipe Sant Ana
> Attachments: mv_demand_20170221.csv, resume.R
>
>
> I have a dataframe in SparkR like 
>   CNPJPID   DATA N
> 1 10140281000131 100021 2015-04-23 1
> 2 10140281000131 100021 2015-04-27 1
> 3 10140281000131 100021 2015-04-02 1
> 4 10140281000131 100021 2015-11-10 1
> 5 10140281000131 100021 2016-11-14 1
> 6 10140281000131 100021 2015-04-03 1
> And I want to group by columns CNPJ and PID using the gapply() function, filling 
> in the column DATA with dates. Then I fill in the missing dates with zeros.
> The code:
> schema <- structType(structField("CNPJ", "string"), 
>  structField("PID", "string"),
>  structField("DATA", "date"),
>  structField("N", "double"))
> result <- gapply(
>   ds_filtered,
>   c("CNPJ", "PID"),
>   function(key, x) {
> dts <- data.frame(key, DATA = seq(min(as.Date(x$DATA)), as.Date(e_date), 
> "days"))
> colnames(dts)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- data.frame(key, DATA = as.Date(x$DATA), N = x$N)
> colnames(y)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- dplyr::left_join(dts, 
>  y,
>  by = c("CNPJ", "PID", "DATA"))
> 
> y[is.na(y$N), 4] <- 0
> 
> data.frame(CNPJ = as.character(y$CNPJ),
>PID = as.character(y$PID),
>DATA = y$DATA,
>N = y$N)
>   }, 
>   schema)
> Error:
> Error in handleErrors(returnStatus, conn) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 92.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 92.0 (TID 7032, 10.93.243.111, executor 0): org.apache.spark.SparkException: 
> R computation failed with
>  Error in writeType(con, serdeType) : 
>   Unsupported type for serialization factor
> Calls: outputResult ... serializeRow -> writeList -> writeObject -> writeType
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:404)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:386)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> 

[jira] [Comment Edited] (SPARK-19711) Bug in gapply function

2017-02-24 Thread Luis Felipe Sant Ana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882475#comment-15882475
 ] 

Luis Felipe Sant Ana edited comment on SPARK-19711 at 2/24/17 11:23 AM:


The problem seems to be in using the string type in the schema. I converted the 
CNPJ field to double

ds <- withColumn (ds, "CNPJ", cast (ds$CNPJ, "double"))
PrintSchema(ds)

And I modified the schema to return a double:

#gapply function
schema <- structType (structField ("CNPJ", "double"))

result <- gapply(
   ds,
   c("CNPJ", "PID"),
   function(key, x) {

 data.frame(CNPJ = x$CNPJ)

   },
   schema)

This works.


was (Author: luisfsantana_20):
The problem seems to be in using the string type in the schema. I converted the 
CNPJ field to double

ds <- withColumn (ds, "CNPJ", cast (ds $ CNPJ, "double"))
PrintSchema (ds)

And I modified the schema to return a double:

#gapply function
schema <- structType (structField ("CNPJ", "double"))

Result <- gapply (
   Ds,
   C ("CNPJ", "PID"),
   Function (key, x) {

 Data.frame (CNPJ = x$CNPJ)

   },
   Schema)

This works.

> Bug in gapply function
> --
>
> Key: SPARK-19711
> URL: https://issues.apache.org/jira/browse/SPARK-19711
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
> Environment: Using Databricks platform.
>Reporter: Luis Felipe Sant Ana
> Attachments: mv_demand_20170221.csv, resume.R
>
>
> I have a dataframe in SparkR like 
>   CNPJPID   DATA N
> 1 10140281000131 100021 2015-04-23 1
> 2 10140281000131 100021 2015-04-27 1
> 3 10140281000131 100021 2015-04-02 1
> 4 10140281000131 100021 2015-11-10 1
> 5 10140281000131 100021 2016-11-14 1
> 6 10140281000131 100021 2015-04-03 1
> And I want to group by columns CNPJ and PID using the gapply() function, filling 
> in the column DATA with dates. Then I fill in the missing dates with zeros.
> The code:
> schema <- structType(structField("CNPJ", "string"), 
>  structField("PID", "string"),
>  structField("DATA", "date"),
>  structField("N", "double"))
> result <- gapply(
>   ds_filtered,
>   c("CNPJ", "PID"),
>   function(key, x) {
> dts <- data.frame(key, DATA = seq(min(as.Date(x$DATA)), as.Date(e_date), 
> "days"))
> colnames(dts)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- data.frame(key, DATA = as.Date(x$DATA), N = x$N)
> colnames(y)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- dplyr::left_join(dts, 
>  y,
>  by = c("CNPJ", "PID", "DATA"))
> 
> y[is.na(y$N), 4] <- 0
> 
> data.frame(CNPJ = as.character(y$CNPJ),
>PID = as.character(y$PID),
>DATA = y$DATA,
>N = y$N)
>   }, 
>   schema)
> Error:
> Error in handleErrors(returnStatus, conn) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 92.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 92.0 (TID 7032, 10.93.243.111, executor 0): org.apache.spark.SparkException: 
> R computation failed with
>  Error in writeType(con, serdeType) : 
>   Unsupported type for serialization factor
> Calls: outputResult ... serializeRow -> writeList -> writeObject -> writeType
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:404)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:386)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> 

[jira] [Commented] (SPARK-19711) Bug in gapply function

2017-02-24 Thread Luis Felipe Sant Ana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882475#comment-15882475
 ] 

Luis Felipe Sant Ana commented on SPARK-19711:
--

The problem seems to be in using the string type in the schema. I converted the 
CNPJ field to double

ds <- withColumn (ds, "CNPJ", cast (ds $ CNPJ, "double"))
PrintSchema (ds)

And I modified the schema to return a double:

#gapply function
schema <- structType (structField ("CNPJ", "double"))

Result <- gapply (
   Ds,
   C ("CNPJ", "PID"),
   Function (key, x) {

 Data.frame (CNPJ = x$CNPJ)

   },
   Schema)

This works.

> Bug in gapply function
> --
>
> Key: SPARK-19711
> URL: https://issues.apache.org/jira/browse/SPARK-19711
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
> Environment: Using Databricks platform.
>Reporter: Luis Felipe Sant Ana
> Attachments: mv_demand_20170221.csv, resume.R
>
>
> I have a dataframe in SparkR like 
>   CNPJPID   DATA N
> 1 10140281000131 100021 2015-04-23 1
> 2 10140281000131 100021 2015-04-27 1
> 3 10140281000131 100021 2015-04-02 1
> 4 10140281000131 100021 2015-11-10 1
> 5 10140281000131 100021 2016-11-14 1
> 6 10140281000131 100021 2015-04-03 1
> And I want to group by columns CNPJ and PID using the gapply() function, filling 
> in the column DATA with dates. Then I fill in the missing dates with zeros.
> The code:
> schema <- structType(structField("CNPJ", "string"), 
>  structField("PID", "string"),
>  structField("DATA", "date"),
>  structField("N", "double"))
> result <- gapply(
>   ds_filtered,
>   c("CNPJ", "PID"),
>   function(key, x) {
> dts <- data.frame(key, DATA = seq(min(as.Date(x$DATA)), as.Date(e_date), 
> "days"))
> colnames(dts)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- data.frame(key, DATA = as.Date(x$DATA), N = x$N)
> colnames(y)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- dplyr::left_join(dts, 
>  y,
>  by = c("CNPJ", "PID", "DATA"))
> 
> y[is.na(y$N), 4] <- 0
> 
> data.frame(CNPJ = as.character(y$CNPJ),
>PID = as.character(y$PID),
>DATA = y$DATA,
>N = y$N)
>   }, 
>   schema)
> Error:
> Error in handleErrors(returnStatus, conn) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 92.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 92.0 (TID 7032, 10.93.243.111, executor 0): org.apache.spark.SparkException: 
> R computation failed with
>  Error in writeType(con, serdeType) : 
>   Unsupported type for serialization factor
> Calls: outputResult ... serializeRow -> writeList -> writeObject -> writeType
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:404)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:386)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> 
