[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project

2015-10-28 Thread Dibyendu Bhattacharya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977874#comment-14977874
 ] 

Dibyendu Bhattacharya commented on SPARK-11045:
---

Hi [~tdas], let me know what your thoughts are on this JIRA. Do you think this 
should continue to be part of Spark Packages, or would it be good to include it 
in the Spark project so that the community gets a reliable and better 
alternative to the Direct API (for those who do not want the Direct API)? 

> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to 
> Apache Spark Project
> 
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the Receiver-based Low 
> Level Kafka Consumer from spark-packages 
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the 
> Apache Spark project.
> This Kafka consumer has been around for more than a year and has matured over 
> time. I see many adoptions of this package, and I receive positive feedback 
> that this consumer gives better performance and fault-tolerance capabilities.
> The primary intent of this JIRA is to give the community a better alternative 
> if they want to use the Receiver-based model.
> If this consumer makes it into Spark Core, it will definitely see more 
> adoption and support from the community and help the many who still prefer 
> the Receiver-based model of Kafka consumer.
> I understand that the Direct Stream is the consumer that can give exactly-once 
> semantics and uses the Kafka low-level API, which is good. But the Direct 
> Stream has concerns around recovering checkpoints after a driver code change. 
> Application developers need to manage their own offsets, which is complex. 
> Even if someone does manage their own offsets, it limits the parallelism 
> Spark Streaming can achieve. If someone wants more parallelism and sets 
> spark.streaming.concurrentJobs to more than 1, they can no longer rely on 
> storing offsets externally, as they have no control over which batch will run 
> in which sequence.
> Furthermore, the Direct Stream has higher latency, as it fetches messages 
> from Kafka during the RDD action. Also, the number of RDD partitions is 
> limited to the number of topic partitions, so unless your Kafka topic has 
> enough partitions, you have limited parallelism while processing the RDD.
> Due to the above concerns, many people who do not need exactly-once semantics 
> still prefer the Receiver-based model. Unfortunately, when they fall back to 
> the KafkaUtils.createStream approach, which uses the Kafka High Level 
> Consumer, there are other issues around the reliability of the Kafka High 
> Level API. The Kafka High Level API is buggy and has serious issues around 
> consumer re-balance, so I do not think it is correct to advise people to use 
> KafkaUtils.createStream in production.
> The better option at present is to use the consumer from spark-packages. It 
> uses the Kafka Low Level Consumer API, stores offsets in ZooKeeper, and can 
> recover from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the 
> blocks, i.e. the total size of the messages pulled from Kafka, whereas in 
> Spark rate limiting is done by controlling the number of messages. The issue 
> with throttling by number of messages is that if message sizes vary, the 
> block size will also vary. Say your Kafka topic has messages ranging from 
> 10 KB to 500 KB: throttling by number of messages can never give a 
> deterministic block size, so there is no guarantee that memory back-pressure 
> can really take effect (see the sketch that follows this description).
> 3. This consumer uses the Kafka low-level API, which gives better performance 
> than the KafkaUtils.createStream-based High Level API.
> 4. This consumer can give an end-to-end no-data-loss channel if enabled with 
> the WAL.
> By accepting this low-level Kafka consumer from spark-packages into the 
> Apache Spark project, we will give the community better options for Kafka 
> connectivity, for both the receiver-less and the receiver-based models. 
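To illustrate the block-size point in item 2 above, here is a small, 
hypothetical Scala sketch (not taken from the spark-packages consumer; Message, 
byteBudget and fillBlock are illustrative names only): filling a block against 
a byte budget keeps block sizes roughly constant even when individual message 
sizes range from 10 KB to 500 KB, which throttling by message count cannot 
guarantee.

{code}
// Hypothetical sketch only: throttle by total bytes, not by message count.
object ByteBudgetThrottle {
  final case class Message(key: String, payload: Array[Byte])

  // Fill one block until the byte budget is exhausted, so the resulting block
  // size stays roughly constant even when individual message sizes vary.
  def fillBlock(source: Iterator[Message], byteBudget: Long): Seq[Message] = {
    val block = scala.collection.mutable.ArrayBuffer.empty[Message]
    var bytes = 0L
    while (source.hasNext && bytes < byteBudget) {
      val m = source.next()
      block += m
      bytes += m.payload.length
    }
    block.toSeq
  }

  def main(args: Array[String]): Unit = {
    val rnd = new scala.util.Random(42)
    // Simulated stream with message sizes between 10 KB and 500 KB.
    val stream = Iterator.continually(
      Message("k", Array.fill(10 * 1024 + rnd.nextInt(490 * 1024))(0: Byte)))
    val block = fillBlock(stream, byteBudget = 4L * 1024 * 1024) // ~4 MB per block
    println(s"block of ${block.size} messages, " +
      s"${block.map(_.payload.length.toLong).sum} bytes")
  }
}
{code}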






[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-10-28 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-10517:
---
Affects Version/s: 1.5.1  (was: 1.5.0)

> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> On the HTTP application UI, the "Output" field is empty when using 
> DataFrameWriter.json.
> It should be the size of the written bytes.
> Screenshot attached.
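A minimal, hypothetical reproduction sketch against the Spark 1.5.x API (the 
output path and object name are illustrative): after the write completes, the 
report above says the job's "Output" column on the web UI stays empty instead 
of showing the number of bytes written.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Repro10517 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SPARK-10517 repro").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Write a small DataFrame as JSON, then check the "Output" column for the
    // resulting job on the web UI (http://localhost:4040).
    val df = sc.parallelize(1 to 1000).map(i => (i, s"row-$i")).toDF("id", "value")
    df.write.json("/tmp/spark-10517-output")
    sc.stop()
  }
}
{code}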






[jira] [Commented] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-10-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977890#comment-14977890
 ] 

Maciej Bryński commented on SPARK-10517:


No.
I think the "Output" field is always empty.


> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> On the HTTP application UI, the "Output" field is empty when using 
> DataFrameWriter.json.
> It should be the size of the written bytes.
> Screenshot attached.






[jira] [Commented] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private

2015-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977891#comment-14977891
 ] 

Sean Owen commented on SPARK-11332:
---

They need to be a Contributor in JIRA. I can add this person. If you end up 
regularly adding new people we can make you an admin. 

> WeightedLeastSquares should use ml features generic Instance class instead of 
> private
> -
>
> Key: SPARK-11332
> URL: https://issues.apache.org/jira/browse/SPARK-11332
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>
> WeightedLeastSquares should use the common Instance class in ml.feature 
> instead of a private one.






[jira] [Commented] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private

2015-10-28 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977893#comment-14977893
 ] 

DB Tsai commented on SPARK-11332:
-

Thanks. Please help me add him as a contributor. 

> WeightedLeastSquares should use ml features generic Instance class instead of 
> private
> -
>
> Key: SPARK-11332
> URL: https://issues.apache.org/jira/browse/SPARK-11332
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>
> WeightedLeastSquares should use the common Instance class in ml.feature 
> instead of a private one.






[jira] [Created] (SPARK-11365) consolidate aggregates for summary statistics in weighted least squares

2015-10-28 Thread holdenk (JIRA)
holdenk created SPARK-11365:
---

 Summary: consolidate aggregates for summary statistics in weighted 
least squares
 Key: SPARK-11365
 URL: https://issues.apache.org/jira/browse/SPARK-11365
 Project: Spark
  Issue Type: Improvement
Reporter: holdenk
Priority: Minor


Right now we duplicate some aggregates in the aggregator; we could simplify 
this a bit.
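As a hedged illustration of the idea (illustrative names only, not the actual 
WeightedLeastSquares aggregator), a single fold can carry the weighted count, 
mean, and sum of squared deviations together instead of maintaining separate 
aggregates:

{code}
// Minimal sketch, assuming a Welford-style update; not the actual aggregator.
final case class WeightedStats(count: Long, weightSum: Double, mean: Double, m2: Double) {
  def add(x: Double, w: Double): WeightedStats = {
    val newWeightSum = weightSum + w
    val delta = x - mean
    val newMean = mean + delta * w / newWeightSum
    WeightedStats(count + 1, newWeightSum, newMean, m2 + w * delta * (x - newMean))
  }
  // Weighted (population-style) variance derived from the same single pass.
  def variance: Double = if (weightSum > 0) m2 / weightSum else 0.0
}

object WeightedStats {
  val zero: WeightedStats = WeightedStats(0L, 0.0, 0.0, 0.0)
}
{code}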






[jira] [Resolved] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private

2015-10-28 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-11332.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9325
[https://github.com/apache/spark/pull/9325]

> WeightedLeastSquares should use ml features generic Instance class instead of 
> private
> -
>
> Key: SPARK-11332
> URL: https://issues.apache.org/jira/browse/SPARK-11332
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> WeightedLeastSquares should use the common Instance class in ml.feature 
> instead of a private one.






[jira] [Commented] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977940#comment-14977940
 ] 

Apache Spark commented on SPARK-11246:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/9326

> [1.5] Table cache for Parquet broken in 1.5
> ---
>
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: David Ross
>
> Since upgrading to 1.5.1, {{CACHE TABLE}} works great for all tables except 
> for parquet tables, likely related to the native parquet reader.
> Here are the steps for a parquet table:
> {code}
> create table test_parquet stored as parquet as select 1;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
> {code}
> And then caching:
> {code}
> cache table test_parquet;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
> {code}
> Note it isn't cached. I have included spark log output for the {{cache 
> table}} and {{explain}} statements below.
> ---
> Here's the same for non-parquet table:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
> {code}
> And then caching:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 
> 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
> (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
> {code}
> Note that the table seems to be cached.
> ---
> Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
> {{false}}, parquet tables work the same as non-parquet tables with caching. 
> This is a reasonable workaround for us, but ideally, we would like to benefit 
> from the native reading.
> ---
> Spark logs for {{cache table}} for {{test_parquet}}:
> {code}
> 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is MYSQL
> 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4196713, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as 
> values in memory (estimated size 210.6 KB, free 128.4 MB)
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called 
> with curMem=4412393, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored 
> as bytes in memory (estimated size 19.8 KB, free 128.3 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
> memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
> AccessController.java:-2
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4432658, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as 
> values in memory (estimated size 210.6 KB, free 128.1 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 
> on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 
> on 192
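A hypothetical Scala sketch of the workaround mentioned in the description 
above (Spark 1.5.x API; the table name comes from the repro steps, and the 
behaviour is as reported here, not independently verified):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CacheWorkaround11246 {
  def main(args: Array[String]): Unit = {
    // Submit with spark-submit against the cluster that hosts the Hive metastore.
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-11246 workaround"))
    val hiveContext = new HiveContext(sc)

    // Fall back from the native Parquet conversion so that CACHE TABLE behaves
    // the same way it does for non-parquet tables, per the report.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
    hiveContext.sql("cache table test_parquet")
    hiveContext.sql("explain select * from test_parquet").collect().foreach(println)

    sc.stop()
  }
}
{code}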

[jira] [Assigned] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11246:


Assignee: Apache Spark

> [1.5] Table cache for Parquet broken in 1.5
> ---
>
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: David Ross
>Assignee: Apache Spark
>
> Since upgrading to 1.5.1, {{CACHE TABLE}} works great for all tables except 
> for parquet tables, likely related to the native parquet reader.
> Here are the steps for a parquet table:
> {code}
> create table test_parquet stored as parquet as select 1;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
> {code}
> And then caching:
> {code}
> cache table test_parquet;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
> {code}
> Note it isn't cached. I have included spark log output for the {{cache 
> table}} and {{explain}} statements below.
> ---
> Here's the same for non-parquet table:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
> {code}
> And then caching:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 
> 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
> (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
> {code}
> Note that the table seems to be cached.
> ---
> Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
> {{false}}, parquet tables work the same as non-parquet tables with caching. 
> This is a reasonable workaround for us, but ideally, we would like to benefit 
> from the native reading.
> ---
> Spark logs for {{cache table}} for {{test_parquet}}:
> {code}
> 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is MYSQL
> 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4196713, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as 
> values in memory (estimated size 210.6 KB, free 128.4 MB)
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called 
> with curMem=4412393, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored 
> as bytes in memory (estimated size 19.8 KB, free 128.3 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
> memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
> AccessController.java:-2
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4432658, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as 
> values in memory (estimated size 210.6 KB, free 128.1 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 
> on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 
> on 192.168.99.9:50262 in memory (size: 21.1 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO stora

[jira] [Assigned] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11246:


Assignee: (was: Apache Spark)

> [1.5] Table cache for Parquet broken in 1.5
> ---
>
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: David Ross
>
> Since upgrading to 1.5.1, {{CACHE TABLE}} works great for all tables except 
> for parquet tables, likely related to the native parquet reader.
> Here are the steps for a parquet table:
> {code}
> create table test_parquet stored as parquet as select 1;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141]
> {code}
> And then caching:
> {code}
> cache table test_parquet;
> explain select * from test_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> Scan 
> ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174]
> {code}
> Note it isn't cached. I have included spark log output for the {{cache 
> table}} and {{explain}} statements below.
> ---
> Here's the same for non-parquet table:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None)
> {code}
> And then caching:
> {code}
> cache table test_no_parquet;
> explain select * from test_no_parquet;
> {code}
> With output:
> {code}
> == Physical Plan ==
> InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, 
> 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], 
> (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet))
> {code}
> Note that the table seems to be cached.
> ---
> Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to 
> {{false}}, parquet tables work the same as non-parquet tables with caching. 
> This is a reasonable workaround for us, but ideally, we would like to benefit 
> from the native reading.
> ---
> Spark logs for {{cache table}} for {{test_parquet}}:
> {code}
> 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running 
> query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query 
> "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
> closing
> 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, 
> underlying DB is MYSQL
> 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4196713, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as 
> values in memory (estimated size 210.6 KB, free 128.4 MB)
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called 
> with curMem=4412393, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored 
> as bytes in memory (estimated size 19.8 KB, free 128.3 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in 
> memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at 
> AccessController.java:-2
> 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default 
> tbl=test_parquet
> 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant   
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=test_parquet
> 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called 
> with curMem=4432658, maxMem=139009720
> 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as 
> values in memory (estimated size 210.6 KB, free 128.1 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 
> on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 
> on 192.168.99.9:50262 in memory (size: 21.1 KB, free: 132.2 MB)
> 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Remo

[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977961#comment-14977961
 ] 

Cheng Lian commented on SPARK-11103:


Quoted from my reply on the user list:

For 1: This one is pretty simple and safe; I'd like to have this in 1.5.2, or 
1.5.3 if we can't make it into 1.5.2.

For 2: Actually, we only need to calculate the intersection of all file 
schemata. We can make ParquetRelation.mergeSchemaInParallel return two 
StructTypes: the first is the original merged schema, the other is the 
intersection of all file schemata, which only contains fields that exist in 
all file schemata. Then we decide which filters to push down according to the 
second StructType.

For 3: The idea I came up with at first was similar to this one. Instead of 
pulling all file schemata to the driver side, we can push the filter push-down 
code to the executor side; namely, pass candidate filters to the executor side 
and compute the Parquet filter predicates according to each individual file 
schema. I haven't looked into this direction in depth, but we could probably 
put this part into CatalystReadSupport, which is now initialized on the 
executor side. However, the correctness of this approach can only be 
guaranteed by the defensive filtering we do in Spark SQL (i.e. applying all 
the filters whether or not they are pushed down), and we are considering 
removing that because it imposes unnecessary performance cost. This makes me 
hesitant to go down this route.

From my side, I think this is a bug in Parquet. Parquet was designed to 
support schema evolution. When scanning a Parquet file, if a column exists in 
the requested schema but is missing in the file schema, that column is filled 
with null. This should also hold for pushed-down filter predicates. For 
example, if the filter "a = 1" is pushed down but column "a" doesn't exist in 
the Parquet file being scanned, it's safe to assume "a" is null in all records 
and drop all of them. On the contrary, if "a IS NULL" is pushed down, all 
records should be preserved.

Filed PARQUET-389 to track this issue.
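
A hypothetical sketch of option 2 above (illustrative only, not the actual 
ParquetRelation code): alongside the merged schema, compute the intersection of 
all per-file schemata and only push down filters whose columns appear in every 
file.

{code}
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.types.StructType

object SafePushdown {
  // Fields (by name) that are present in every file schema.
  def intersection(fileSchemata: Seq[StructType]): StructType = {
    require(fileSchemata.nonEmpty, "need at least one file schema")
    val common = fileSchemata.map(_.fieldNames.toSet).reduce(_ intersect _)
    StructType(fileSchemata.head.fields.filter(f => common.contains(f.name)))
  }

  // Split candidate filters into (safe to push down, keep on the Spark side).
  def splitFilters(filters: Seq[Filter], safeSchema: StructType): (Seq[Filter], Seq[Filter]) =
    filters.partition {
      case EqualTo(attribute, _) => safeSchema.fieldNames.contains(attribute)
      case _                     => false // conservatively evaluate everything else in Spark
    }
}
{code}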

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>
> When evolving a schema in parquet files, Spark properly exposes all columns 
> found in the different parquet files, but when trying to query the data, it 
> is not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaC

[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11103:
---
Assignee: Hyukjin Kwon

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>Assignee: Hyukjin Kwon
>
> When evolving a schema in parquet files, Spark properly exposes all columns 
> found in the different parquet files, but when trying to query the data, it 
> is not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.sche

[jira] [Comment Edited] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977961#comment-14977961
 ] 

Cheng Lian edited comment on SPARK-11103 at 10/28/15 8:32 AM:
--

Quoted from my reply on the user list:

For 1: This one is pretty simple and safe; I'd like to have this in 1.5.2, or 
1.5.3 if we can't make it into 1.5.2.

For 2: I'd like to have this in Spark master. Actually, we only need to 
calculate the intersection of all file schemata. We can make 
ParquetRelation.mergeSchemaInParallel return two StructTypes: the first is the 
original merged schema, the other is the intersection of all file schemata, 
which only contains fields that exist in all file schemata. Then we decide 
which filters to push down according to the second StructType.

For 3: The idea I came up with at first was similar to this one. Instead of 
pulling all file schemata to the driver side, we can push the filter push-down 
code to the executor side; namely, pass candidate filters to the executor side 
and compute the Parquet filter predicates according to each individual file 
schema. I haven't looked into this direction in depth, but we could probably 
put this part into CatalystReadSupport, which is now initialized on the 
executor side. However, the correctness of this approach can only be 
guaranteed by the defensive filtering we do in Spark SQL (i.e. applying all 
the filters whether or not they are pushed down), and we are considering 
removing that because it imposes unnecessary performance cost. This makes me 
hesitant to go down this route.

From my side, I think this is a bug in Parquet. Parquet was designed to 
support schema evolution. When scanning a Parquet file, if a column exists in 
the requested schema but is missing in the file schema, that column is filled 
with null. This should also hold for pushed-down filter predicates. For 
example, if the filter "a = 1" is pushed down but column "a" doesn't exist in 
the Parquet file being scanned, it's safe to assume "a" is null in all records 
and drop all of them. On the contrary, if "a IS NULL" is pushed down, all 
records should be preserved.

Filed PARQUET-389 to track this issue.


was (Author: lian cheng):
Quoted from my reply on the user list:

For 1: This one is pretty simple and safe; I'd like to have this in 1.5.2, or 
1.5.3 if we can't make it into 1.5.2.

For 2: Actually, we only need to calculate the intersection of all file 
schemata. We can make ParquetRelation.mergeSchemaInParallel return two 
StructTypes: the first is the original merged schema, the other is the 
intersection of all file schemata, which only contains fields that exist in 
all file schemata. Then we decide which filters to push down according to the 
second StructType.

For 3: The idea I came up with at first was similar to this one. Instead of 
pulling all file schemata to the driver side, we can push the filter push-down 
code to the executor side; namely, pass candidate filters to the executor side 
and compute the Parquet filter predicates according to each individual file 
schema. I haven't looked into this direction in depth, but we could probably 
put this part into CatalystReadSupport, which is now initialized on the 
executor side. However, the correctness of this approach can only be 
guaranteed by the defensive filtering we do in Spark SQL (i.e. applying all 
the filters whether or not they are pushed down), and we are considering 
removing that because it imposes unnecessary performance cost. This makes me 
hesitant to go down this route.

From my side, I think this is a bug in Parquet. Parquet was designed to 
support schema evolution. When scanning a Parquet file, if a column exists in 
the requested schema but is missing in the file schema, that column is filled 
with null. This should also hold for pushed-down filter predicates. For 
example, if the filter "a = 1" is pushed down but column "a" doesn't exist in 
the Parquet file being scanned, it's safe to assume "a" is null in all records 
and drop all of them. On the contrary, if "a IS NULL" is pushed down, all 
records should be preserved.

Filed PARQUET-389 to track this issue.

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>Assignee: Hyukjin Kwon
>
> When evolving a schema in parquet files, Spark properly exposes all columns 
> found in the different parquet files, but when trying to quer

[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11103:
---
Target Version/s: 1.5.3, 1.6.0  (was: 1.6.0)

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>Assignee: Hyukjin Kwon
>
> When evolving a schema in parquet files, Spark properly exposes all columns 
> found in the different parquet files, but when trying to query the data, it 
> is not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at

[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2015-10-28 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977969#comment-14977969
 ] 

Saisai Shao commented on SPARK-2089:


Hi [~pwendell], [~mridulm80], [~sandyr] and [~lianhuiwang], I still think 
enabling this feature would be quite useful. For many workloads, such as batch 
workloads where dynamic allocation is not necessarily used, there is currently 
no way to provide locality hints for the AM to allocate containers, and this 
can hurt performance significantly if data is fetched remotely, especially in 
a large YARN cluster. So I'm inclined to revive this feature, but perhaps in 
another way (the current way is very Hadoop-specific and cannot work in 
yarn-client mode). The locality computation can also reuse the implementation 
from SPARK-4352. I'd like to spend some time taking a crack at this issue. 
What are your opinions and concerns? Is it still worth addressing, given that 
it has been broken for many versions? Any comment is greatly appreciated; 
thanks a lot.

> With YARN, preferredNodeLocalityData isn't honored 
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when 
> constructing a Spark context, which will dictate where to request executor 
> containers.
> This is currently broken because of a race condition. The Spark-YARN code 
> runs the user class and waits for it to start up a SparkContext. During its 
> initialization, the SparkContext will create a YarnClusterScheduler, which 
> notifies a monitor in the Spark-YARN code. The Spark-YARN code then 
> immediately fetches the preferredNodeLocationData from the SparkContext and 
> uses it to start requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, 
> setting preferredNodeLocationData comes after the rest of the initialization, 
> so, if the Spark-YARN code comes around quickly enough after being notified, 
> the data that's fetched is the empty, unset version. This occurred during all 
> of my runs.
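A simplified, hypothetical Scala sketch of the initialization-order problem 
described above (FakeSparkContext and onReady are illustrative names, not Spark 
internals): the notification fires before the field is assigned, so the 
notified code sees the unset value.

{code}
// Hypothetical sketch only; mirrors the report's ordering, not Spark's code.
class FakeSparkContext(hints: Map[String, Seq[String]],
                       onReady: FakeSparkContext => Unit) {
  // `this` is handed out before the field below is initialized, mirroring the
  // report: the notified party reads preferredNodeLocationData too early.
  onReady(this)
  val preferredNodeLocationData: Map[String, Seq[String]] = hints
}

object RaceDemo extends App {
  val onReady: FakeSparkContext => Unit = { sc =>
    // Runs while the constructor is still executing, so this prints "null":
    // the empty, unset view that the container-requesting code ends up using.
    println(s"locality hints at notification time: ${sc.preferredNodeLocationData}")
  }
  new FakeSparkContext(Map("host1" -> Seq("split-0")), onReady)
}
{code}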






[jira] [Assigned] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11103:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>Assignee: Apache Spark
>
> When evolving a schema in parquet files, Spark properly exposes all columns 
> found in the different parquet files, but when trying to query the data, it 
> is not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

[jira] [Assigned] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11103:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>Assignee: Hyukjin Kwon
>
> When evolving a schema in parquet files, Spark properly exposes all columns 
> found in the different parquet files, but when trying to query the data, it 
> is not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977976#comment-14977976
 ] 

Apache Spark commented on SPARK-11103:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/9327
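
Until the fix is merged, one possible mitigation (my own suggestion, not something 
stated in this ticket) is to disable Parquet filter pushdown so the predicate on the 
missing column is evaluated by Spark after the scan rather than handed to the Parquet 
reader. A minimal sketch, assuming an existing sqlContext:

{code}
# Mitigation sketch (assumption): turn off Parquet filter pushdown, then re-run
# the failing query; the filter is applied by Spark instead of the Parquet reader.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
sqlContext.sql("select col1 from `table3` where col2 = 2").show()
{code}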

> Filter applied on Merged Parquet schema with new column fails with 
> (java.lang.IllegalArgumentException: Column [column_name] was not found in 
> schema!)
> 
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>Assignee: Hyukjin Kwon
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found in the different Parquet files, but when querying the data it is not 
> possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd

[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-10-28 Thread Koert Kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978000#comment-14978000
 ] 

Koert Kuipers commented on SPARK-3655:
--

spark-sorted (https://github.com/tresata/spark-sorted) allows you to
process your data in a similar way to what you did below, but without
materializing the sorted data as a list in memory. Yes, it uses a custom
partitioner, and yes, it always shuffles the data.

Are you sure the shuffle is your issue?
I assume your final output is not the sorted list? What do you do with the
sorted list after the steps shown below?
If your final output is RDD[(String, List[(Long, String)])], then there is
no way around materializing the list in memory, and spark-sorted will
not give you any benefit over what you did below.
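
For reference, below is a minimal sketch of the streaming secondary-sort pattern this 
thread is about, using the existing RDD API (repartitionAndSortWithinPartitions). The 
input RDD, key layout, and aggregation are my own illustrative assumptions, not taken 
from the original poster's job:

{code}
from itertools import groupby

# Assumed input: an RDD `events` of (user, timestamp, payload) tuples.
num_partitions = 8  # illustrative choice

keyed = events.map(lambda e: ((e[0], e[1]), e))            # composite key (user, ts)
sorted_rdd = keyed.repartitionAndSortWithinPartitions(
    numPartitions=num_partitions,
    partitionFunc=lambda ck: hash(ck[0]))                  # partition on user only

def per_group(partition):
    # Within a partition, records arrive grouped by user and sorted by ts,
    # so each group can be streamed without building a per-key list in memory.
    for user, group in groupby(partition, key=lambda kv: kv[0][0]):
        yield user, sum(1 for _ in group)                  # example aggregation

counts = sorted_rdd.mapPartitions(per_group)
{code}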




> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11366) binary functions (methods) such as rlike in pyspark.sql.column only accept strings but should also accept another Column

2015-10-28 Thread Jose Antonio (JIRA)
Jose Antonio created SPARK-11366:


 Summary: binary functions (methods) such as rlike in 
pyspark.sql.column only accept strings but should also accept another Column
 Key: SPARK-11366
 URL: https://issues.apache.org/jira/browse/SPARK-11366
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0, 1.6.0
 Environment: All.
Reporter: Jose Antonio


Binary functions (methods) such as rlike in pyspark.sql.column only accept 
strings, but they should also accept another Column.

For instance: 

df.mycolumn.rlike(df.mycolumn2)

Gives the error:
Method rlike([class org.apache.spark.sql.Column]) does not exist

but 

df.mycolumn.rlike('test')  

is OK.
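
A possible workaround until a Column overload exists (assuming the SQL parser's RLIKE 
accepts a column reference on the right-hand side, which I have not verified for every 
affected version) is to build the predicate as a SQL expression string:

{code}
from pyspark.sql.functions import expr

# Workaround sketch: express the column-to-column regex match as a SQL
# expression instead of calling Column.rlike with a Column argument.
df.filter(expr("mycolumn rlike mycolumn2")).show()
{code}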







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures

2015-10-28 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978007#comment-14978007
 ] 

Sun Rui commented on SPARK-11167:
-

It seems time-consuming and not desirable to go through the entire input 
data to infer the tightest type or to check whether heterogeneous values exist. I 
agree with [~zero323] that it is practical to issue a warning in such a case. 
Users should be responsible for providing homogeneous input data. Otherwise, 
Spark may throw a runtime exception on a type mismatch.

> Incorrect type resolution on heterogeneous data structures
> --
>
> Key: SPARK-11167
> URL: https://issues.apache.org/jira/browse/SPARK-11167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Maciej Szymkiewicz
>
> If a structure contains heterogeneous values, type inference incorrectly 
> assigns the type of the first encountered element as the type of the whole 
> structure. This problem affects both lists:
> {code}
> SparkR:::infer_type(list(a=1, b="a"))
> ## [1] "array"
> SparkR:::infer_type(list(a="a", b=1))
> ##  [1] "array"
> {code}
> and environments:
> {code}
> SparkR:::infer_type(as.environment(list(a=1, b="a")))
> ## [1] "map"
> SparkR:::infer_type(as.environment(list(a="a", b=1)))
> ## [1] "map"
> {code}
> This results in errors during data collection and other operations on 
> DataFrames:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$foo <- list(list("1", 2), list(3, 4))
> sdf <- createDataFrame(sqlContext, ldf)
> collect(sdf)
> ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 
> 9)
> ## scala.MatchError: 2.0 (of class java.lang.Double)
> ## ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11367) Python LinearRegression should support setting solver

2015-10-28 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11367:
---

 Summary: Python LinearRegression should support setting solver
 Key: SPARK-11367
 URL: https://issues.apache.org/jira/browse/SPARK-11367
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor


Python LinearRegression should support setting solver("auto", "normal", 
"l-bfgs")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11368) Spark scan all partitions when using Python UDF and filter over partitioned column is given

2015-10-28 Thread Maciej Bryński (JIRA)
Maciej Bryński created SPARK-11368:
--

 Summary: Spark scan all partitions when using Python UDF and 
filter over partitioned column is given
 Key: SPARK-11368
 URL: https://issues.apache.org/jira/browse/SPARK-11368
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Maciej Bryński
Priority: Critical


Hi,
I think this is a huge performance bug.

I created a Parquet file partitioned by a column.
Then I ran a query with a filter over the partition column and a filter with a UDF.
The result is that all partitions are scanned.

Sample data:
{code}
rdd = sc.parallelize(range(0,1000)).map(lambda x: 
Row(id=x%1000,value=x)).repartition(1)
df = sqlCtx.createDataFrame(rdd)
df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
df.registerTempTable('df')
{code}
Then queries:
Without udf - Spark reads only 10 partitions:
{code}
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
print(time.time() - start)

0.9993703365325928
{code}
With udf Spark reads all the partitions:
{code}
sqlCtx.registerFunction('multiply2', lambda x: x *2 )
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
20').count()
print(time.time() - start)

13.0826096534729
{code}
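
Until the planner handles this, one way to keep the benefit of partition pruning (my own 
workaround sketch, not from the reporter) is to materialize the partition-pruned subset 
first and only then apply the UDF filter:

{code}
# Workaround sketch (assumption): apply only the partition-column predicate first,
# cache the pruned subset, then run the Python UDF filter on top of it.
pruned = sqlCtx.sql('select * from df where id >= 990').cache()
pruned.count()                        # force the scan; only matching partitions are read
pruned.registerTempTable('df_pruned')
sqlCtx.sql('select * from df_pruned where multiply2(value) > 20').count()
{code}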



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11367) Python LinearRegression should support setting solver

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978037#comment-14978037
 ] 

Apache Spark commented on SPARK-11367:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9328

> Python LinearRegression should support setting solver
> -
>
> Key: SPARK-11367
> URL: https://issues.apache.org/jira/browse/SPARK-11367
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Python ML LinearRegression should support setting solver("auto", "normal", 
> "l-bfgs")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11367) Python LinearRegression should support setting solver

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11367:


Assignee: Apache Spark

> Python LinearRegression should support setting solver
> -
>
> Key: SPARK-11367
> URL: https://issues.apache.org/jira/browse/SPARK-11367
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Python ML LinearRegression should support setting solver("auto", "normal", 
> "l-bfgs")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11367) Python LinearRegression should support setting solver

2015-10-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11367:

Description: Python ML LinearRegression should support setting 
solver("auto", "normal", "l-bfgs")  (was: Python LinearRegression should 
support setting solver("auto", "normal", "l-bfgs"))

> Python LinearRegression should support setting solver
> -
>
> Key: SPARK-11367
> URL: https://issues.apache.org/jira/browse/SPARK-11367
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Python ML LinearRegression should support setting solver("auto", "normal", 
> "l-bfgs")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11367) Python LinearRegression should support setting solver

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11367:


Assignee: (was: Apache Spark)

> Python LinearRegression should support setting solver
> -
>
> Key: SPARK-11367
> URL: https://issues.apache.org/jira/browse/SPARK-11367
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Python ML LinearRegression should support setting solver("auto", "normal", 
> "l-bfgs")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11367) Python LinearRegression should support setting solver

2015-10-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11367:

Description: SPARK-10668 has provided WeightedLeastSquares solver("normal") 
in LinearRegression with L2 regularization, Python ML LinearRegression should 
support setting solver("auto", "normal", "l-bfgs")  (was: SPARK-10668 has 
provide WeightedLeastSquares solver("normal") in LinearRegression with L2 
regularization, Python ML LinearRegression should support setting 
solver("auto", "normal", "l-bfgs"))

> Python LinearRegression should support setting solver
> -
>
> Key: SPARK-11367
> URL: https://issues.apache.org/jira/browse/SPARK-11367
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-10668 has provided WeightedLeastSquares solver("normal") in 
> LinearRegression with L2 regularization, Python ML LinearRegression should 
> support setting solver("auto", "normal", "l-bfgs")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11368) Spark scan all partitions when using Python UDF and filter over partitioned column is given

2015-10-28 Thread Maciej Bryński (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11368:
---
Description: 
Hi,
I think this is a huge performance bug.

I created a Parquet file partitioned by a column.
Then I ran a query with a filter over the partition column and a filter with a UDF.
The result is that all partitions are scanned.

Sample data:
{code}
rdd = sc.parallelize(range(0,1000)).map(lambda x: 
Row(id=x%1000,value=x)).repartition(1)
df = sqlCtx.createDataFrame(rdd)
df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
df.registerTempTable('df')
{code}
Then queries:
Without udf - Spark reads only 10 partitions:
{code}
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
print(time.time() - start)

0.9993703365325928

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#22L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#25L])
   Project
Filter (value#5L > 10)
 Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L]

{code}
With udf Spark reads all the partitions:
{code}
sqlCtx.registerFunction('multiply2', lambda x: x *2 )
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
20').count()
print(time.time() - start)

13.0826096534729

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#34L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#37L])
   TungstenProject
Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0))
 !BatchPythonEvaluation PythonUDF#multiply2(value#5L), 
[value#5L,id#6,pythonUDF#33]
  Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6]
{code}

  was:
Hi,
I think this is huge performance bug.

I created parquet file partitioned by column.
Then I make query with filter over partition column and filter with UDF.
Result is that all partition is scanned.

Sample data:
{code}
rdd = sc.parallelize(range(0,1000)).map(lambda x: 
Row(id=x%1000,value=x)).repartition(1)
df = sqlCtx.createDataFrame(rdd)
df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
df.registerTempTable('df')
{code}
Then queries:
Without udf - Spark reads only 10 partitions:
{code}
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
print(time.time() - start)

0.9993703365325928
{code}
With udf Spark reads all the partitions:
{code}
sqlCtx.registerFunction('multiply2', lambda x: x *2 )
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
20').count()
print(time.time() - start)

13.0826096534729
{code}


> Spark scan all partitions when using Python UDF and filter over partitioned 
> column is given
> ---
>
> Key: SPARK-11368
> URL: https://issues.apache.org/jira/browse/SPARK-11368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I think this is a huge performance bug.
> I created a Parquet file partitioned by a column.
> Then I ran a query with a filter over the partition column and a filter with a UDF.
> The result is that all partitions are scanned.
> Sample data:
> {code}
> rdd = sc.parallelize(range(0,1000)).map(lambda x: 
> Row(id=x%1000,value=x)).repartition(1)
> df = sqlCtx.createDataFrame(rdd)
> df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
> df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
> df.registerTempTable('df')
> {code}
> Then queries:
> Without udf - Spark reads only 10 partitions:
> {code}
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
> print(time.time() - start)
> 0.9993703365325928
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#22L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#25L])
>    Project
> Filter (value#5L > 10)
>  Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L]
> {code}
> With udf Spark reads all the partitions:
> {code}
> sqlCtx.registerFunction('multiply2', lambda x: x *2 )
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
> 20').count()
> print(time.time() - start)
> 13.0826096534729
> == Physical Plan ==
> TungstenAggregate(key=[], func

[jira] [Updated] (SPARK-11367) Python LinearRegression should support setting solver

2015-10-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11367:

Description: SPARK-10668 has provide WeightedLeastSquares solver("normal") 
in LinearRegression with L2 regularization, Python ML LinearRegression should 
support setting solver("auto", "normal", "l-bfgs")  (was: Python ML 
LinearRegression should support setting solver("auto", "normal", "l-bfgs"))

> Python LinearRegression should support setting solver
> -
>
> Key: SPARK-11367
> URL: https://issues.apache.org/jira/browse/SPARK-11367
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-10668 has provide WeightedLeastSquares solver("normal") in 
> LinearRegression with L2 regularization, Python ML LinearRegression should 
> support setting solver("auto", "normal", "l-bfgs")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11368) Spark scan all partitions when using Python UDF and filter over partitioned column is given

2015-10-28 Thread Maciej Bryński (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11368:
---
Description: 
Hi,
I think this is a huge performance bug.

I created a Parquet file partitioned by a column.
Then I ran a query with a filter over the partition column and a filter with a UDF.
The result is that all partitions are scanned.

Sample data:
{code}
rdd = sc.parallelize(range(0,1000)).map(lambda x: 
Row(id=x%1000,value=x)).repartition(1)
df = sqlCtx.createDataFrame(rdd)
df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
df.registerTempTable('df')
{code}
Then queries:
Without udf - Spark reads only 10 partitions:
{code}
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
print(time.time() - start)

0.9993703365325928

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#22L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#25L])
   Project
Filter (value#5L > 10)
 Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L]

{code}
With udf Spark reads all the partitions:
{code}
sqlCtx.registerFunction('multiply2', lambda x: x *2 )
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
20').count()
print(time.time() - start)

13.0826096534729

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#34L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#37L])
   TungstenProject
Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0))
 !BatchPythonEvaluation PythonUDF#multiply2(value#5L), 
[value#5L,id#6,pythonUDF#33]
  Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6]
{code}

  was:
Hi,
I think this is huge performance bug.

I created parquet file partitioned by column.
Then I make query with filter over partition column and filter with UDF.
Result is that all partition is scanned.

Sample data:
{code}
rdd = sc.parallelize(range(0,1000)).map(lambda x: 
Row(id=x%1000,value=x)).repartition(1)
df = sqlCtx.createDataFrame(rdd)
df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
df.registerTempTable('df')
{code}
Then queries:
Without udf - Spark reads only 10 partitions:
{code}
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
print(time.time() - start)

0.9993703365325928

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#22L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#25L])
   Project
Filter (value#5L > 10)
 Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L]

{code}
With udf Spark reads all the partitions:
{code}
sqlCtx.registerFunction('multiply2', lambda x: x *2 )
start = time.time()
sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
20').count()
print(time.time() - start)

13.0826096534729

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#34L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#37L])
   TungstenProject
Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0))
 !BatchPythonEvaluation PythonUDF#multiply2(value#5L), 
[value#5L,id#6,pythonUDF#33]
  Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6]
{code}


> Spark scan all partitions when using Python UDF and filter over partitioned 
> column is given
> ---
>
> Key: SPARK-11368
> URL: https://issues.apache.org/jira/browse/SPARK-11368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I think this is a huge performance bug.
> I created a Parquet file partitioned by a column.
> Then I ran a query with a filter over the partition column and a filter with a UDF.
> The result is that all partitions are scanned.
> Sample data:
> {code}
> rdd = sc.parallelize(range(0,1000)).map(lambda x: 
> Row(id=x%1000,value=x)).repartition(1)
> df = sqlCtx.createDataFrame(rdd)
> df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
> df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
> df.registerTempTable('df')
> {code}
> Then queries:
> Without udf - Spark reads only 10 partitio

[jira] [Updated] (SPARK-11367) Python LinearRegression should support setting solver

2015-10-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11367:

Description: SPARK-10668 has provided WeightedLeastSquares solver("normal") 
in LinearRegression with L2 regularization in Scala and R, Python ML 
LinearRegression should also support setting solver("auto", "normal", "l-bfgs") 
 (was: SPARK-10668 has provided WeightedLeastSquares solver("normal") in 
LinearRegression with L2 regularization, Python ML LinearRegression should 
support setting solver("auto", "normal", "l-bfgs"))

> Python LinearRegression should support setting solver
> -
>
> Key: SPARK-11367
> URL: https://issues.apache.org/jira/browse/SPARK-11367
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> SPARK-10668 has provided WeightedLeastSquares solver("normal") in 
> LinearRegression with L2 regularization in Scala and R, Python ML 
> LinearRegression should also support setting solver("auto", "normal", 
> "l-bfgs")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11369) SparkR glm should support setting standardize

2015-10-28 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11369:
---

 Summary: SparkR glm should support setting standardize
 Key: SPARK-11369
 URL: https://issues.apache.org/jira/browse/SPARK-11369
 Project: Spark
  Issue Type: Improvement
  Components: ML, R
Reporter: Yanbo Liang
Priority: Minor


SparkR glm currently supports:
formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0
We should also support setting standardize, which has been defined in the [design 
documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it

2015-10-28 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11370:
---

 Summary: fix a bug in GroupedIterator and create unit test for it
 Key: SPARK-11370
 URL: https://issues.apache.org/jira/browse/SPARK-11370
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978125#comment-14978125
 ] 

Apache Spark commented on SPARK-11370:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9330

> fix a bug in GroupedIterator and create unit test for it
> 
>
> Key: SPARK-11370
> URL: https://issues.apache.org/jira/browse/SPARK-11370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11370:


Assignee: (was: Apache Spark)

> fix a bug in GroupedIterator and create unit test for it
> 
>
> Key: SPARK-11370
> URL: https://issues.apache.org/jira/browse/SPARK-11370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11370:


Assignee: Apache Spark

> fix a bug in GroupedIterator and create unit test for it
> 
>
> Key: SPARK-11370
> URL: https://issues.apache.org/jira/browse/SPARK-11370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-10-28 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978149#comment-14978149
 ] 

Yanbo Liang commented on SPARK-9836:


I will work on it.

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via normal equation solver (SPARK-9834). If some statistics requires 
> additional passes to the dataset, we can expose an option to let users select 
> desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>      Min        1Q    Median        3Q       Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> (Intercept) 2.2514 0.3698   6.089 9.57e-09 ***
> Sepal.Width 0.8036 0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587 0.1121  13.012  < 2e-16 ***
> Speciesvirginica1.9468 0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}
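
For comparison, a rough sketch of how such statistics might eventually surface on the 
Python side; every accessor name below is illustrative only (none of them exist at the 
time of this issue) and `training` is an assumed DataFrame:

{code}
from pyspark.ml.regression import LinearRegression

# Illustrative sketch only: the solver argument and the summary accessors named
# here are hypothetical, showing the kind of R-like output the normal-equation
# solver could expose.
model = LinearRegression(solver="normal").fit(training)
s = model.summary
print(s.coefficientStandardErrors, s.tValues, s.pValues)   # hypothetical names
{code}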



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-10-28 Thread Ted Yu (JIRA)
Ted Yu created SPARK-11371:
--

 Summary: Make "mean" an alias for "avg" operator
 Key: SPARK-11371
 URL: https://issues.apache.org/jira/browse/SPARK-11371
 Project: Spark
  Issue Type: Improvement
Reporter: Ted Yu
Priority: Minor


From Reynold in the thread 'Exception when using some aggregate operators' 
(http://search-hadoop.com/m/q3RTt0xFr22nXB4/):

I don't think these are bugs. The SQL standard for average is "avg", not 
"mean". Similarly, a distinct count is supposed to be written as 
"count(distinct col)", not "countDistinct(col)".

We can, however, make "mean" an alias for "avg" to improve compatibility 
between DataFrame and SQL.
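
For context, a small illustration of the current asymmetry, assuming a DataFrame `df` 
with a numeric `value` column that is also registered as a temp table; the DataFrame 
API already exposes both names, while SQL accepts only the standard spellings:

{code}
from pyspark.sql import functions as F

# The DataFrame API already accepts both spellings:
df.agg(F.avg("value"), F.mean("value")).show()

# SQL accepts the standard forms; the proposal is to additionally allow MEAN
# as an alias for AVG.
sqlContext.sql("SELECT avg(value), count(DISTINCT value) FROM df").show()
{code}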



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-10-28 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-11371:
---
Attachment: spark-11371-v1.patch

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11372) custom UDAF with StringType throws java.lang.ClassCastException

2015-10-28 Thread Pravin Gadakh (JIRA)
Pravin Gadakh created SPARK-11372:
-

 Summary: custom UDAF with StringType throws 
java.lang.ClassCastException
 Key: SPARK-11372
 URL: https://issues.apache.org/jira/browse/SPARK-11372
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Pravin Gadakh
Priority: Minor


Consider the following custom UDAF, which uses one StringType column as the intermediate 
buffer's schema: 

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

  override def inputSchema: StructType =
    StructType(StructField("value", StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = if (buffer.getAs[String](0).isEmpty) {
      input.getAs[String](0)
    } else {
      buffer.getAs[String](0) + "," + input.getAs[String](0)
    }
  }

  override def bufferSchema: StructType = StructType(
    StructField("merge", StringType) :: Nil
  )

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = if (buffer1.getAs[String](0).isEmpty) {
      buffer2.getAs[String](0)
    } else if (!buffer2.getAs[String](0).isEmpty) {
      buffer1.getAs[String](0) + "," + buffer2.getAs[String](0)
    }
  }

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = ""
  }

  override def deterministic: Boolean = true

  override def evaluate(buffer: Row): Any = {
    buffer.getAs[String](0)
  }

  override def dataType: DataType = StringType
}
{code}

Running the UDAF: 
{code}
  val df: DataFrame = sc.parallelize(List("abcd", "efgh", "hijk")).toDF("alpha")
  val merge = new Merge()
  df.groupBy().agg(merge(col("alpha")).as("Merge")).show()
{code}

This throws the following exception: 

{code}
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:373)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:362)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:141)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(

[jira] [Updated] (SPARK-11372) custom UDAF with StringType throws java.lang.ClassCastException

2015-10-28 Thread Pravin Gadakh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Gadakh updated SPARK-11372:
--
Description: 
Consider following custom UDAF which uses one StringType column as intermediate 
buffer's schema: 

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

  override def inputSchema: StructType = StructType(StructField("value", 
StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = if (buffer.getAs[String](0).isEmpty) {
  input.getAs[String](0)
} else {
  buffer.getAs[String](0) + "," + input.getAs[String](0)
}
  }

  override def bufferSchema: StructType = StructType(
StructField("merge", StringType) :: Nil
  )

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = if (buffer1.getAs[String](0).isEmpty) {
  buffer2.getAs[String](0)
} else if (!buffer2.getAs[String](0).isEmpty) {
  buffer1.getAs[String](0) + "," + buffer2.getAs[String](0)
}
  }

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = ""
  }

  override def deterministic: Boolean = true

  override def evaluate(buffer: Row): Any = {
buffer.getAs[String](0)
  }

  override def dataType: DataType = StringType
}
{code}

Running the UDAF: 
{code}
  val df: DataFrame = sc.parallelize(List("abcd", "efgh", "hijk")).toDF("alpha")
  val merge = new Merge()
  df.groupBy().agg(merge(col("alpha")).as("Merge")).show()
{code}

Throw following exception: 

{code}
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:373)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:362)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:141)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
Consider following custom UDAF which uses one StringType column as intermediate 
buffer's schema: 

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

override def inputSchema: StructType = StructType(StructField("value", 
StringType) :: Nil)

override def update(buffer: MutableAggregationBuffer, input: Row): Unit 
= {
buffer(0) = if (buffer.getAs[String](0).isEmpty) {
 

[jira] [Updated] (SPARK-11372) custom UDAF with StringType throws java.lang.ClassCastException

2015-10-28 Thread Pravin Gadakh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Gadakh updated SPARK-11372:
--
Description: 
Consider following custom UDAF which uses one StringType column as intermediate 
buffer's schema: 

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

  override def inputSchema: StructType = StructType(StructField("value", 
StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = if (buffer.getAs[String](0).isEmpty) {
  input.getAs[String](0)
} else {
  buffer.getAs[String](0) + "," + input.getAs[String](0)
}
  }

  override def bufferSchema: StructType = StructType(
StructField("merge", StringType) :: Nil
  )

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = if (buffer1.getAs[String](0).isEmpty) {
  buffer2.getAs[String](0)
} else if (!buffer2.getAs[String](0).isEmpty) {
  buffer1.getAs[String](0) + "," + buffer2.getAs[String](0)
}
  }

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = ""
  }

  override def deterministic: Boolean = true

  override def evaluate(buffer: Row): Any = {
buffer.getAs[String](0)
  }

  override def dataType: DataType = StringType
}
{code}

Running the UDAF: 
{code}
  val df: DataFrame = sc.parallelize(List("abcd", "efgh", "hijk")).toDF("alpha")
  val merge = new Merge()
  df.groupBy().agg(merge(col("alpha")).as("Merge")).show()
{code}

Throw following exception: 

{code}
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
  at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
  at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
  at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:373)
  at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:362)
  at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:141)
  at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
  at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
  at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
  at org.apache.spark.scheduler.Task.run(Task.scala:88)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
{code}

  was:
Consider following custom UDAF which uses one StringType column as intermediate 
buffer's schema: 

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

  override def inputSchema: StructType = StructType(StructField("value", 
StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = if (buffer.getAs[String](0).isEmpty) {
  input.getAs[String](0)
} else {
  buffer.getAs[String](0) + "," + input.getAs[String](0)
}
  }

  override def bufferSchema: StructType = StructType(
StructField("merge", StringType) :: Nil
  )

  ov

[jira] [Updated] (SPARK-11372) custom UDAF with StringType throws java.lang.ClassCastException

2015-10-28 Thread Pravin Gadakh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Gadakh updated SPARK-11372:
--
Description: 
Consider following custom UDAF which uses StringType column as intermediate 
buffer's schema: 

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

  override def inputSchema: StructType = StructType(StructField("value", 
StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = if (buffer.getAs[String](0).isEmpty) {
  input.getAs[String](0)
} else {
  buffer.getAs[String](0) + "," + input.getAs[String](0)
}
  }

  override def bufferSchema: StructType = StructType(
StructField("merge", StringType) :: Nil
  )

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = if (buffer1.getAs[String](0).isEmpty) {
  buffer2.getAs[String](0)
} else if (!buffer2.getAs[String](0).isEmpty) {
  buffer1.getAs[String](0) + "," + buffer2.getAs[String](0)
}
  }

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = ""
  }

  override def deterministic: Boolean = true

  override def evaluate(buffer: Row): Any = {
buffer.getAs[String](0)
  }

  override def dataType: DataType = StringType
}
{code}

Running the UDAF: 
{code}
  val df: DataFrame = sc.parallelize(List("abcd", "efgh", "hijk")).toDF("alpha")
  val merge = new Merge()
  df.groupBy().agg(merge(col("alpha")).as("Merge")).show()
{code}

Throw following exception: 

{code}
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
  at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
  at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
  at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:373)
  at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:362)
  at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:141)
  at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
  at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
  at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
  at org.apache.spark.scheduler.Task.run(Task.scala:88)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
{code}

  was:
Consider following custom UDAF which uses one StringType column as intermediate 
buffer's schema: 

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

  override def inputSchema: StructType = StructType(StructField("value", 
StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = if (buffer.getAs[String](0).isEmpty) {
  input.getAs[String](0)
} else {
  buffer.getAs[String](0) + "," + input.getAs[String](0)
}
  }

  override def bufferSchema: StructType = StructType(
StructField("merge", StringType) :: Nil
  )

  overri

[jira] [Updated] (SPARK-11372) custom UDAF with StringType throws java.lang.ClassCastException

2015-10-28 Thread Pravin Gadakh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravin Gadakh updated SPARK-11372:
--
Description: 
Consider the following custom UDAF, which uses a StringType column as the intermediate 
buffer's schema: 

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

  override def inputSchema: StructType =
    StructType(StructField("value", StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = if (buffer.getAs[String](0).isEmpty) {
      input.getAs[String](0)
    } else {
      buffer.getAs[String](0) + "," + input.getAs[String](0)
    }
  }

  override def bufferSchema: StructType = StructType(
    StructField("merge", StringType) :: Nil
  )

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = if (buffer1.getAs[String](0).isEmpty) {
      buffer2.getAs[String](0)
    } else if (!buffer2.getAs[String](0).isEmpty) {
      buffer1.getAs[String](0) + "," + buffer2.getAs[String](0)
    }
  }

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = ""
  }

  override def deterministic: Boolean = true

  override def evaluate(buffer: Row): Any = {
    buffer.getAs[String](0)
  }

  override def dataType: DataType = StringType
}
{code}

Running the UDAF: 
{code}
  val df: DataFrame = sc.parallelize(List("abcd", "efgh", "hijk")).toDF("alpha")
  val merge = new Merge()
  df.groupBy().agg(merge(col("alpha")).as("Merge")).show()
{code}

Throws the following exception: 

{code}
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
  at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
  at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
  at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:373)
  at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:362)
  at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:141)
  at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
  at org.apache.spark.scheduler.Task.run(Task.scala:88)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
{code}
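
A possible interim workaround, shown here only as a minimal sketch (it reuses the {{df}} and column name from the example above and is not part of the original report), is to avoid the StringType aggregation buffer entirely and do the concatenation on the RDD side:

{code}
// Hedged workaround sketch: bypass the UDAF (and its StringType buffer) by folding on the RDD.
val merged: String = df.select("alpha")
  .rdd                    // RDD[Row]
  .map(_.getString(0))    // extract the column values as plain Strings
  .reduce(_ + "," + _)    // concatenate with commas, like Merge.evaluate would
{code}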

  was:
Consider the following custom UDAF, which uses a StringType column as its intermediate 
buffer schema: 

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

  override def inputSchema: StructType = StructType(StructField("value", 
StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = if (buffer.getAs[String](0).isEmpty) {
  input.getAs[String](0)
} else {
  buffer.getAs[String](0) + "," + input.getAs[String](0)
}
  }

  override def bufferSchema: StructType = StructType(
StructField("merge", StringType) :: Nil
  )

  override 

[jira] [Updated] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private

2015-10-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11332:
--
Assignee: Nakul Jindal

> WeightedLeastSquares should use ml features generic Instance class instead of 
> private
> -
>
> Key: SPARK-11332
> URL: https://issues.apache.org/jira/browse/SPARK-11332
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Assignee: Nakul Jindal
>Priority: Minor
> Fix For: 1.6.0
>
>
> WeightedLeastSquares should use the common Instance class in ml.feature 
> instead of a private one.
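
For context, a minimal sketch of such a shared weighted-instance representation (the field names below are an assumption for illustration, not copied from the actual ml.feature source):

{code}
import org.apache.spark.mllib.linalg.Vector

// Hypothetical sketch: one labeled, weighted feature vector per training example.
case class Instance(label: Double, weight: Double, features: Vector)
{code}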






[jira] [Assigned] (SPARK-11369) SparkR glm should support setting standardize

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11369:


Assignee: Apache Spark

> SparkR glm should support setting standardize
> -
>
> Key: SPARK-11369
> URL: https://issues.apache.org/jira/browse/SPARK-11369
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, R
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> SparkR glm currently supports:
> formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0
> We should also support setting standardize, which is defined in the [design 
> documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]






[jira] [Commented] (SPARK-11369) SparkR glm should support setting standardize

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978256#comment-14978256
 ] 

Apache Spark commented on SPARK-11369:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9331

> SparkR glm should support setting standardize
> -
>
> Key: SPARK-11369
> URL: https://issues.apache.org/jira/browse/SPARK-11369
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, R
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR glm currently supports:
> formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0
> We should also support setting standardize, which is defined in the [design 
> documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]






[jira] [Assigned] (SPARK-11369) SparkR glm should support setting standardize

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11369:


Assignee: (was: Apache Spark)

> SparkR glm should support setting standardize
> -
>
> Key: SPARK-11369
> URL: https://issues.apache.org/jira/browse/SPARK-11369
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, R
>Reporter: Yanbo Liang
>Priority: Minor
>
> SparkR glm currently supports:
> formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0
> We should also support setting standardize, which is defined in the [design 
> documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]






[jira] [Updated] (SPARK-11317) YARN HBase token code shouldn't swallow invocation target exceptions

2015-10-28 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-11317:
---
Affects Version/s: 1.5.1

> YARN HBase token code shouldn't swallow invocation target exceptions
> 
>
> Key: SPARK-11317
> URL: https://issues.apache.org/jira/browse/SPARK-11317
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Steve Loughran
>
> As with SPARK-11265; the HBase token retrieval code of SPARK-6918
> 1. swallows exceptions it should be rethrowing as serious problems (e.g 
> NoSuchMethodException)
> 1. Swallows any exception raised by the HBase client, without even logging 
> the details (it logs that an `InvocationTargetException` was caught, but not 
> the contents)
> As such it is potentially brittle to changes in the HBase client code, and 
> absolutely not going to provide any assistance if HBase won't actually issue 
> tokens to the caller.
> The code in SPARK-11265 can be re-used to provide consistent and better 
> exception processing






[jira] [Updated] (SPARK-11317) YARN HBase token code shouldn't swallow invocation target exceptions

2015-10-28 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-11317:
---
Description: 
As with SPARK-11265, the HBase token retrieval code of SPARK-6918

1. swallows exceptions it should be rethrowing as serious problems (e.g. 
NoSuchMethodException)
1. swallows any exception raised by the HBase client without even logging the 
details (it logs that an `InvocationTargetException` was caught, but not the 
contents)

As such it is potentially brittle to changes in the HBase client code, and 
absolutely not going to provide any assistance if HBase won't actually issue 
tokens to the caller.

The code in SPARK-11265 can be re-used to provide consistent and better 
exception processing.
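
A sketch of the kind of handling being proposed; the class and method names below are illustrative stand-ins, not the actual YARN token code:

{code}
import java.lang.reflect.InvocationTargetException
import org.apache.hadoop.conf.Configuration

// Hedged sketch: rethrow lookup failures and surface the real cause of invocation failures.
def obtainHBaseToken(conf: Configuration): AnyRef =
  try {
    val clazz  = Class.forName("org.apache.hadoop.hbase.security.token.TokenUtil")
    val method = clazz.getMethod("obtainToken", classOf[Configuration])
    method.invoke(null, conf)
  } catch {
    case e: InvocationTargetException =>
      // Surface the wrapped HBase exception, not just the fact that an ITE was caught.
      throw e.getCause
    case e @ (_: ClassNotFoundException | _: NoSuchMethodException) =>
      // A serious problem (e.g. an incompatible HBase on the classpath): rethrow, don't swallow.
      throw e
  }
{code}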

  was:
As with SPARK-11265; the HBase token retrieval code of SPARK-6918

1. swallows exceptions it should be rethrowing as serious problems (e.g 
NoSuchMethodException)
1. Swallows any exception raised by the HBase client, without even logging the 
details (it logs that an `InvocationTargetException` was caught, but not the 
contents)

As such it is potentially brittle to changes in the HDFS client code, and 
absolutely not going to provide any assistance if HBase won't actually issue 
tokens to the caller.

The code in SPARK-11265 can be re-used to provide consistent and better 
exception processing


> YARN HBase token code shouldn't swallow invocation target exceptions
> 
>
> Key: SPARK-11317
> URL: https://issues.apache.org/jira/browse/SPARK-11317
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Steve Loughran
>
> As with SPARK-11265; the HBase token retrieval code of SPARK-6918
> 1. swallows exceptions it should be rethrowing as serious problems (e.g 
> NoSuchMethodException)
> 1. Swallows any exception raised by the HBase client, without even logging 
> the details (it logs that an `InvocationTargetException` was caught, but not 
> the contents)
> As such it is potentially brittle to changes in the HBase client code, and 
> absolutely not going to provide any assistance if HBase won't actually issue 
> tokens to the caller.
> The code in SPARK-11265 can be re-used to provide consistent and better 
> exception processing






[jira] [Created] (SPARK-11373) Add metrics to the History Server and providers

2015-10-28 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-11373:
--

 Summary: Add metrics to the History Server and providers
 Key: SPARK-11373
 URL: https://issues.apache.org/jira/browse/SPARK-11373
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Steve Loughran


The History Server doesn't publish metrics about JVM load or anything from the 
history provider plugins. This means that performance problems from massive job 
histories aren't visible to management tools, nor are any provider-generated 
metrics such as time to load histories, failed history loads, the number of 
connectivity failures talking to remote services, etc.

If the History Server set up a metrics registry and offered the option to 
publish its metrics, then management tools could view this data.

# The metrics registry would need to be passed down to the instantiated 
{{ApplicationHistoryProvider}} in order for it to register its metrics (a sketch 
follows below).
# If the Codahale metrics servlet were registered under a path such as 
{{/metrics}}, the values would be visible as HTML and JSON, without the need 
for management tools.
# Integration tests could also retrieve the JSON-formatted data and use it as 
part of the test suites.
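
A minimal sketch of the registry idea using the Codahale/Dropwizard API directly; the provider shape and metric names are assumptions for illustration, not existing Spark code:

{code}
import com.codahale.metrics.{MetricRegistry, Timer}

// Hedged sketch: a provider handed a MetricRegistry can register and update its own metrics.
class InstrumentedProviderSketch(registry: MetricRegistry) {
  private val listingTime: Timer = registry.timer("history.provider.listing.time")
  private val loadFailures       = registry.counter("history.provider.load.failures")

  def listApplications(): Seq[String] = {
    val ctx = listingTime.time()   // time how long a listing takes
    try {
      Seq.empty                    // placeholder for the real listing work
    } catch {
      case e: Exception =>
        loadFailures.inc()         // count failed history loads
        throw e
    } finally {
      ctx.stop()
    }
  }
}
{code}

The same registry could then be exposed through the Codahale metrics servlet under {{/metrics}} as described above.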






[jira] [Commented] (SPARK-11373) Add metrics to the History Server and providers

2015-10-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978322#comment-14978322
 ] 

Steve Loughran commented on SPARK-11373:


# This has a tangible benefit for the SPARK-1537 YARN ATS binding, because 
connectivity failures, GET performance and the like do surface. There are some 
{{AtomicLong}} counters in its {{YarnHistoryProvider}}, but I'm not planning to 
add counters and metrics until after that is checked in.
# All providers will benefit from the standard JVM performance counters, GC, etc. 
# The FS history provider could also track the time to list and load histories, 
the time of the last refresh, the time to load the most recent history, etc.: 
information needed to identify where an unresponsive UI's problems are coming from.

> Add metrics to the History Server and providers
> ---
>
> Key: SPARK-11373
> URL: https://issues.apache.org/jira/browse/SPARK-11373
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Steve Loughran
>
> The History server doesn't publish metrics about JVM load or anything from 
> the history provider plugins. This means that performance problems from 
> massive job histories aren't visible to management tools, and nor are any 
> provider-generated metrics such as time to load histories, failed history 
> loads, the number of connectivity failures talking to remote services, etc.
> If the history server set up a metrics registry and offered the option to 
> publish its metrics, then management tools could view this data.
> # the metrics registry would need to be passed down to the instantiated 
> {{ApplicationHistoryProvider}}, in order for it to register its metrics.
> # if the codahale metrics servlet were registered under a path such as 
> {{/metrics}}, the values would be visible as HTML and JSON, without the need 
> for management tools.
> # Integration tests could also retrieve the JSON-formatted data and use it as 
> part of the test suites.






[jira] [Created] (SPARK-11374) skip.header.line.count is ignored in HiveContext

2015-10-28 Thread Daniel Haviv (JIRA)
Daniel Haviv created SPARK-11374:


 Summary: skip.header.line.count is ignored in HiveContext
 Key: SPARK-11374
 URL: https://issues.apache.org/jira/browse/SPARK-11374
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Daniel Haviv


A CSV table in Hive is configured to skip the header row using 
TBLPROPERTIES("skip.header.line.count"="1").

When querying from Hive, the header row is not included in the data, but when 
running the same query via HiveContext the header row is returned.

"show create table " via the HiveContext confirms that it is aware of the 
setting.
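
A reproduction sketch along these lines (the table name, schema and path below are made up for illustration, and {{sqlContext}} is assumed to be a HiveContext):

{code}
// Hedged sketch of the reported behaviour.
sqlContext.sql("""
  CREATE EXTERNAL TABLE csv_with_header (name STRING, value INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/tmp/csv_with_header'
  TBLPROPERTIES ("skip.header.line.count"="1")
""")

// Hive skips the first line of each file; via HiveContext the header row shows up in the result.
sqlContext.sql("SELECT * FROM csv_with_header").show()
{code}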






[jira] [Created] (SPARK-11375) History Server "no histories" message to be dynamically generated by ApplicationHistoryProviders

2015-10-28 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-11375:
--

 Summary: History Server "no histories" message to be dynamically 
generated by ApplicationHistoryProviders
 Key: SPARK-11375
 URL: https://issues.apache.org/jira/browse/SPARK-11375
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Steve Loughran
Priority: Minor


When there are no histories, the {{HistoryPage}} displays an error text which 
assumes that the provider is the {{FsHistoryProvider}}, and its sole failure 
mode is "directory not found"

{code}
Did you specify the correct logging directory?
Please verify your setting of spark.history.fs.logDirectory
{code}

Different providers have different failure modes, and even the filesystem 
provider has several, such as an access control exception or the specified 
directory path actually being a file.

If the {{ApplicationHistoryProvider}} were itself asked to provide an error 
message, then it could
* be dynamically generated to show the current state of the history provider
* potentially include any exceptions to list
* display the actual values of settings such as the log directory property.






[jira] [Commented] (SPARK-11375) History Server "no histories" message to be dynamically generated by ApplicationHistoryProviders

2015-10-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978329#comment-14978329
 ] 

Steve Loughran commented on SPARK-11375:


This could be implemented with a new method on {{ApplicationHistoryProvider}}; 
something like

{code}

def getDiagnosticsInfo(): (String, Option[String]) = { ... }
{code}

which would return two pieces of text: a simple message, and an optional block of 
formatted text for insertion into a {{<pre>}} section. That would allow stack 
traces to be displayed readably.

A default implementation would simply return the current message.

Note that the HTML would have to be sanitized before display, with angle 
brackets escaped.
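
A minimal sketch of what a default implementation might look like (a hypothetical stand-in, not the real {{ApplicationHistoryProvider}}; the escaping helper is likewise illustrative):

{code}
object DiagnosticsSketch {

  trait HistoryProviderDiagnostics {
    // Default: the message HistoryPage shows today, with no formatted block.
    def getDiagnosticsInfo(): (String, Option[String]) =
      ("Did you specify the correct logging directory? " +
         "Please verify your setting of spark.history.fs.logDirectory",
       None)
  }

  // Escape ampersands and angle brackets before inserting provider-supplied text into the page.
  def escapeForHtml(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
}
{code}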

> History Server "no histories" message to be dynamically generated by 
> ApplicationHistoryProviders
> 
>
> Key: SPARK-11375
> URL: https://issues.apache.org/jira/browse/SPARK-11375
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Steve Loughran
>Priority: Minor
>
> When there are no histories, the {{HistoryPage}} displays an error text which 
> assumes that the provider is the {{FsHistoryProvider}}, and its sole failure 
> mode is "directory not found"
> {code}
> Did you specify the correct logging directory?
> Please verify your setting of spark.history.fs.logDirectory
> {code}
> Different providers have different failure modes, and even the filesystem 
> provider has some, such as an access control exception, or the specified 
> directly path actually being a file.
> If the {{ApplicationHistoryProvider}} was itself asked to provide an error 
> message, then it could
> * be dynamically generated to show the current state of the history provider
> * potentially include any exceptions to list
> * display the actual values of settings such as the log directory property.






[jira] [Resolved] (SPARK-11313) Implement cogroup

2015-10-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11313.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9324
[https://github.com/apache/spark/pull/9324]

> Implement cogroup
> -
>
> Key: SPARK-11313
> URL: https://issues.apache.org/jira/browse/SPARK-11313
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>







[jira] [Updated] (SPARK-11313) Implement cogroup

2015-10-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11313:
-
Assignee: Wenchen Fan

> Implement cogroup
> -
>
> Key: SPARK-11313
> URL: https://issues.apache.org/jira/browse/SPARK-11313
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-28 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978406#comment-14978406
 ] 

Michael Armbrust commented on SPARK-11303:
--

I picked it into branch-1.5, but I'm not sure if it made the cutoff. [~rxin]?

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
> Fix For: 1.6.0
>
>
> When sampling and then filtering DataFrame from python, we get inconsistent 
> result when not caching the sampled DataFrame. This bug  doesn't appear in 
> spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}
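
The report above is against PySpark; for reference, an equivalent Scala sketch of the same consistency check (a sketch only, assuming {{sc}} and {{sqlContext}} from a shell):

{code}
import sqlContext.implicits._

// Hedged sketch of the quoted reproduction: the three counts should be consistent even without cache().
val d = sc.parallelize(Seq.fill(50)(1) ++ Seq.fill(50)(2)).toDF("t")
val sampled = d.sample(false, 0.1, 1)   // sample without replacement, fraction 0.1, seed 1

println(sampled.count())
println(sampled.filter("t = 1").count())
println(sampled.filter("t != 1").count())
{code}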






[jira] [Comment Edited] (SPARK-11368) Spark scan all partitions when using Python UDF and filter over partitioned column is given

2015-10-28 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978412#comment-14978412
 ] 

Maciej Bryński edited comment on SPARK-11368 at 10/28/15 1:27 PM:
--

The problem exists only when using PySpark.

When I do the same in the Scala console, everything is OK.

{code}
scala> sqlContext.udf.register("multiply2", (x: Long) => x * 2)

scala> time(sqlContext.sql("select * from df where id > 990 and 
multiply2(value) > 20").count())
353633 microseconds

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#32L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#35L])
   Project
Filter (UDF(value#0L) > 20)
 Scan ParquetRelation[file:/mnt/mfs/udf_test][value#0L]
{code}
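
(The {{time(...)}} helper used above is not defined in the comment; a minimal sketch of such a helper, with the name and output format assumed, would be:)

{code}
// Hypothetical helper: run a block and print the elapsed time in microseconds.
def time[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"${(System.nanoTime() - start) / 1000} microseconds")
  result
}
{code}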


was (Author: maver1ck):
Problem exists only when using Pyspark.

When I did the same in Scala console everything is OK.

{code}
scala> sqlContext.udf.register("multiply2", (x: Long) => x * 2

scala> time(sqlContext.sql("select * from df where id > 990 and 
multiply2(value) > 20").count())
353633 microseconds

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#32L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#35L])
   Project
Filter (UDF(value#0L) > 20)
 Scan ParquetRelation[file:/mnt/mfs/udf_test][value#0L]
{code}

> Spark scan all partitions when using Python UDF and filter over partitioned 
> column is given
> ---
>
> Key: SPARK-11368
> URL: https://issues.apache.org/jira/browse/SPARK-11368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I think this is a huge performance bug.
> I created a Parquet file partitioned by a column.
> Then I ran a query with a filter over the partition column and a filter using a UDF.
> The result is that all partitions are scanned.
> Sample data:
> {code}
> rdd = sc.parallelize(range(0,1000)).map(lambda x: 
> Row(id=x%1000,value=x)).repartition(1)
> df = sqlCtx.createDataFrame(rdd)
> df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
> df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
> df.registerTempTable('df')
> {code}
> Then queries:
> Without udf - Spark reads only 10 partitions:
> {code}
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
> print(time.time() - start)
> 0.9993703365325928
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#22L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#25L])
>Project
> Filter (value#5L > 10)
>  Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L]
> {code}
> With udf Spark reads all the partitions:
> {code}
> sqlCtx.registerFunction('multiply2', lambda x: x *2 )
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
> 20').count()
> print(time.time() - start)
> 13.0826096534729
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#34L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#37L])
>TungstenProject
> Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0))
>  !BatchPythonEvaluation PythonUDF#multiply2(value#5L), 
> [value#5L,id#6,pythonUDF#33]
>   Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6]
> {code}






[jira] [Commented] (SPARK-11368) Spark scan all partitions when using Python UDF and filter over partitioned column is given

2015-10-28 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978412#comment-14978412
 ] 

Maciej Bryński commented on SPARK-11368:


The problem exists only when using PySpark.

When I do the same in the Scala console, everything is OK.

{code}
scala> sqlContext.udf.register("multiply2", (x: Long) => x * 2)

scala> time(sqlContext.sql("select * from df where id > 990 and 
multiply2(value) > 20").count())
353633 microseconds

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#32L])
 TungstenExchange SinglePartition
  TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#35L])
   Project
Filter (UDF(value#0L) > 20)
 Scan ParquetRelation[file:/mnt/mfs/udf_test][value#0L]
{code}

> Spark scan all partitions when using Python UDF and filter over partitioned 
> column is given
> ---
>
> Key: SPARK-11368
> URL: https://issues.apache.org/jira/browse/SPARK-11368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I think this is a huge performance bug.
> I created a Parquet file partitioned by a column.
> Then I ran a query with a filter over the partition column and a filter using a UDF.
> The result is that all partitions are scanned.
> Sample data:
> {code}
> rdd = sc.parallelize(range(0,1000)).map(lambda x: 
> Row(id=x%1000,value=x)).repartition(1)
> df = sqlCtx.createDataFrame(rdd)
> df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
> df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
> df.registerTempTable('df')
> {code}
> Then queries:
> Without udf - Spark reads only 10 partitions:
> {code}
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
> print(time.time() - start)
> 0.9993703365325928
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#22L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#25L])
>Project
> Filter (value#5L > 10)
>  Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L]
> {code}
> With udf Spark reads all the partitions:
> {code}
> sqlCtx.registerFunction('multiply2', lambda x: x *2 )
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
> 20').count()
> print(time.time() - start)
> 13.0826096534729
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#34L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#37L])
>TungstenProject
> Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0))
>  !BatchPythonEvaluation PythonUDF#multiply2(value#5L), 
> [value#5L,id#6,pythonUDF#33]
>   Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6]
> {code}






[jira] [Updated] (SPARK-11368) Spark shouldn't scan all partitions when using Python UDF and filter over partitioned column is given

2015-10-28 Thread Maciej Bryński (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11368:
---
Summary: Spark shouldn't scan all partitions when using Python UDF and 
filter over partitioned column is given  (was: Spark scan all partitions when 
using Python UDF and filter over partitioned column is given)

> Spark shouldn't scan all partitions when using Python UDF and filter over 
> partitioned column is given
> -
>
> Key: SPARK-11368
> URL: https://issues.apache.org/jira/browse/SPARK-11368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I think this is a huge performance bug.
> I created a Parquet file partitioned by a column.
> Then I ran a query with a filter over the partition column and a filter using a UDF.
> The result is that all partitions are scanned.
> Sample data:
> {code}
> rdd = sc.parallelize(range(0,1000)).map(lambda x: 
> Row(id=x%1000,value=x)).repartition(1)
> df = sqlCtx.createDataFrame(rdd)
> df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
> df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
> df.registerTempTable('df')
> {code}
> Then queries:
> Without udf - Spark reads only 10 partitions:
> {code}
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
> print(time.time() - start)
> 0.9993703365325928
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#22L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#25L])
>Project
> Filter (value#5L > 10)
>  Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L]
> {code}
> With udf Spark reads all the partitions:
> {code}
> sqlCtx.registerFunction('multiply2', lambda x: x *2 )
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
> 20').count()
> print(time.time() - start)
> 13.0826096534729
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#34L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#37L])
>TungstenProject
> Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0))
>  !BatchPythonEvaluation PythonUDF#multiply2(value#5L), 
> [value#5L,id#6,pythonUDF#33]
>   Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6]
> {code}






[jira] [Assigned] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11371:


Assignee: (was: Apache Spark)

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.
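
For reference, a sketch of what the alias would mean in practice ({{df}}, {{sqlContext}} and table {{t}} are assumed; the DataFrame API already offers both functions, while the SQL {{mean}} line shows the intended behaviour after the change rather than today's):

{code}
import org.apache.spark.sql.functions.{avg, mean}

// DataFrame API: both of these exist today and are equivalent.
df.agg(avg("value"))
df.agg(mean("value"))

// SQL: "avg" works today; "mean" is what the proposed alias would add.
sqlContext.sql("SELECT avg(value) FROM t")
sqlContext.sql("SELECT mean(value) FROM t")
{code}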






[jira] [Commented] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978482#comment-14978482
 ] 

Apache Spark commented on SPARK-11371:
--

User 'ted-yu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9332

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.






[jira] [Created] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11376:
--

 Summary: Invalid generated Java code in GenerateColumnAccessor
 Key: SPARK-11376
 URL: https://issues.apache.org/jira/browse/SPARK-11376
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor


There are two {{mutableRow}} fields in the code generated within the 
{{GenerateColumnAccessor.create()}} method. However, Janino happily accepts 
this code and operates normally. (That's why this bug is marked as minor.)

After some experiments, it turned out that Janino doesn't complain as long as 
all the fields with the same name are null.
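
To illustrate the observation, a minimal sketch compiled through Janino the way Spark's codegen does (acceptance may depend on the Janino version; this is not the actual generated code):

{code}
import org.codehaus.janino.ClassBodyEvaluator

// Two same-named, null-initialized fields, analogous to the duplicated mutableRow fields.
// javac would reject this; per the description above, Janino does not complain.
val evaluator = new ClassBodyEvaluator()
evaluator.cook(
  """
  private Object mutableRow = null;
  private Object mutableRow = null;
  """)
{code}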






[jira] [Created] (SPARK-11377) withNewChildren should not convert StructType to Seq

2015-10-28 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-11377:


 Summary: withNewChildren should not convert StructType to Seq
 Key: SPARK-11377
 URL: https://issues.apache.org/jira/browse/SPARK-11377
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Michael Armbrust
Assignee: Michael Armbrust









[jira] [Assigned] (SPARK-11371) Make "mean" an alias for "avg" operator

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11371:


Assignee: Apache Spark

> Make "mean" an alias for "avg" operator
> ---
>
> Key: SPARK-11371
> URL: https://issues.apache.org/jira/browse/SPARK-11371
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Minor
> Attachments: spark-11371-v1.patch
>
>
> From Reynold in the thread 'Exception when using some aggregate operators'  
> (http://search-hadoop.com/m/q3RTt0xFr22nXB4/):
> I don't think these are bugs. The SQL standard for average is "avg", not 
> "mean". Similarly, a distinct count is supposed to be written as 
> "count(distinct col)", not "countDistinct(col)".
> We can, however, make "mean" an alias for "avg" to improve compatibility 
> between DataFrame and SQL.






[jira] [Assigned] (SPARK-11377) withNewChildren should not convert StructType to Seq

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11377:


Assignee: Michael Armbrust  (was: Apache Spark)

> withNewChildren should not convert StructType to Seq
> 
>
> Key: SPARK-11377
> URL: https://issues.apache.org/jira/browse/SPARK-11377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>







[jira] [Commented] (SPARK-11377) withNewChildren should not convert StructType to Seq

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978505#comment-14978505
 ] 

Apache Spark commented on SPARK-11377:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/9334

> withNewChildren should not convert StructType to Seq
> 
>
> Key: SPARK-11377
> URL: https://issues.apache.org/jira/browse/SPARK-11377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>







[jira] [Assigned] (SPARK-11377) withNewChildren should not convert StructType to Seq

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11377:


Assignee: Apache Spark  (was: Michael Armbrust)

> withNewChildren should not convert StructType to Seq
> 
>
> Key: SPARK-11377
> URL: https://issues.apache.org/jira/browse/SPARK-11377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978509#comment-14978509
 ] 

Apache Spark commented on SPARK-11376:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9335

> Invalid generated Java code in GenerateColumnAccessor
> -
>
> Key: SPARK-11376
> URL: https://issues.apache.org/jira/browse/SPARK-11376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> There are two {{mutableRow}} fields in the generated code within 
> {{GenerateColumnAccessor.create()}} method. However, Janino happily accepted 
> this code and operates normally. (That's why this bug is marked as minor.)
> After some experiments, it turned out that Janino doesn't complain as long as 
> all the fields with the same name are null.






[jira] [Assigned] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11376:


Assignee: Apache Spark  (was: Cheng Lian)

> Invalid generated Java code in GenerateColumnAccessor
> -
>
> Key: SPARK-11376
> URL: https://issues.apache.org/jira/browse/SPARK-11376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Minor
>
> There are two {{mutableRow}} fields in the generated code within 
> {{GenerateColumnAccessor.create()}} method. However, Janino happily accepted 
> this code and operates normally. (That's why this bug is marked as minor.)
> After some experiments, it turned out that Janino doesn't complain as long as 
> all the fields with the same name are null.






[jira] [Assigned] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11376:


Assignee: Cheng Lian  (was: Apache Spark)

> Invalid generated Java code in GenerateColumnAccessor
> -
>
> Key: SPARK-11376
> URL: https://issues.apache.org/jira/browse/SPARK-11376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> There are two {{mutableRow}} fields in the generated code within 
> {{GenerateColumnAccessor.create()}} method. However, Janino happily accepted 
> this code and operates normally. (That's why this bug is marked as minor.)
> After some experiments, it turned out that Janino doesn't complain as long as 
> all the fields with the same name are null.






[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Saif Addin Ellafi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978528#comment-14978528
 ] 

Saif Addin Ellafi commented on SPARK-11330:
---

Hello Cheng Hao, and thank you very much for trying to reproduce. 

You are right, this does not happen on very small data. I started reproducing 
it with a slightly more complex dataset of 100+ rows. This is the smallest 
reproduction I can manage; I am not sure I understand the issue completely.

1. Put the following in a JSON file:

{"mdl_loan_state":"PPD00 ","filemonth_dt":"01NOV2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2014"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"}
{"mdl_loan_state":"CF0   ","filemonth_dt":"01OCT2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2010"}
{"mdl_loan_state":"PPD56 ","filemonth_dt":"01NOV2009"}
{"mdl_loan_state":"PPD3  ","filemonth_dt":"01SEP2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2014"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2014"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2014"}
{"mdl_loan_state":"CF0   ","filemonth_dt":"01NOV2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2014"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01NOV2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2008"}
{"mdl_loan_state":"PPD00 ","filem

[jira] [Updated] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Saif Addin Ellafi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saif Addin Ellafi updated SPARK-11330:
--
Attachment: bug_reproduce.zip

Json dataset folder

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip, bug_reproduce.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, e.g. IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I have no idea whether I used the KRYO serializer when storing this parquet.
> NOTE2: If setting the persist after the filtering, it works fine. But this is 
> not a good enough workaround since any filter operation afterwards will break 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in database 
> and reload it as a new dataframe.
> Query to raw-data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation, the 
> result was that data was inconsistent bringing different results every call.
> Reducing the operation step by step, I got into this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103






[jira] [Created] (SPARK-11378) StreamingContext.awaitTerminationOrTimeout does not return

2015-10-28 Thread Nick Evans (JIRA)
Nick Evans created SPARK-11378:
--

 Summary: StreamingContext.awaitTerminationOrTimeout does not return
 Key: SPARK-11378
 URL: https://issues.apache.org/jira/browse/SPARK-11378
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Affects Versions: 1.5.1
Reporter: Nick Evans
Priority: Minor


The docs for {{StreamingContext.awaitTerminationOrTimeout}} state it will "Return 
`true` if it's stopped; (...) or `false` if the waiting time elapsed before 
returning from the method."

This is currently not the case: the function does not return, and thus any 
logic built on {{awaitTerminationOrTimeout}} will not work.
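
For reference, the documented contract on the Scala side, which the PySpark wrapper is expected to mirror (a sketch only; stream setup is elided and {{sc}} is assumed):

{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hedged sketch: the call should yield a Boolean the caller can branch on.
val ssc = new StreamingContext(sc, Seconds(1))
// ... define input streams and output operations, then ssc.start() ...

val stopped: Boolean = ssc.awaitTerminationOrTimeout(10000L)   // timeout in milliseconds
if (!stopped) {
  // The timeout elapsed first: decide whether to keep waiting or stop the context.
  ssc.stop(stopSparkContext = false)
}
{code}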






[jira] [Updated] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Saif Addin Ellafi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saif Addin Ellafi updated SPARK-11330:
--
Attachment: bug_reproduce.zip

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Stand alone Cluster of five servers (happens as well in 
> local mode). sqlContext instance of HiveContext (happens as well with 
> SQLContext)
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores. Happens as well with 
> other partitioning
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Exception on empty iterator stuff
> This does not happen if using another type of field, e.g. IntType
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I am not sure whether the Kryo serializer was used when this Parquet data 
> was stored.
> NOTE2: If persist is applied after the filtering, it works fine. But this is not a 
> good enough workaround, since any filter operation afterwards will break the 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in a database 
> and reload it as a new dataframe.
> Querying the raw data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation; the result 
> was that the data was inconsistent, giving different results on every call. 
> Reducing the operation step by step, I narrowed it down to this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103
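
For readers who want to try this, a condensed, self-contained Scala sketch of 
the reported behaviour (path and column names are taken from the report above; 
whether it reproduces depends on the data, as discussed in the comments):

{code}
// Requires a sqlContext (HiveContext or SQLContext), e.g. in spark-shell.
val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
val counts = data.groupBy("vintages").count()

// Without persist(): the filter on the StringType column returns a row.
counts.select("vintages").filter("vintages = '2007-01-01'").first()

// With persist(): the same filter reportedly finds no rows (empty iterator).
counts.persist().select("vintages").filter("vintages = '2007-01-01'").first()
{code}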



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11378) StreamingContext.awaitTerminationOrTimeout does not return

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11378:


Assignee: (was: Apache Spark)

> StreamingContext.awaitTerminationOrTimeout does not return
> --
>
> Key: SPARK-11378
> URL: https://issues.apache.org/jira/browse/SPARK-11378
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.1
>Reporter: Nick Evans
>Priority: Minor
>
> The docs for {{StreamingContext.awaitTerminationOrTimeout}} state it will "Return 
> `true` if it's stopped; (...) or `false` if the waiting time elapsed before 
> returning from the method."
> This is currently not the case - the function does not return, and thus any 
> logic built on awaitTerminationOrTimeout will not work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11376:
---
Priority: Major  (was: Minor)

> Invalid generated Java code in GenerateColumnAccessor
> -
>
> Key: SPARK-11376
> URL: https://issues.apache.org/jira/browse/SPARK-11376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> There are two {{mutableRow}} fields in the code generated by the 
> {{GenerateColumnAccessor.create()}} method. However, Janino happily accepts 
> this code, and it operates normally. (That's why this bug is marked as minor.)
> After some experiments, it turned out that Janino doesn't complain as long as 
> all the fields with the same name are null.
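
As a hedged illustration of the experiment described above (not Spark's actual 
code-generation path), a Scala sketch that feeds a class body with two 
identically named, null-initialized fields to Janino's ClassBodyEvaluator; per 
the report, Janino accepts it where javac would not:

{code}
// Sketch only: reproduces the kind of duplicate-field experiment described above.
// Assumes Janino (org.codehaus.janino) is on the classpath, as it is in Spark.
import org.codehaus.janino.ClassBodyEvaluator

val classBody =
  """
    |private Object mutableRow = null;
    |private Object mutableRow = null;   // duplicate field, reportedly tolerated
    |
    |public String hello() { return "compiled"; }
  """.stripMargin

val evaluator = new ClassBodyEvaluator()
evaluator.cook(classBody)                // per the report, this does not fail
println(evaluator.getClazz().getName())
{code}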



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Saif Addin Ellafi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978528#comment-14978528
 ] 

Saif Addin Ellafi edited comment on SPARK-11330 at 10/28/15 2:46 PM:
-

Hello Cheng Hao, and thank you very much for trying to reproduce this. 

You are right, this does not happen on very small data. I started reproducing 
it with a slightly more complex dataset of just over 100 rows. This is the 
smallest I can get it down to; I am not sure I understand the issue completely.

1. Put the following in a JSON file (I have also attached a zip file with the 
dataset, in case copy-paste breaks something):

{"mdl_loan_state":"PPD00 ","filemonth_dt":"01NOV2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2014"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"}
{"mdl_loan_state":"CF0   ","filemonth_dt":"01OCT2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2010"}
{"mdl_loan_state":"PPD56 ","filemonth_dt":"01NOV2009"}
{"mdl_loan_state":"PPD3  ","filemonth_dt":"01SEP2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2014"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2014"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2014"}
{"mdl_loan_state":"CF0   ","filemonth_dt":"01NOV2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2014"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01NOV2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2011"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2013"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2012"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2010"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2008"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2009"}
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2008"}
{"md

[jira] [Assigned] (SPARK-11378) StreamingContext.awaitTerminationOrTimeout does not return

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11378:


Assignee: Apache Spark

> StreamingContext.awaitTerminationOrTimeout does not return
> --
>
> Key: SPARK-11378
> URL: https://issues.apache.org/jira/browse/SPARK-11378
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.1
>Reporter: Nick Evans
>Assignee: Apache Spark
>Priority: Minor
>
> The docs for {{StreamingContext.awaitTerminationOrTimeout}} state it will "Return 
> `true` if it's stopped; (...) or `false` if the waiting time elapsed before 
> returning from the method."
> This is currently not the case - the function does not return, and thus any 
> logic built on awaitTerminationOrTimeout will not work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor

2015-10-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11376:
---
Description: 
There are two {{mutableRow}} fields in the code generated by the 
{{GenerateColumnAccessor.create()}} method. However, Janino happily accepts 
this code and accidentally operates normally.

After some experiments, it turned out that Janino doesn't complain as long as 
all the fields with the same name are null.

  was:
There are two {{mutableRow}} fields in the code generated by the 
{{GenerateColumnAccessor.create()}} method. However, Janino happily accepts 
this code, and it operates normally. (That's why this bug is marked as minor.)

After some experiments, it turned out that Janino doesn't complain as long as 
all the fields with the same name are null.


> Invalid generated Java code in GenerateColumnAccessor
> -
>
> Key: SPARK-11376
> URL: https://issues.apache.org/jira/browse/SPARK-11376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> There are two {{mutableRow}} fields in the code generated by the 
> {{GenerateColumnAccessor.create()}} method. However, Janino happily accepts 
> this code and accidentally operates normally.
> After some experiments, it turned out that Janino doesn't complain as long as 
> all the fields with the same name are null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11378) StreamingContext.awaitTerminationOrTimeout does not return

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978541#comment-14978541
 ] 

Apache Spark commented on SPARK-11378:
--

User 'manygrams' has created a pull request for this issue:
https://github.com/apache/spark/pull/9336

> StreamingContext.awaitTerminationOrTimeout does not return
> --
>
> Key: SPARK-11378
> URL: https://issues.apache.org/jira/browse/SPARK-11378
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.1
>Reporter: Nick Evans
>Priority: Minor
>
> The docs for {{StreamingContext.awaitTerminationOrTimeout}} state it will "Return 
> `true` if it's stopped; (...) or `false` if the waiting time elapsed before 
> returning from the method."
> This is currently not the case - the function does not return, and thus any 
> logic built on awaitTerminationOrTimeout will not work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results

2015-10-28 Thread Saif Addin Ellafi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saif Addin Ellafi updated SPARK-11330:
--
Attachment: (was: bug_reproduce.zip)

> Filter operation on StringType after groupBy PERSISTED brings no results
> 
>
> Key: SPARK-11330
> URL: https://issues.apache.org/jira/browse/SPARK-11330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
> Environment: Standalone cluster of five servers (also happens in 
> local mode). sqlContext is an instance of HiveContext (also happens with 
> SQLContext).
> No special options other than driver memory and executor memory.
> Parquet partitions are 512 where there are 160 cores; also happens with 
> other partitioning.
> Data is nearly 2 billion rows.
>Reporter: Saif Addin Ellafi
>Priority: Blocker
> Attachments: bug_reproduce.zip
>
>
> ONLY HAPPENS WHEN PERSIST() IS CALLED
> val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt")
> data.groupBy("vintages").count.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> res9: org.apache.spark.sql.Row = [2007-01-01]
> data.groupBy("vintages").count.persist.select("vintages").filter("vintages = 
> '2007-01-01'").first
> >>> Throws an exception about an empty iterator
> This does not happen when using another type of field, e.g. IntType:
> data.groupBy("mm").count.persist.select("mm").filter("mm = 
> 200805").first >>> res13: org.apache.spark.sql.Row = [200805]
> NOTE: I am not sure whether the Kryo serializer was used when this Parquet data 
> was stored.
> NOTE2: If persist is applied after the filtering, it works fine. But this is not a 
> good enough workaround, since any filter operation afterwards will break the 
> results.
> NOTE3: I have reproduced the issue with several different datasets.
> NOTE4: The only real workaround is to store the groupBy dataframe in a database 
> and reload it as a new dataframe.
> Querying the raw data works fine:
> data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: 
> org.apache.spark.sql.Row = [2007-01-01]
> Originally, the issue happened with a larger aggregation operation; the result 
> was that the data was inconsistent, giving different results on every call. 
> Reducing the operation step by step, I narrowed it down to this issue.
> In any case, the original operation was:
> val data = sqlContext.read.parquet("/var/Saif/data_pqt")
> val res = data.groupBy("product", "band", "age", "vint", "mb", 
> "mm").agg(count($"account_id").as("N"), 
> sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), 
> sum($"spend").as("spend"), sum($"payment").as("payment"), 
> sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" 
> === 1).as("newacct")).persist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res0: Int = 102
> res.unpersist()
> val z = res.select("vint", "mm").filter("vint = 
> '2007-01-01'").select("mm").distinct.collect
> z.length
> >>> res1: Int = 103



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-10-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9836:
-
Assignee: Yanbo Liang

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users select 
> the desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>       Min        1Q    Median        3Q       Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>                   Estimate Std. Error t value Pr(>|t|)
> (Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
> Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
> Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}
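
For context, a hedged Scala sketch of where such statistics would surface in 
Spark ML: the normal-equation solver already exists via setSolver("normal") 
(SPARK-10668), while the R-style accessors shown in the trailing comments are 
only what this ticket proposes (hypothetical names):

{code}
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setSolver("normal")          // normal-equation solver, prerequisite for these statistics
  .setFeaturesCol("features")
  .setLabelCol("label")

// val model = lr.fit(trainingDF)        // trainingDF: DataFrame with label/features columns
// val summary = model.summary           // training summary
// summary.coefficientStandardErrors     // proposed here, R: Std. Error
// summary.tValues                       // proposed here, R: t value
// summary.pValues                       // proposed here, R: Pr(>|t|)
{code}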



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-10-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978600#comment-14978600
 ] 

Xiangrui Meng commented on SPARK-9836:
--

[~yanboliang] Note that the feature-freeze deadline for 1.6 is the end of the 
month. You can check the implementation in 
https://github.com/AlteryxLabs/sparkGLM, and the unit tests should be verified 
against R's lm/glm. cc [~cafreeman] and Dan Putler.

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users select 
> the desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>       Min        1Q    Median        3Q       Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>                   Estimate Std. Error t value Pr(>|t|)
> (Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
> Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
> Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11337) Make example code in user guide testable

2015-10-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978603#comment-14978603
 ] 

Xiangrui Meng commented on SPARK-11337:
---

How about one sub-task per markdown file? I don't want to create too many 
sub-tasks. This is documentation work, so we can make changes during the QA 
month for 1.6.

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence is 
> not easy to test. It would be nice to test these examples automatically. This JIRA 
> is to discuss options for automating example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
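
The marker-comment syntax is explicitly left open above; as one hypothetical 
illustration (the markers and file content are placeholders, not a decided 
convention), the referenced example file could look like:

{code}
// examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala
package org.apache.spark.examples.ml

object KMeansExample {
  def main(args: Array[String]): Unit = {
    // example on   <- hypothetical marker: start of the block pulled into the guide
    // ... build a DataFrame, fit org.apache.spark.ml.clustering.KMeans, print the result ...
    // example off  <- hypothetical marker: end of the block
  }
}
{code}

The {% include_example scala ml.KMeansExample guide %} tag would then expand to 
exactly the marked region under {% highlight %}.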



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-10-28 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978607#comment-14978607
 ] 

swetha k commented on SPARK-3655:
-

[~koert]
The final output for this RDD is RDD[(String, List[(Long, String)])]. But I 
call updateStateByKey on this RDD. Inside updateStateByKey, I process this list 
and put all the data in a single object, which gets merged with the old state 
for this session. After updateStateByKey, I return objects for the session that 
represent the current batch and the merged state.
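
For reference, a minimal sketch of the updateStateByKey pattern described 
above; the Session type and merge logic are hypothetical, not taken from the 
original code:

{code}
import org.apache.spark.streaming.dstream.DStream

case class Session(events: List[(Long, String)])

// Merge the events of the current batch into the previously stored session state.
def updateSession(batchValues: Seq[List[(Long, String)]],
                  oldState: Option[Session]): Option[Session] = {
  val merged = oldState.getOrElse(Session(Nil)).events ++ batchValues.flatten
  Some(Session(merged))
}

// grouped: DStream[(String, List[(Long, String)])] keyed by sessionId.
// Note: updateStateByKey requires ssc.checkpoint(...) to be set at runtime.
def mergeSessions(grouped: DStream[(String, List[(Long, String)])]): DStream[(String, Session)] =
  grouped.updateStateByKey(updateSession _)
{code}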

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.
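
Until such an API lands, the usual core-Spark workaround is 
repartitionAndSortWithinPartitions with a composite key; a minimal Scala sketch 
(type names, partition count and key layout are illustrative):

{code}
import org.apache.spark.{HashPartitioner, Partitioner, SparkContext}

// Secondary sort: group by sessionId and order values by timestamp within each group.
def secondarySort(sc: SparkContext, events: Seq[(String, Long, String)]) = {
  // Composite key (sessionId, timestamp), but partition on sessionId only so
  // that every record of a session lands in the same partition.
  val keyed = sc.parallelize(events).map { case (session, ts, payload) =>
    ((session, ts), payload)
  }
  val bySession = new HashPartitioner(4)
  val sessionOnlyPartitioner = new Partitioner {
    override def numPartitions: Int = bySession.numPartitions
    override def getPartition(key: Any): Int = key match {
      case (session: String, _) => bySession.getPartition(session)
    }
  }
  // Sorting uses the implicit Ordering on (String, Long): by session, then by timestamp.
  keyed.repartitionAndSortWithinPartitions(sessionOnlyPartitioner)
}
{code}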



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11379) ExpressionEncoder can't handle top level primitive type correctly

2015-10-28 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11379:
---

 Summary: ExpressionEncoder can't handle top level primitive type 
correctly
 Key: SPARK-11379
 URL: https://issues.apache.org/jira/browse/SPARK-11379
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

2015-10-28 Thread swetha k (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978610#comment-14978610
 ] 

swetha k commented on SPARK-3655:
-

[~koert]

If I don't keep the list materialized in memory, what is the appropriate way to 
use Spark-Sorted to just group and sort the batch of JSONs by the key 
(sessionId)?

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>
> Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? 
> There are some use cases where getting a sorted iterator of values per key is 
> helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11380) Replace example code in mllib-frequent-pattern-mining.md using include_example

2015-10-28 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-11380:
-

 Summary: Replace example code in mllib-frequent-pattern-mining.md 
using include_example
 Key: SPARK-11380
 URL: https://issues.apache.org/jira/browse/SPARK-11380
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng


This is similar to SPARK-11289 but for the example code in 
mllib-frequent-pattern-mining.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11337) Make example code in user guide testable

2015-10-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978614#comment-14978614
 ] 

Xiangrui Meng commented on SPARK-11337:
---

[~yinxusen] I created one sub-task (SPARK-11380) as a template. You can create 
more and remember to label them `starter`.

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence is 
> not easy to test. It would be nice to test these examples automatically. This JIRA 
> is to discuss options for automating example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11379) ExpressionEncoder can't handle top level primitive type correctly

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11379:


Assignee: (was: Apache Spark)

> ExpressionEncoder can't handle top level primitive type correctly
> -
>
> Key: SPARK-11379
> URL: https://issues.apache.org/jira/browse/SPARK-11379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11379) ExpressionEncoder can't handle top level primitive type correctly

2015-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978616#comment-14978616
 ] 

Apache Spark commented on SPARK-11379:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9337

> ExpressionEncoder can't handle top level primitive type correctly
> -
>
> Key: SPARK-11379
> URL: https://issues.apache.org/jira/browse/SPARK-11379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver

2015-10-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978615#comment-14978615
 ] 

Xiangrui Meng commented on SPARK-9836:
--

Sorry for the late response! There are more starter tasks coming out under 
SPARK-11337. Are you interested?

> Provide R-like summary statistics for ordinary least squares via normal 
> equation solver
> ---
>
> Key: SPARK-9836
> URL: https://issues.apache.org/jira/browse/SPARK-9836
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> In R, model fitting comes with summary statistics. We can provide most of 
> those via the normal equation solver (SPARK-9834). If some statistics require 
> additional passes over the dataset, we can expose an option to let users select 
> the desired statistics before model fitting. 
> {code}
> > summary(model)
> Call:
> glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
> Deviance Residuals: 
>       Min        1Q    Median        3Q       Max  
> -1.30711  -0.25713  -0.05325   0.19542   1.41253  
> Coefficients:
>                   Estimate Std. Error t value Pr(>|t|)
> (Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
> Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
> Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
> Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
> ---
> Signif. codes:  
> 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> (Dispersion parameter for gaussian family taken to be 0.1918059)
> Null deviance: 102.168  on 149  degrees of freedom
> Residual deviance:  28.004  on 146  degrees of freedom
> AIC: 183.94
> Number of Fisher Scoring iterations: 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11379) ExpressionEncoder can't handle top level primitive type correctly

2015-10-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11379:


Assignee: Apache Spark

> ExpressionEncoder can't handle top level primitive type correctly
> -
>
> Key: SPARK-11379
> URL: https://issues.apache.org/jira/browse/SPARK-11379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11369) SparkR glm should support setting standardize

2015-10-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11369:
--
Target Version/s: 1.6.0

> SparkR glm should support setting standardize
> -
>
> Key: SPARK-11369
> URL: https://issues.apache.org/jira/browse/SPARK-11369
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, R
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> SparkR glm currently supports:
> formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0
> We should also support setting standardize, which has been defined in the [design 
> documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]
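
For context, the underlying Scala estimator already exposes this switch; a 
minimal sketch of the JVM-side parameter that SparkR's glm would pass through 
(the R-side argument name is the ticket's proposal):

{code}
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setStandardization(false)   // whether to standardize features before fitting (default: true)
  .setRegParam(0.0)
  .setElasticNetParam(0.0)
{code}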



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11369) SparkR glm should support setting standardize

2015-10-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11369:
--
Assignee: Yanbo Liang

> SparkR glm should support setting standardize
> -
>
> Key: SPARK-11369
> URL: https://issues.apache.org/jira/browse/SPARK-11369
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, R
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> SparkR glm currently supports:
> formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0
> We should also support setting standardize, which has been defined in the [design 
> documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11369) SparkR glm should support setting standardize

2015-10-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11369.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9331
[https://github.com/apache/spark/pull/9331]

> SparkR glm should support setting standardize
> -
>
> Key: SPARK-11369
> URL: https://issues.apache.org/jira/browse/SPARK-11369
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, R
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> SparkR glm currently supports:
> formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0
> We should also support setting standardize, which has been defined in the [design 
> documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11367) Python LinearRegression should support setting solver

2015-10-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11367.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9328
[https://github.com/apache/spark/pull/9328]

> Python LinearRegression should support setting solver
> -
>
> Key: SPARK-11367
> URL: https://issues.apache.org/jira/browse/SPARK-11367
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> SPARK-10668 provided the WeightedLeastSquares solver ("normal") for 
> LinearRegression with L2 regularization in Scala and R. Python ML 
> LinearRegression should also support setting the solver ("auto", "normal", 
> "l-bfgs").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


