[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977874#comment-14977874 ] Dibyendu Bhattacharya commented on SPARK-11045:
---
Hi [~tdas], let me know what your comment on this JIRA is. Do you think this should continue to be part of Spark Packages, or would it be good to include it in the Spark project so that the community gets a reliable and better alternative to the Direct API (if at all someone does not want the Direct API)?

> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
> Issue Type: New Feature
> Components: Streaming
> Reporter: Dibyendu Bhattacharya
>
> This JIRA tracks the progress of contributing the Receiver-based Low Level Kafka Consumer from spark-packages (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the Apache Spark project.
> This Kafka consumer has been around for more than a year and has matured over time. I see there are many adoptions of this package, and I receive positive feedback that this consumer gives better performance and fault-tolerance capabilities.
> The primary intent of this JIRA is to give the community a better alternative if they want to use the Receiver-based model.
> If this consumer makes it into Spark Core, it will definitely see more adoption and support from the community and help many who still prefer the Receiver-based model of Kafka consumer.
> I understand the Direct Stream is the consumer which can give Exactly-Once semantics and uses the Kafka Low Level API, which is good. But the Direct Stream has concerns around recovering checkpoints on driver code change. Application developers need to manage their own offsets, which is complex. And even if someone does manage their own offsets, it limits the parallelism Spark Streaming can achieve.
> If someone wants more parallelism and sets spark.streaming.concurrentJobs to more than 1, you can no longer rely on storing offsets externally, as you have no control over which batch will run in which sequence.
> Furthermore, the Direct Stream has higher latency, as it fetches messages from Kafka during the RDD action. Also, the number of RDD partitions is limited to the number of topic partitions, so unless your Kafka topic has enough partitions, you have limited parallelism during RDD processing.
> Due to the above-mentioned concerns, many people who do not want Exactly-Once semantics still prefer the Receiver-based model. Unfortunately, when customers fall back to the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, there are other issues around the reliability of the Kafka High Level API: it is buggy and has serious issues around consumer re-balance. Hence I do not think it is correct to advise people to use KafkaUtils.createStream in production.
> The better option at present is to use the consumer from spark-packages. It uses the Kafka Low Level Consumer API, stores offsets in ZooKeeper, and can recover from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of blocks, i.e., by controlling the total size of the messages pulled from Kafka, whereas in Spark rate limiting is done by controlling the number of messages. The issue with throttling by number of messages is that if message sizes vary, block size will also vary. Say your Kafka topic has messages of different sizes, from 10 KB to 500 KB; throttling by number of messages can never give a deterministic block size, so there is no guarantee that memory back-pressure can really take effect.
> 3. This consumer uses the Kafka low-level API, which gives better performance than the KafkaUtils.createStream-based High Level API.
> 4. This consumer can give an end-to-end no-data-loss channel if enabled with the WAL.
> By accepting this low-level Kafka consumer from spark-packages into the Apache Spark project, we will give the community better options for Kafka connectivity, both for the receiver-less and Receiver-based models.
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
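The PID-based, size-aware rate limiting described in points 1 and 2 of the comment above can be sketched as follows. This is a hypothetical illustration, not the actual kafka-spark-consumer implementation; the class name `PIDRateController`, the gain values, and the byte bounds are all invented for the example.

```python
# Hypothetical sketch of size-based rate limiting with a PID controller,
# illustrating points 1 and 2 of the comment above (NOT the actual
# kafka-spark-consumer code; all names and constants are invented).

class PIDRateController:
    """Adjusts the number of BYTES fetched per batch so that batch
    processing time tracks the batch interval, independent of how
    large or small individual messages are."""

    def __init__(self, target_batch_ms, kp=1.0, ki=0.2, kd=0.0,
                 min_bytes=64 * 1024, max_bytes=8 * 1024 * 1024):
        self.target = target_batch_ms
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_bytes, self.max_bytes = min_bytes, max_bytes
        self.integral = 0.0
        self.prev_error = 0.0

    def next_fetch_size(self, current_bytes, processing_ms):
        # Positive error: the batch finished early, so fetch more bytes.
        # Negative error: the batch overran its interval, so fetch fewer.
        error = (self.target - processing_ms) / self.target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        adjustment = (self.kp * error + self.ki * self.integral
                      + self.kd * derivative)
        new_size = int(current_bytes * (1.0 + adjustment))
        # Clamp so one wild measurement cannot starve or flood the block.
        return max(self.min_bytes, min(self.max_bytes, new_size))
```

Because the controller throttles bytes rather than message counts, a stream mixing 10 KB and 500 KB messages still produces blocks of roughly deterministic size, which is the property the comment argues count-based throttling cannot guarantee.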
[jira] [Updated] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json
[ https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-10517:
---
Affects Version/s: (was: 1.5.0) 1.5.1

> Console "Output" field is empty when using DataFrameWriter.json
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.5.1
> Reporter: Maciej Bryński
> Priority: Minor
> Attachments: screenshot-1.png
>
> On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json.
> It should be the size of the written bytes.
> Screenshot attached.
[jira] [Commented] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json
[ https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977890#comment-14977890 ] Maciej Bryński commented on SPARK-10517:
---
No. I think the "Output" field is always empty.

> Console "Output" field is empty when using DataFrameWriter.json
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.5.1
> Reporter: Maciej Bryński
> Priority: Minor
> Attachments: screenshot-1.png
>
> On the HTTP application UI, the "Output" field is empty when using DataFrameWriter.json.
> It should be the size of the written bytes.
> Screenshot attached.
[jira] [Commented] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private
[ https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977891#comment-14977891 ] Sean Owen commented on SPARK-11332: --- They need to be a Contributor in JIRA. I can add this person. If you end up regularly adding new people we can make you an admin. > WeightedLeastSquares should use ml features generic Instance class instead of > private > - > > Key: SPARK-11332 > URL: https://issues.apache.org/jira/browse/SPARK-11332 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Priority: Minor > > WeightedLeastSquares should use the common Instance class in ml.feature > instead of a private one.
[jira] [Commented] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private
[ https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977893#comment-14977893 ] DB Tsai commented on SPARK-11332:
---
Thanks. Please help me add him as a contributor.

> WeightedLeastSquares should use ml features generic Instance class instead of private
>
> Key: SPARK-11332
> URL: https://issues.apache.org/jira/browse/SPARK-11332
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: holdenk
> Priority: Minor
>
> WeightedLeastSquares should use the common Instance class in ml.feature instead of a private one.
[jira] [Created] (SPARK-11365) consolidate aggregates for summary statistics in weighted least squares
holdenk created SPARK-11365:
---
Summary: consolidate aggregates for summary statistics in weighted least squares
Key: SPARK-11365
URL: https://issues.apache.org/jira/browse/SPARK-11365
Project: Spark
Issue Type: Improvement
Reporter: holdenk
Priority: Minor

Right now we duplicate some aggregates in the aggregator; we could simplify this a bit.
[jira] [Resolved] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private
[ https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-11332. - Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9325 [https://github.com/apache/spark/pull/9325] > WeightedLeastSquares should use ml features generic Instance class instead of > private > - > > Key: SPARK-11332 > URL: https://issues.apache.org/jira/browse/SPARK-11332 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Priority: Minor > Fix For: 1.6.0 > > > WeightedLeastSquares should use the common Instance class in ml.feature > instead of a private one.
[jira] [Commented] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5
[ https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977940#comment-14977940 ] Apache Spark commented on SPARK-11246: -- User 'xwu0226' has created a pull request for this issue: https://github.com/apache/spark/pull/9326 > [1.5] Table cache for Parquet broken in 1.5 > --- > > Key: SPARK-11246 > URL: https://issues.apache.org/jira/browse/SPARK-11246 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: David Ross > > Since upgrading to 1.5.1, using the {{CACHE TABLE}} works great for all > tables except for parquet tables, likely related to the parquet native reader. > Here are steps for parquet table: > {code} > create table test_parquet stored as parquet as select 1; > explain select * from test_parquet; > {code} > With output: > {code} > == Physical Plan == > Scan > ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#141] > {code} > And then caching: > {code} > cache table test_parquet; > explain select * from test_parquet; > {code} > With output: > {code} > == Physical Plan == > Scan > ParquetRelation[hdfs://192.168.99.9/user/hive/warehouse/test_parquet][_c0#174] > {code} > Note it isn't cached. I have included spark log output for the {{cache > table}} and {{explain}} statements below. 
> --- > Here's the same for a non-parquet table: > {code} > cache table test_no_parquet; > explain select * from test_no_parquet; > {code} > With output: > {code} > == Physical Plan == > HiveTableScan [_c0#210], (MetastoreRelation default, test_no_parquet, None) > {code} > And then caching: > {code} > cache table test_no_parquet; > explain select * from test_no_parquet; > {code} > With output: > {code} > == Physical Plan == > InMemoryColumnarTableScan [_c0#229], (InMemoryRelation [_c0#229], true, > 1, StorageLevel(true, true, false, true, 1), (HiveTableScan [_c0#211], > (MetastoreRelation default, test_no_parquet, None)), Some(test_no_parquet)) > {code} > Note that the table seems to be cached. > --- > Note that if the flag {{spark.sql.hive.convertMetastoreParquet}} is set to > {{false}}, parquet tables work the same as non-parquet tables with caching. > This is a reasonable workaround for us, but ideally, we would like to benefit > from the native reading. > --- > Spark logs for {{cache table}} for {{test_parquet}}: > {code} > 15/10/21 21:22:05 INFO thriftserver.SparkExecuteStatementOperation: Running > query 'cache table test_parquet' with 20ee2ab9-5242-4783-81cf-46115ed72610 > 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default > tbl=test_parquet > 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant > ip=unknown-ip-addr cmd=get_table : db=default tbl=test_parquet > 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: Opening raw store with > implemenation class:org.apache.hadoop.hive.metastore.ObjectStore > 15/10/21 21:22:05 INFO metastore.ObjectStore: ObjectStore, initialize called > 15/10/21 21:22:05 INFO DataNucleus.Query: Reading in results for query > "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is > closing > 15/10/21 21:22:05 INFO metastore.MetaStoreDirectSql: Using direct SQL, > underlying DB is MYSQL > 15/10/21 21:22:05 INFO metastore.ObjectStore: Initialized ObjectStore > 15/10/21 21:22:05 INFO 
storage.MemoryStore: ensureFreeSpace(215680) called > with curMem=4196713, maxMem=139009720 > 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59 stored as > values in memory (estimated size 210.6 KB, free 128.4 MB) > 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(20265) called > with curMem=4412393, maxMem=139009720 > 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_59_piece0 stored > as bytes in memory (estimated size 19.8 KB, free 128.3 MB) > 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Added broadcast_59_piece0 in > memory on 192.168.99.9:50262 (size: 19.8 KB, free: 132.2 MB) > 15/10/21 21:22:05 INFO spark.SparkContext: Created broadcast 59 from run at > AccessController.java:-2 > 15/10/21 21:22:05 INFO metastore.HiveMetaStore: 49: get_table : db=default > tbl=test_parquet > 15/10/21 21:22:05 INFO HiveMetaStore.audit: ugi=vagrant > ip=unknown-ip-addr cmd=get_table : db=default tbl=test_parquet > 15/10/21 21:22:05 INFO storage.MemoryStore: ensureFreeSpace(215680) called > with curMem=4432658, maxMem=139009720 > 15/10/21 21:22:05 INFO storage.MemoryStore: Block broadcast_60 stored as > values in memory (estimated size 210.6 KB, free 128.1 MB) > 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_58_piece0 > on 192.168.99.9:50262 in memory (size: 19.8 KB, free: 132.2 MB) > 15/10/21 21:22:05 INFO storage.BlockManagerInfo: Removed broadcast_57_piece0 > on 192
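The manual check performed in the report above — running EXPLAIN and reading the physical plan — can be scripted. The helper below is hypothetical (not part of Spark); it simply looks for the in-memory scan nodes that appear in EXPLAIN output once a table is actually cached:

```python
# Hypothetical helper (not part of Spark): decide from EXPLAIN output
# whether a table scan was replaced by an in-memory cached relation.

def is_cached_plan(explain_output: str) -> bool:
    """A cached table's physical plan shows an InMemoryColumnarTableScan
    (backed by an InMemoryRelation) instead of a bare ParquetRelation or
    HiveTableScan node."""
    return ("InMemoryColumnarTableScan" in explain_output
            or "InMemoryRelation" in explain_output)
```

Applied to the plans quoted above, the parquet table's plan (`Scan ParquetRelation[...]`) fails the check even after CACHE TABLE, while the non-parquet table's plan passes — exactly the bug being reported.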
[jira] [Assigned] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5
[ https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11246:
---
Assignee: Apache Spark

> [1.5] Table cache for Parquet broken in 1.5
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: David Ross
> Assignee: Apache Spark
[jira] [Assigned] (SPARK-11246) [1.5] Table cache for Parquet broken in 1.5
[ https://issues.apache.org/jira/browse/SPARK-11246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11246:
---
Assignee: (was: Apache Spark)

> [1.5] Table cache for Parquet broken in 1.5
> Key: SPARK-11246
> URL: https://issues.apache.org/jira/browse/SPARK-11246
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: David Ross
[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977961#comment-14977961 ] Cheng Lian commented on SPARK-11103:
---
Quoted from my reply on the user list:

For 1: This one is pretty simple and safe; I'd like to have this for 1.5.2, or 1.5.3 if we can't make it for 1.5.2.

For 2: Actually we only need to calculate the intersection of all file schemata. We can make ParquetRelation.mergeSchemaInParallel return two StructTypes: the first one is the original merged schema, the other is the intersection of all file schemata, which only contains fields that exist in all file schemata. Then we decide which filters to push down according to the second StructType.

For 3: The idea I came up with at first was similar to this one. Instead of pulling all file schemata to the driver side, we can push the filter push-down code to the executor side: namely, pass candidate filters to the executor side and compute the Parquet filter predicates according to the individual file schema. I haven't looked into this direction in depth, but we can probably put this part into CatalystReadSupport, which is now initialized on the executor side. However, the correctness of this approach can only be guaranteed by the defensive filtering we do in Spark SQL (i.e., applying all the filters whether or not they are pushed down), but we are considering removing it because it imposes unnecessary performance cost. This makes me hesitant to go this way.

From my side, I think this is a bug in Parquet. Parquet was designed to support schema evolution. When scanning a Parquet file, if a column exists in the requested schema but is missing in the file schema, that column is filled with null. This should also hold for pushed-down filter predicates. For example, if the filter "a = 1" is pushed down but column "a" doesn't exist in the Parquet file being scanned, it's safe to assume "a" is null in all records and drop all of them.
On the contrary, if "a IS NULL" is pushed down, all records should be preserved. Filed PARQUET-389 to track this issue.

> Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Dominic Ricard
>
> When evolving a schema in Parquet files, Spark properly exposes all columns found in the different Parquet files, but when trying to query the data, it is not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following stack trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not found in schema!
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaC
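The schema-intersection approach from option 2 in the comment above can be sketched with plain Python dicts standing in for Parquet file schemata. This is a hypothetical illustration with invented helper names (`merge_schemata`, `pushable_filters`); Spark's real code would work with StructType.

```python
# Hypothetical sketch of the schema-intersection idea from option 2:
# only filters referencing columns present in EVERY file's schema are
# safe to push down. Schemata are modeled as simple name -> type dicts
# rather than Spark's StructType.

def merge_schemata(file_schemata):
    """Return (merged, intersection): the union of all fields, and the
    fields present in every file."""
    merged = {}
    for schema in file_schemata:
        merged.update(schema)
    common = set(file_schemata[0])
    for schema in file_schemata[1:]:
        common &= set(schema)
    intersection = {name: merged[name] for name in common}
    return merged, intersection

def pushable_filters(filters, intersection):
    """Keep only (column, op, value) filters whose column exists in all
    files; the rest must be evaluated after the scan instead."""
    return [(col, op, val) for (col, op, val) in filters
            if col in intersection]
```

With the files from the repro above — one file containing only col1, another containing col1 and col2 — only the col1 filter would be pushed down, avoiding the `Column [col2] was not found in schema!` failure.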
[jira] [Updated] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103:
---
Assignee: Hyukjin Kwon

> Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: Dominic Ricard
> Assignee: Hyukjin Kwon
[jira] [Comment Edited] (SPARK-11103) Filter applied on merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977961#comment-14977961 ] Cheng Lian edited comment on SPARK-11103 at 10/28/15 8:32 AM:
--
Quoted from my reply on the user list:

For 1: This one is pretty simple and safe; I'd like to have this for 1.5.2, or 1.5.3 if we can't make it for 1.5.2.

For 2: I'd like to have this for Spark master. Actually, we only need to calculate the intersection of all file schemata. We can make ParquetRelation.mergeSchemaInParallel return two StructTypes: the first is the original merged schema, and the other is the intersection of all file schemata, which contains only fields that exist in all of them. We then decide which filters to push down according to the second StructType.

For 3: The idea I came up with at first was similar to this one. Instead of pulling all file schemata to the driver side, we can push the filter push-down code to the executor side: pass candidate filters to the executors, and compute the Parquet filter predicates according to each individual file schema. I haven't looked into this direction in depth, but we could probably put this part into CatalystReadSupport, which is now initialized on the executor side. However, correctness of this approach can only be guaranteed by the defensive filtering we do in Spark SQL (i.e., applying all filters whether or not they were pushed down), and we are considering removing that because it imposes unnecessary performance cost. This makes me hesitant to go down this path.

From my side, I think this is a bug in Parquet. Parquet was designed to support schema evolution. When scanning a Parquet file, if a column exists in the requested schema but is missing from the file schema, that column is filled with null. This should also hold for pushed-down filter predicates.
For example, if the filter "a = 1" is pushed down but column "a" doesn't exist in the Parquet file being scanned, it's safe to assume "a" is null in all records and drop all of them. On the contrary, if "a IS NULL" is pushed down, all records should be preserved. Filed PARQUET-389 to track this issue.
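The intersection idea in option 2 above can be sketched in plain Python. This is a hypothetical model of the decision logic, not the actual ParquetRelation code: a filter is safe to push down only if every file being scanned actually contains the column it references.

```python
# Hypothetical model of the "intersection of all file schemata" idea:
# a filter may be pushed down only if every Parquet file being scanned
# contains the column it references.

def schema_intersection(file_schemas):
    """Fields present in every file schema."""
    common = set(file_schemas[0])
    for schema in file_schemas[1:]:
        common &= set(schema)
    return common

def pushable_filters(filters, file_schemas):
    """Split candidate filters into (pushed down, evaluated after scan)."""
    common = schema_intersection(file_schemas)
    pushed = [(col, op, val) for (col, op, val) in filters if col in common]
    residual = [f for f in filters if f not in pushed]
    return pushed, residual

# table1 was written without col2; table2 has both columns.
schemas = [["col1"], ["col1", "col2"]]
filters = [("col1", "=", 1), ("col2", "=", 2)]
pushed, residual = pushable_filters(filters, schemas)
# pushed   -> [("col1", "=", 1)]
# residual -> [("col2", "=", 2)]  (applied by Spark after reading)
```

In this model the "col2 = 2" predicate from the bug report is never handed to Parquet's SchemaCompatibilityValidator, so the IllegalArgumentException cannot occur; the residual filter is evaluated on the rows after they are read (with the missing column filled with null).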
[jira] [Updated] (SPARK-11103) Filter applied on merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11103:
--- Target Version/s: 1.5.3, 1.6.0 (was: 1.6.0)
[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977969#comment-14977969 ] Saisai Shao commented on SPARK-2089:
Hi [~pwendell], [~mridulm80], [~sandyr] and [~lianhuiwang], I still think enabling this feature would be quite useful. For many workloads, such as batch workloads where dynamic allocation is not necessarily used, there is currently no way to provide locality hints for the AM to allocate containers; this can hurt performance significantly if data is fetched remotely, especially in a large YARN cluster. So I'm inclined to revive this feature, but perhaps in a different way (the current way is very Hadoop-specific and cannot work in yarn-client mode); the locality computation could also reuse the implementation from SPARK-4352. I'd like to spend some time taking a crack at this issue. What are your opinions and concerns? Is it still necessary to address this issue, given that it has been broken for many versions? Any comment is greatly appreciated; thanks a lot.

> With YARN, preferredNodeLocalityData isn't honored
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.0.0
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
> Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when constructing a Spark context that will dictate where to request executor containers.
> This is currently broken because of a race condition. The Spark-YARN code runs the user class and waits for it to start up a SparkContext. During its initialization, the SparkContext will create a YarnClusterScheduler, which notifies a monitor in the Spark-YARN code that initialization is under way. The Spark-YARN code then immediately fetches the preferredNodeLocationData from the SparkContext and uses it to start requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, setting preferredNodeLocationData comes after the rest of the initialization, so, if the Spark-YARN code comes around quickly enough after being notified, the data that's fetched is the empty unset version. This occurred during all of my runs.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
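The initialization-order race described above can be modeled without YARN at all. The names below are hypothetical, a sketch of the ordering bug rather than the actual Spark-YARN code: the monitor is notified before the locality field is assigned, so a reader that reacts immediately sees the empty default.

```python
# Hypothetical sketch of the initialization-order bug: the "monitor"
# is notified before preferred_node_locality_data is assigned, so a
# reader that reacts immediately sees the empty default.

class Monitor:
    def __init__(self):
        self.seen = None

    def notify(self, ctx):
        # The Spark-YARN side fetches the locality data as soon as it
        # is notified, which in this model is always too early.
        self.seen = dict(ctx.preferred_node_locality_data)

class SparkContextModel:
    def __init__(self, locality_data, monitor):
        self.preferred_node_locality_data = {}   # default, empty
        monitor.notify(self)                     # scheduler created -> notify (too early)
        self.preferred_node_locality_data = locality_data  # assigned afterwards

monitor = Monitor()
SparkContextModel({"host1": 3, "host2": 2}, monitor)
# monitor.seen -> {}  (the unset version was read)
```

The fix direction implied by the description is simply to assign the locality data before (or atomically with) the notification.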
[jira] [Assigned] (SPARK-11103) Filter applied on merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11103:
Assignee: Apache Spark (was: Hyukjin Kwon)
[jira] [Assigned] (SPARK-11103) Filter applied on merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11103:
Assignee: Hyukjin Kwon (was: Apache Spark)
[jira] [Commented] (SPARK-11103) Filter applied on merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977976#comment-14977976 ] Apache Spark commented on SPARK-11103:
-- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/9327
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978000#comment-14978000 ] Koert Kuipers commented on SPARK-3655:
-- spark-sorted (https://github.com/tresata/spark-sorted) allows you to process your data in a way similar to what you did below, but without materializing the sorted data as a list in memory. Yes, it uses a custom partitioner, and yes, it always shuffles the data. Are you sure the shuffle is your issue? I assume your final output is not the sorted list? What do you do with the sorted list after the steps shown below? If your final output is RDD[(String, List[(Long, String)])], then there is no way around materializing the list in memory, and spark-sorted will not give you any benefit over what you did below.

> Support sorting of values in addition to keys (i.e. secondary sort)
> ---
>
> Key: SPARK-3655
> URL: https://issues.apache.org/jira/browse/SPARK-3655
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 1.1.0, 1.2.0
> Reporter: koert kuipers
> Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.
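The "sorted iterator of values per key" idea can be illustrated in plain Python. This is a conceptual sketch of secondary sort, not the spark-sorted or Spark API: sort once by (key, value), then group by key, so each key yields its values in order and a consumer can stream them instead of materializing a list per key.

```python
from itertools import groupby
from operator import itemgetter

# Conceptual model of secondary sort: one global sort by (key, value),
# then group by key; each group's values come out already sorted.

def secondary_sort(pairs):
    """Yield (key, iterator of sorted values) for each key, in key order."""
    ordered = sorted(pairs, key=lambda kv: (kv[0], kv[1]))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, (v for _, v in group)

pairs = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]
result = {k: list(vs) for k, vs in secondary_sort(pairs)}
# result -> {"a": [1, 2, 3], "b": [1, 2]}
```

This mirrors Koert's point: the values arrive as a sorted stream per key, and only collecting them into a list (as in the final dict here) forces them into memory.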
[jira] [Created] (SPARK-11366) binary functions (methods) such as rlike in pyspark.sql.column only accept strings but should also accept another Column
Jose Antonio created SPARK-11366:
Summary: binary functions (methods) such as rlike in pyspark.sql.column only accept strings but should also accept another Column
Key: SPARK-11366 URL: https://issues.apache.org/jira/browse/SPARK-11366 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0, 1.6.0 Environment: All. Reporter: Jose Antonio

Binary functions (methods) such as rlike in pyspark.sql.column only accept strings but should also accept another Column. For instance, df.mycolumn.rlike(df.mycolumn2) gives the error "Method rlike([class org.apache.spark.sql.Column]) does not exist", but df.mycolumn.rlike('test') is OK.
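The requested behavior, a regex match where the pattern itself comes from another column, can be illustrated row by row in plain Python. This is a conceptual sketch of the desired semantics, not the PySpark API:

```python
import re

# Conceptual sketch of column-vs-column rlike semantics: for each row,
# match the value column against a regex pattern read from another column.

rows = [
    {"mycolumn": "spark-1.5.1", "mycolumn2": r"spark-\d+\.\d+\.\d+"},
    {"mycolumn": "hadoop-2.6",  "mycolumn2": r"spark-.*"},
]

def rlike_column(rows, value_col, pattern_col):
    """Per-row rlike where the pattern comes from another column."""
    return [re.search(row[pattern_col], row[value_col]) is not None
            for row in rows]

matches = rlike_column(rows, "mycolumn", "mycolumn2")
# matches -> [True, False]
```

This is exactly the per-row evaluation df.mycolumn.rlike(df.mycolumn2) would need; the string-only overload cannot express it because the pattern varies per row.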
[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures
[ https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978007#comment-14978007 ] Sun Rui commented on SPARK-11167:
- It seems time-consuming and undesirable to go through the entire input data to infer the tightest type or to check whether heterogeneous values exist. I agree with [~zero323] that it is practical to issue a warning in such a case. Users should be responsible for keeping the input data homogeneous; otherwise, a runtime exception on type mismatch may be thrown by Spark.

> Incorrect type resolution on heterogeneous data structures
> ---
>
> Key: SPARK-11167
> URL: https://issues.apache.org/jira/browse/SPARK-11167
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.6.0
> Reporter: Maciej Szymkiewicz
>
> If a structure contains heterogeneous values, the type of the first encountered element is incorrectly assigned as the type of the whole structure. This problem affects both lists:
> {code}
> SparkR:::infer_type(list(a=1, b="a"))
> ## [1] "array"
> SparkR:::infer_type(list(a="a", b=1))
> ## [1] "array"
> {code}
> and environments:
> {code}
> SparkR:::infer_type(as.environment(list(a=1, b="a")))
> ## [1] "map"
> SparkR:::infer_type(as.environment(list(a="a", b=1)))
> ## [1] "map"
> {code}
> This results in errors during data collection and other operations on DataFrames:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$foo <- list(list("1", 2), list(3, 4))
> sdf <- createDataFrame(sqlContext, ldf)
> collect(sdf)
> ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
> ## scala.MatchError: 2.0 (of class java.lang.Double)
> ## ...
> {code}
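The trade-off Sun Rui describes, cheap first-element inference versus a full scan for the tightest common type, can be sketched in plain Python. This is a hypothetical model, not SparkR's infer_type:

```python
# Hypothetical model: first-element type inference (cheap, silently wrong
# on heterogeneous data) vs a full scan for the tightest common type.

def infer_first(values):
    """Infer the element type from the first element only (O(1))."""
    return type(values[0]).__name__

def infer_tightest(values):
    """Scan all elements; widen to 'string' if types disagree (O(n))."""
    types = {type(v).__name__ for v in values}
    return types.pop() if len(types) == 1 else "string"

mixed = [1.0, "a"]
first = infer_first(mixed)       # wrong for the second element
tightest = infer_tightest(mixed) # correct, but requires a full pass
```

The full pass is what the comment deems too costly for large input, which is why a warning plus a documented homogeneity requirement is suggested instead.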
[jira] [Created] (SPARK-11367) Python LinearRegression should support setting solver
Yanbo Liang created SPARK-11367: --- Summary: Python LinearRegression should support setting solver Key: SPARK-11367 URL: https://issues.apache.org/jira/browse/SPARK-11367 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Python LinearRegression should support setting solver ("auto", "normal", "l-bfgs")
[jira] [Created] (SPARK-11368) Spark scans all partitions when using a Python UDF and a filter over the partitioned column is given
Maciej Bryński created SPARK-11368: -- Summary: Spark scan all partitions when using Python UDF and filter over partitioned column is given Key: SPARK-11368 URL: https://issues.apache.org/jira/browse/SPARK-11368 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Maciej Bryński Priority: Critical Hi, I think this is a huge performance bug. I created a Parquet file partitioned by a column. Then I run a query with a filter on the partition column and a filter that uses a UDF. The result is that all partitions are scanned. Sample data: {code} rdd = sc.parallelize(range(0,1000)).map(lambda x: Row(id=x%1000,value=x)).repartition(1) df = sqlCtx.createDataFrame(rdd) df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') df = sqlCtx.read.parquet('/mnt/mfs/udf_test') df.registerTempTable('df') {code} Then queries: Without the UDF, Spark reads only 10 partitions: {code} start = time.time() sqlCtx.sql('select * from df where id >= 990 and value > 10').count() print(time.time() - start) 0.9993703365325928 {code} With the UDF, Spark reads all the partitions: {code} sqlCtx.registerFunction('multiply2', lambda x: x *2 ) start = time.time() sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 20').count() print(time.time() - start) 13.0826096534729 {code}
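What partition pruning should do with this query can be sketched as a hypothetical mini-planner (this is an illustration of the expected behaviour, not Catalyst's actual code): split the conjunctive filter, push down the conjuncts that reference only partition columns and contain no opaque calls such as a Python UDF, and keep the rest as a residual filter. The reported bug behaves as if the whole conjunction were treated as one opaque predicate. Predicates here are modelled as `(referenced_columns, has_udf)` pairs, an assumed encoding.

```python
def split_conjuncts(predicates, partition_cols):
    """Split filter conjuncts into (pushable, residual).

    A conjunct is pushable for partition pruning only if it touches
    partition columns alone and contains no opaque UDF calls.
    """
    pushable, residual = [], []
    for cols, has_udf in predicates:
        if not has_udf and set(cols) <= set(partition_cols):
            pushable.append((cols, has_udf))
        else:
            residual.append((cols, has_udf))
    return pushable, residual
```

Under this model the query's `id >= 990` conjunct stays pushable even though `multiply2(value) > 20` is not, so only 10 partitions would be read. As a practical workaround, one might apply the partition-column filter in a separate step before the UDF filter, though whether that restores pruning in the affected versions is not confirmed here.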
[jira] [Commented] (SPARK-11367) Python LinearRegression should support setting solver
[ https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978037#comment-14978037 ] Apache Spark commented on SPARK-11367: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/9328 > Python LinearRegression should support setting solver > - > > Key: SPARK-11367 > URL: https://issues.apache.org/jira/browse/SPARK-11367 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Python ML LinearRegression should support setting solver("auto", "normal", > "l-bfgs") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11367) Python LinearRegression should support setting solver
[ https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11367: Assignee: Apache Spark > Python LinearRegression should support setting solver > - > > Key: SPARK-11367 > URL: https://issues.apache.org/jira/browse/SPARK-11367 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > Python ML LinearRegression should support setting solver("auto", "normal", > "l-bfgs") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11367) Python LinearRegression should support setting solver
[ https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11367: Description: Python ML LinearRegression should support setting solver("auto", "normal", "l-bfgs") (was: Python LinearRegression should support setting solver("auto", "normal", "l-bfgs")) > Python LinearRegression should support setting solver > - > > Key: SPARK-11367 > URL: https://issues.apache.org/jira/browse/SPARK-11367 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Python ML LinearRegression should support setting solver("auto", "normal", > "l-bfgs") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11367) Python LinearRegression should support setting solver
[ https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11367: Assignee: (was: Apache Spark) > Python LinearRegression should support setting solver > - > > Key: SPARK-11367 > URL: https://issues.apache.org/jira/browse/SPARK-11367 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Python ML LinearRegression should support setting solver("auto", "normal", > "l-bfgs") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11367) Python LinearRegression should support setting solver
[ https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11367: Description: SPARK-10668 has provided WeightedLeastSquares solver("normal") in LinearRegression with L2 regularization, Python ML LinearRegression should support setting solver("auto", "normal", "l-bfgs") (was: SPARK-10668 has provide WeightedLeastSquares solver("normal") in LinearRegression with L2 regularization, Python ML LinearRegression should support setting solver("auto", "normal", "l-bfgs")) > Python LinearRegression should support setting solver > - > > Key: SPARK-11367 > URL: https://issues.apache.org/jira/browse/SPARK-11367 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > SPARK-10668 has provided WeightedLeastSquares solver("normal") in > LinearRegression with L2 regularization, Python ML LinearRegression should > support setting solver("auto", "normal", "l-bfgs") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11368) Spark scan all partitions when using Python UDF and filter over partitioned column is given
[ https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-11368: --- Description: Hi, I think this is huge performance bug. I created parquet file partitioned by column. Then I make query with filter over partition column and filter with UDF. Result is that all partition is scanned. Sample data: {code} rdd = sc.parallelize(range(0,1000)).map(lambda x: Row(id=x%1000,value=x)).repartition(1) df = sqlCtx.createDataFrame(rdd) df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') df = sqlCtx.read.parquet('/mnt/mfs/udf_test') df.registerTempTable('df') {code} Then queries: Without udf - Spark reads only 10 partitions: {code} start = time.time() sqlCtx.sql('select * from df where id >= 990 and value > 10').count() print(time.time() - start) 0.9993703365325928 == Physical Plan == TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#22L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#25L]) Project Filter (value#5L > 10) Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L] {code} With udf Spark reads all the partitions: {code} sqlCtx.registerFunction('multiply2', lambda x: x *2 ) start = time.time() sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 20').count() print(time.time() - start) 13.0826096534729 == Physical Plan == TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#34L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#37L]) TungstenProject Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0)) !BatchPythonEvaluation PythonUDF#multiply2(value#5L), [value#5L,id#6,pythonUDF#33] Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6] {code} was: Hi, I think this is huge 
performance bug. I created parquet file partitioned by column. Then I make query with filter over partition column and filter with UDF. Result is that all partition is scanned. Sample data: {code} rdd = sc.parallelize(range(0,1000)).map(lambda x: Row(id=x%1000,value=x)).repartition(1) df = sqlCtx.createDataFrame(rdd) df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') df = sqlCtx.read.parquet('/mnt/mfs/udf_test') df.registerTempTable('df') {code} Then queries: Without udf - Spark reads only 10 partitions: {code} start = time.time() sqlCtx.sql('select * from df where id >= 990 and value > 10').count() print(time.time() - start) 0.9993703365325928 {code} With udf Spark reads all the partitions: {code} sqlCtx.registerFunction('multiply2', lambda x: x *2 ) start = time.time() sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 20').count() print(time.time() - start) 13.0826096534729 {code} > Spark scan all partitions when using Python UDF and filter over partitioned > column is given > --- > > Key: SPARK-11368 > URL: https://issues.apache.org/jira/browse/SPARK-11368 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Maciej Bryński >Priority: Critical > > Hi, > I think this is huge performance bug. > I created parquet file partitioned by column. > Then I make query with filter over partition column and filter with UDF. > Result is that all partition is scanned. 
> Sample data: > {code} > rdd = sc.parallelize(range(0,1000)).map(lambda x: > Row(id=x%1000,value=x)).repartition(1) > df = sqlCtx.createDataFrame(rdd) > df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') > df = sqlCtx.read.parquet('/mnt/mfs/udf_test') > df.registerTempTable('df') > {code} > Then queries: > Without udf - Spark reads only 10 partitions: > {code} > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and value > 10').count() > print(time.time() - start) > 0.9993703365325928 > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#22L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#25L]) >Project > Filter (value#5L > 10) > Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L] > {code} > With udf Spark reads all the partitions: > {code} > sqlCtx.registerFunction('multiply2', lambda x: x *2 ) > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > > 20').count() > print(time.time() - start) > 13.0826096534729 > == Physical Plan == > TungstenAggregate(key=[], func
[jira] [Updated] (SPARK-11367) Python LinearRegression should support setting solver
[ https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11367: Description: SPARK-10668 has provide WeightedLeastSquares solver("normal") in LinearRegression with L2 regularization, Python ML LinearRegression should support setting solver("auto", "normal", "l-bfgs") (was: Python ML LinearRegression should support setting solver("auto", "normal", "l-bfgs")) > Python LinearRegression should support setting solver > - > > Key: SPARK-11367 > URL: https://issues.apache.org/jira/browse/SPARK-11367 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > SPARK-10668 has provide WeightedLeastSquares solver("normal") in > LinearRegression with L2 regularization, Python ML LinearRegression should > support setting solver("auto", "normal", "l-bfgs") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11368) Spark scan all partitions when using Python UDF and filter over partitioned column is given
[ https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-11368: --- Description: Hi, I think this is huge performance bug. I created parquet file partitioned by column. Then I make query with filter over partition column and filter with UDF. Result is that all partition are scanned. Sample data: {code} rdd = sc.parallelize(range(0,1000)).map(lambda x: Row(id=x%1000,value=x)).repartition(1) df = sqlCtx.createDataFrame(rdd) df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') df = sqlCtx.read.parquet('/mnt/mfs/udf_test') df.registerTempTable('df') {code} Then queries: Without udf - Spark reads only 10 partitions: {code} start = time.time() sqlCtx.sql('select * from df where id >= 990 and value > 10').count() print(time.time() - start) 0.9993703365325928 == Physical Plan == TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#22L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#25L]) Project Filter (value#5L > 10) Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L] {code} With udf Spark reads all the partitions: {code} sqlCtx.registerFunction('multiply2', lambda x: x *2 ) start = time.time() sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 20').count() print(time.time() - start) 13.0826096534729 == Physical Plan == TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#34L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#37L]) TungstenProject Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0)) !BatchPythonEvaluation PythonUDF#multiply2(value#5L), [value#5L,id#6,pythonUDF#33] Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6] {code} was: Hi, I think this is huge 
performance bug. I created parquet file partitioned by column. Then I make query with filter over partition column and filter with UDF. Result is that all partition is scanned. Sample data: {code} rdd = sc.parallelize(range(0,1000)).map(lambda x: Row(id=x%1000,value=x)).repartition(1) df = sqlCtx.createDataFrame(rdd) df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') df = sqlCtx.read.parquet('/mnt/mfs/udf_test') df.registerTempTable('df') {code} Then queries: Without udf - Spark reads only 10 partitions: {code} start = time.time() sqlCtx.sql('select * from df where id >= 990 and value > 10').count() print(time.time() - start) 0.9993703365325928 == Physical Plan == TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#22L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#25L]) Project Filter (value#5L > 10) Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L] {code} With udf Spark reads all the partitions: {code} sqlCtx.registerFunction('multiply2', lambda x: x *2 ) start = time.time() sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 20').count() print(time.time() - start) 13.0826096534729 == Physical Plan == TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#34L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#37L]) TungstenProject Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0)) !BatchPythonEvaluation PythonUDF#multiply2(value#5L), [value#5L,id#6,pythonUDF#33] Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6] {code} > Spark scan all partitions when using Python UDF and filter over partitioned > column is given > --- > > Key: SPARK-11368 > URL: https://issues.apache.org/jira/browse/SPARK-11368 > Project: Spark > Issue Type: Bug > Components: PySpark, 
SQL >Reporter: Maciej Bryński >Priority: Critical > > Hi, > I think this is huge performance bug. > I created parquet file partitioned by column. > Then I make query with filter over partition column and filter with UDF. > Result is that all partition are scanned. > Sample data: > {code} > rdd = sc.parallelize(range(0,1000)).map(lambda x: > Row(id=x%1000,value=x)).repartition(1) > df = sqlCtx.createDataFrame(rdd) > df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') > df = sqlCtx.read.parquet('/mnt/mfs/udf_test') > df.registerTempTable('df') > {code} > Then queries: > Without udf - Spark reads only 10 partitio
[jira] [Updated] (SPARK-11367) Python LinearRegression should support setting solver
[ https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11367: Description: SPARK-10668 has provided WeightedLeastSquares solver("normal") in LinearRegression with L2 regularization in Scala and R, Python ML LinearRegression should also support setting solver("auto", "normal", "l-bfgs") (was: SPARK-10668 has provided WeightedLeastSquares solver("normal") in LinearRegression with L2 regularization, Python ML LinearRegression should support setting solver("auto", "normal", "l-bfgs")) > Python LinearRegression should support setting solver > - > > Key: SPARK-11367 > URL: https://issues.apache.org/jira/browse/SPARK-11367 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > SPARK-10668 has provided WeightedLeastSquares solver("normal") in > LinearRegression with L2 regularization in Scala and R, Python ML > LinearRegression should also support setting solver("auto", "normal", > "l-bfgs") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11369) SparkR glm should support setting standardize
Yanbo Liang created SPARK-11369: --- Summary: SparkR glm should support setting standardize Key: SPARK-11369 URL: https://issues.apache.org/jira/browse/SPARK-11369 Project: Spark Issue Type: Improvement Components: ML, R Reporter: Yanbo Liang Priority: Minor SparkR glm currently supports: formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0. We should also support setting standardize, which is defined in the [design documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]
[jira] [Created] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it
Wenchen Fan created SPARK-11370: --- Summary: fix a bug in GroupedIterator and create unit test for it Key: SPARK-11370 URL: https://issues.apache.org/jira/browse/SPARK-11370 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan
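The ticket gives no detail about the bug itself, but the structure GroupedIterator provides, iterating over key-sorted rows one group at a time, can be sketched with `itertools.groupby`. The pitfall noted in the comment is a classic hazard of this pattern generally, not a confirmed description of the SPARK-11370 bug.

```python
from itertools import groupby

def grouped(rows, key):
    """Yield (key, rows_in_group) over input already sorted by key.

    Materializing each group into a list sidesteps the classic groupby
    pitfall: a group's sub-iterator is invalidated as soon as the next
    group is requested from the outer iterator.
    """
    for k, grp in groupby(rows, key=key):
        yield k, list(grp)
```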
[jira] [Commented] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it
[ https://issues.apache.org/jira/browse/SPARK-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978125#comment-14978125 ] Apache Spark commented on SPARK-11370: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/9330 > fix a bug in GroupedIterator and create unit test for it > > > Key: SPARK-11370 > URL: https://issues.apache.org/jira/browse/SPARK-11370 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it
[ https://issues.apache.org/jira/browse/SPARK-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11370: Assignee: (was: Apache Spark) > fix a bug in GroupedIterator and create unit test for it > > > Key: SPARK-11370 > URL: https://issues.apache.org/jira/browse/SPARK-11370 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11370) fix a bug in GroupedIterator and create unit test for it
[ https://issues.apache.org/jira/browse/SPARK-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11370: Assignee: Apache Spark > fix a bug in GroupedIterator and create unit test for it > > > Key: SPARK-11370 > URL: https://issues.apache.org/jira/browse/SPARK-11370 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978149#comment-14978149 ] Yanbo Liang commented on SPARK-9836: I will work on it. > Provide R-like summary statistics for ordinary least squares via normal > equation solver > --- > > Key: SPARK-9836 > URL: https://issues.apache.org/jira/browse/SPARK-9836 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng > > In R, model fitting comes with summary statistics. We can provide most of > those via the normal equation solver (SPARK-9834). If some statistics require > additional passes over the dataset, we can expose an option to let users select > desired statistics before model fitting. > {code} > > summary(model) > Call: > glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris) > Deviance Residuals: > Min 1Q Median 3Q Max > -1.30711 -0.25713 -0.05325 0.19542 1.41253 > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 2.2514 0.3698 6.089 9.57e-09 *** > Sepal.Width 0.8036 0.1063 7.557 4.19e-12 *** > Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 *** > Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 *** > --- > Signif. codes: > 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > (Dispersion parameter for gaussian family taken to be 0.1918059) > Null deviance: 102.168 on 149 degrees of freedom > Residual deviance: 28.004 on 146 degrees of freedom > AIC: 183.94 > Number of Fisher Scoring iterations: 2 > {code}
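The core statistics requested (estimates, standard errors, t-values) fall out of the normal equations directly. A minimal sketch for the single-predictor-plus-intercept case, in plain Python rather than the actual MLlib solver, which works with the general system X'X beta = X'y:

```python
import math

def ols_summary(x, y):
    """R-style OLS summary for y ~ x via the normal equations (sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                        # slope from the normal equations
    b0 = my - b1 * mx                     # intercept
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    df = n - 2                            # residual degrees of freedom
    s2 = sum(r * r for r in resid) / df   # dispersion estimate (sigma^2)
    se1 = math.sqrt(s2 / sxx)             # std. error of the slope
    se0 = math.sqrt(s2 * (1.0 / n + mx * mx / sxx))
    return {"estimate": (b0, b1),
            "std_error": (se0, se1),
            "t_value": (b0 / se0, b1 / se1),
            "df": df}
```

Statistics such as deviance and AIC in the R output above would need the residual pass as well, which is why the ticket suggests letting users select desired statistics before fitting.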
[jira] [Created] (SPARK-11371) Make "mean" an alias for "avg" operator
Ted Yu created SPARK-11371: -- Summary: Make "mean" an alias for "avg" operator Key: SPARK-11371 URL: https://issues.apache.org/jira/browse/SPARK-11371 Project: Spark Issue Type: Improvement Reporter: Ted Yu Priority: Minor From Reynold in the thread 'Exception when using some aggregate operators' (http://search-hadoop.com/m/q3RTt0xFr22nXB4/): I don't think these are bugs. The SQL standard for average is "avg", not "mean". Similarly, a distinct count is supposed to be written as "count(distinct col)", not "countDistinct(col)". We can, however, make "mean" an alias for "avg" to improve compatibility between DataFrame and SQL.
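The aliasing proposed here amounts to one entry in the function registry pointing at an existing implementation. A hypothetical registry sketch (not Spark's actual FunctionRegistry API):

```python
def avg(values):
    """Single shared implementation of the SQL-standard average."""
    return sum(values) / len(values)

# Both names resolve to the same callable: "mean" is an alias,
# not a second aggregate, so DataFrame code and SQL text agree.
FUNCTIONS = {"avg": avg}
FUNCTIONS["mean"] = FUNCTIONS["avg"]
```

The design point is that the alias shares the implementation object, so behaviour can never drift between the two names.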
[jira] [Updated] (SPARK-11371) Make "mean" an alias for "avg" operator
[ https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-11371: --- Attachment: spark-11371-v1.patch > Make "mean" an alias for "avg" operator > --- > > Key: SPARK-11371 > URL: https://issues.apache.org/jira/browse/SPARK-11371 > Project: Spark > Issue Type: Improvement >Reporter: Ted Yu >Priority: Minor > Attachments: spark-11371-v1.patch > > > From Reynold in the thread 'Exception when using some aggregate operators' > (http://search-hadoop.com/m/q3RTt0xFr22nXB4/): > I don't think these are bugs. The SQL standard for average is "avg", not > "mean". Similarly, a distinct count is supposed to be written as > "count(distinct col)", not "countDistinct(col)". > We can, however, make "mean" an alias for "avg" to improve compatibility > between DataFrame and SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11372) custom UDAF with StringType throws java.lang.ClassCastException
Pravin Gadakh created SPARK-11372: - Summary: custom UDAF with StringType throws java.lang.ClassCastException Key: SPARK-11372 URL: https://issues.apache.org/jira/browse/SPARK-11372 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Pravin Gadakh Priority: Minor Consider the following custom UDAF, which uses one StringType column as the intermediate buffer's schema: {code:title=Merge.scala|borderStyle=solid} class Merge extends UserDefinedAggregateFunction { override def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil) override def update(buffer: MutableAggregationBuffer, input: Row): Unit = { buffer(0) = if (buffer.getAs[String](0).isEmpty) { input.getAs[String](0) } else { buffer.getAs[String](0) + "," + input.getAs[String](0) } } override def bufferSchema: StructType = StructType( StructField("merge", StringType) :: Nil ) override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { buffer1(0) = if (buffer1.getAs[String](0).isEmpty) { buffer2.getAs[String](0) } else if (!buffer2.getAs[String](0).isEmpty) { buffer1.getAs[String](0) + "," + buffer2.getAs[String](0) } } override def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = "" } override def deterministic: Boolean = true override def evaluate(buffer: Row): Any = { buffer.getAs[String](0) } override def dataType: DataType = StringType } {code} Running the UDAF: {code} val df: DataFrame = sc.parallelize(List("abcd", "efgh", "hijk")).toDF("alpha") val merge = new Merge() df.groupBy().agg(merge(col("alpha")).as("Merge")).show() {code} Throws the following exception: {code} java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247) at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:373) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:362) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:141) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$10.next(Iterator.scala:312) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at 
org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(
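The intended behaviour of the reported Merge aggregate can be expressed as a plain-Python sketch of the UDAF lifecycle (initialize/update/merge/evaluate), independent of the internal UTF8String buffer representation that the ClassCastException points at. This illustrates what the aggregate is supposed to compute, not a fix for the bug.

```python
class Merge:
    """Pure-Python sketch of the Merge UDAF's contract from the report."""

    def initialize(self):
        # Empty-string initial buffer, mirroring buffer(0) = "".
        return ""

    def update(self, buf, value):
        # Append the input value, comma-separated once the buffer is non-empty.
        return value if not buf else buf + "," + value

    def merge(self, b1, b2):
        # Combine two partial buffers, skipping empty sides.
        if not b1:
            return b2
        if not b2:
            return b1
        return b1 + "," + b2

    def evaluate(self, buf):
        return buf
```

Run over the report's input ("abcd", "efgh", "hijk"), this lifecycle yields "abcd,efgh,hijk", the result the Spark job would show if the string buffer round-tripped correctly.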
[jira] [Updated] (SPARK-11372) custom UDAF with StringType throws java.lang.ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-11372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Gadakh updated SPARK-11372: -- Description: Consider following custom UDAF which uses one StringType column as intermediate buffer's schema: {code:title=Merge.scala|borderStyle=solid} class Merge extends UserDefinedAggregateFunction { override def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil) override def update(buffer: MutableAggregationBuffer, input: Row): Unit = { buffer(0) = if (buffer.getAs[String](0).isEmpty) { input.getAs[String](0) } else { buffer.getAs[String](0) + "," + input.getAs[String](0) } } override def bufferSchema: StructType = StructType( StructField("merge", StringType) :: Nil ) override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { buffer1(0) = if (buffer1.getAs[String](0).isEmpty) { buffer2.getAs[String](0) } else if (!buffer2.getAs[String](0).isEmpty) { buffer1.getAs[String](0) + "," + buffer2.getAs[String](0) } } override def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = "" } override def deterministic: Boolean = true override def evaluate(buffer: Row): Any = { buffer.getAs[String](0) } override def dataType: DataType = StringType } {code} Running the UDAF: {code} val df: DataFrame = sc.parallelize(List("abcd", "efgh", "hijk")).toDF("alpha") val merge = new Merge() df.groupBy().agg(merge(col("alpha")).as("Merge")).show() {code} Throw following exception: {code} java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247) at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102) at 
[jira] [Updated] (SPARK-11372) custom UDAF with StringType throws java.lang.ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-11372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pravin Gadakh updated SPARK-11372:
--
Description: Consider the following custom UDAF, which uses one StringType column as the intermediate buffer's schema:

{code:title=Merge.scala|borderStyle=solid}
class Merge extends UserDefinedAggregateFunction {

  override def inputSchema: StructType =
    StructType(StructField("value", StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = if (buffer.getAs[String](0).isEmpty) {
      input.getAs[String](0)
    } else {
      buffer.getAs[String](0) + "," + input.getAs[String](0)
    }
  }

  override def bufferSchema: StructType =
    StructType(StructField("merge", StringType) :: Nil)

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = if (buffer1.getAs[String](0).isEmpty) {
      buffer2.getAs[String](0)
    } else if (!buffer2.getAs[String](0).isEmpty) {
      buffer1.getAs[String](0) + "," + buffer2.getAs[String](0)
    }
  }

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = ""
  }

  override def deterministic: Boolean = true

  override def evaluate(buffer: Row): Any = {
    buffer.getAs[String](0)
  }

  override def dataType: DataType = StringType
}
{code}

Running the UDAF:

{code}
val df: DataFrame = sc.parallelize(List("abcd", "efgh", "hijk")).toDF("alpha")
val merge = new Merge()
df.groupBy().agg(merge(col("alpha")).as("Merge")).show()
{code}

throws the following exception:

{code}
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
  at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getUTF8String(rows.scala:247)
  at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
  at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:373)
  at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$32.apply(AggregationIterator.scala:362)
  at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:141)
  at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
  at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
  at org.apache.spark.scheduler.Task.run(Task.scala:88)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
{code}
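The ClassCastException above is raised inside Spark's generated code, not by the aggregation logic itself. As a plain-Python sketch (not Spark code) of what the Merge UDAF is meant to compute, the semantics are: fold string values into one comma-separated string, with the empty string as the zero buffer. Note the Scala `merge` has no final `else` branch; this sketch returns `buffer1` unchanged in that case.

```python
# Illustrative, non-Spark sketch of the Merge UDAF's intended semantics.

def initialize():
    # Matches Merge.initialize: the buffer starts as an empty string.
    return ""

def update(buffer, value):
    # Matches Merge.update: append the input, comma-separated.
    return value if buffer == "" else buffer + "," + value

def merge(buffer1, buffer2):
    # Matches Merge.merge, plus the missing "both paths fail" branch:
    # keep buffer1 unchanged when buffer2 is empty.
    if buffer1 == "":
        return buffer2
    if buffer2 != "":
        return buffer1 + "," + buffer2
    return buffer1

rows = ["abcd", "efgh", "hijk"]
buf = initialize()
for r in rows:
    buf = update(buf, r)
print(buf)  # abcd,efgh,hijk
```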
[jira] [Updated] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private
[ https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11332: -- Assignee: Nakul Jindal > WeightedLeastSquares should use ml features generic Instance class instead of > private > - > > Key: SPARK-11332 > URL: https://issues.apache.org/jira/browse/SPARK-11332 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: holdenk >Assignee: Nakul Jindal >Priority: Minor > Fix For: 1.6.0 > > > WeightedLeastSquares should use the common Instance class in ml.feature > instead of a private one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11369) SparkR glm should support setting standardize
[ https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11369: Assignee: Apache Spark > SparkR glm should support setting standardize > - > > Key: SPARK-11369 > URL: https://issues.apache.org/jira/browse/SPARK-11369 > Project: Spark > Issue Type: Improvement > Components: ML, R >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > SparkR glm currently support : > formula, family = c(“gaussian”, “binomial”), data, lambda = 0, alpha = 0 > We should also support setting standardize which has been defined at [design > documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11369) SparkR glm should support setting standardize
[ https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978256#comment-14978256 ] Apache Spark commented on SPARK-11369: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/9331 > SparkR glm should support setting standardize > - > > Key: SPARK-11369 > URL: https://issues.apache.org/jira/browse/SPARK-11369 > Project: Spark > Issue Type: Improvement > Components: ML, R >Reporter: Yanbo Liang >Priority: Minor > > SparkR glm currently support : > formula, family = c(“gaussian”, “binomial”), data, lambda = 0, alpha = 0 > We should also support setting standardize which has been defined at [design > documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11369) SparkR glm should support setting standardize
[ https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11369: Assignee: (was: Apache Spark) > SparkR glm should support setting standardize > - > > Key: SPARK-11369 > URL: https://issues.apache.org/jira/browse/SPARK-11369 > Project: Spark > Issue Type: Improvement > Components: ML, R >Reporter: Yanbo Liang >Priority: Minor > > SparkR glm currently support : > formula, family = c(“gaussian”, “binomial”), data, lambda = 0, alpha = 0 > We should also support setting standardize which has been defined at [design > documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11317) YARN HBase token code shouldn't swallow invocation target exceptions
[ https://issues.apache.org/jira/browse/SPARK-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-11317: --- Affects Version/s: 1.5.1 > YARN HBase token code shouldn't swallow invocation target exceptions > > > Key: SPARK-11317 > URL: https://issues.apache.org/jira/browse/SPARK-11317 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Steve Loughran > > As with SPARK-11265; the HBase token retrieval code of SPARK-6918 > 1. swallows exceptions it should be rethrowing as serious problems (e.g > NoSuchMethodException) > 1. Swallows any exception raised by the HBase client, without even logging > the details (it logs that an `InvocationTargetException` was caught, but not > the contents) > As such it is potentially brittle to changes in the HBase client code, and > absolutely not going to provide any assistance if HBase won't actually issue > tokens to the caller. > The code in SPARK-11265 can be re-used to provide consistent and better > exception processing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11317) YARN HBase token code shouldn't swallow invocation target exceptions
[ https://issues.apache.org/jira/browse/SPARK-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-11317: --- Description: As with SPARK-11265; the HBase token retrieval code of SPARK-6918 1. swallows exceptions it should be rethrowing as serious problems (e.g NoSuchMethodException) 1. Swallows any exception raised by the HBase client, without even logging the details (it logs that an `InvocationTargetException` was caught, but not the contents) As such it is potentially brittle to changes in the HBase client code, and absolutely not going to provide any assistance if HBase won't actually issue tokens to the caller. The code in SPARK-11265 can be re-used to provide consistent and better exception processing was: As with SPARK-11265; the HBase token retrieval code of SPARK-6918 1. swallows exceptions it should be rethrowing as serious problems (e.g NoSuchMethodException) 1. Swallows any exception raised by the HBase client, without even logging the details (it logs that an `InvocationTargetException` was caught, but not the contents) As such it is potentially brittle to changes in the HDFS client code, and absolutely not going to provide any assistance if HBase won't actually issue tokens to the caller. The code in SPARK-11265 can be re-used to provide consistent and better exception processing > YARN HBase token code shouldn't swallow invocation target exceptions > > > Key: SPARK-11317 > URL: https://issues.apache.org/jira/browse/SPARK-11317 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Steve Loughran > > As with SPARK-11265; the HBase token retrieval code of SPARK-6918 > 1. swallows exceptions it should be rethrowing as serious problems (e.g > NoSuchMethodException) > 1. 
Swallows any exception raised by the HBase client, without even logging > the details (it logs that an `InvocationTargetException` was caught, but not > the contents) > As such it is potentially brittle to changes in the HBase client code, and > absolutely not going to provide any assistance if HBase won't actually issue > tokens to the caller. > The code in SPARK-11265 can be re-used to provide consistent and better > exception processing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
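The handling pattern the issue asks for (and which SPARK-11265's code could provide) can be illustrated with a hedged, non-Spark Python sketch: unwrap the reflective failure, log its full details, and rethrow, instead of catching and ignoring. The names `obtain_token` and `client_factory` are hypothetical.

```python
import logging

logger = logging.getLogger("token-retrieval")

def obtain_token(client_factory):
    """Invoke an HBase-client-like factory; never swallow its failures."""
    try:
        return client_factory()
    except Exception as exc:
        # Log the wrapped failure in full (the reported code logged only that
        # an InvocationTargetException was caught, not its contents), then
        # rethrow so the caller sees why no token was issued.
        logger.error("token retrieval failed: %r", exc)
        raise

def failing_client():
    raise RuntimeError("HBase refused to issue a delegation token")
```

Rethrowing preserves the failure for the caller while the log line keeps the diagnostic trail that the current code discards.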
[jira] [Created] (SPARK-11373) Add metrics to the History Server and providers
Steve Loughran created SPARK-11373: -- Summary: Add metrics to the History Server and providers Key: SPARK-11373 URL: https://issues.apache.org/jira/browse/SPARK-11373 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.5.1 Reporter: Steve Loughran The History server doesn't publish metrics about JVM load or anything from the history provider plugins. This means that performance problems from massive job histories aren't visible to management tools, and nor are any provider-generated metrics such as time to load histories, failed history loads, the number of connectivity failures talking to remote services, etc. If the history server set up a metrics registry and offered the option to publish its metrics, then management tools could view this data. # the metrics registry would need to be passed down to the instantiated {{ApplicationHistoryProvider}}, in order for it to register its metrics. # if the codahale metrics servlet were registered under a path such as {{/metrics}}, the values would be visible as HTML and JSON, without the need for management tools. # Integration tests could also retrieve the JSON-formatted data and use it as part of the test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11373) Add metrics to the History Server and providers
[ https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978322#comment-14978322 ] Steve Loughran commented on SPARK-11373: # This has tangible benefit for the SPARK-1537 YARN ATS binding, because connectivity failures, GET performance and similar do surface. There are some {{AtomicLong}} counters in its {{YarnHistoryProvider}}, but I'm not planning to add counters and metrics until after that is checked in. # All providers will benefit from the standard JVM performance counters, GC &c. # the FS history provider could also track time to list and load histories; time of last refresh, time to load most recent history, etc —information needed to identify where an unresponsive UI is getting its problems from. > Add metrics to the History Server and providers > --- > > Key: SPARK-11373 > URL: https://issues.apache.org/jira/browse/SPARK-11373 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Steve Loughran > > The History server doesn't publish metrics about JVM load or anything from > the history provider plugins. This means that performance problems from > massive job histories aren't visible to management tools, and nor are any > provider-generated metrics such as time to load histories, failed history > loads, the number of connectivity failures talking to remote services, etc. > If the history server set up a metrics registry and offered the option to > publish its metrics, then management tools could view this data. > # the metrics registry would need to be passed down to the instantiated > {{ApplicationHistoryProvider}}, in order for it to register its metrics. > # if the codahale metrics servlet were registered under a path such as > {{/metrics}}, the values would be visible as HTML and JSON, without the need > for management tools. 
> # Integration tests could also retrieve the JSON-formatted data and use it as > part of the test suites. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
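The registry-plus-JSON idea above can be sketched with a toy stand-in (not the codahale library, and the metric names here are hypothetical): providers register counters and gauges, and a `/metrics`-style endpoint serves a JSON snapshot that management tools or integration tests can read.

```python
import json

class MetricsRegistry:
    """Toy stand-in for a codahale-style metrics registry (illustrative only)."""

    def __init__(self):
        self._counters = {}
        self._gauges = {}

    def inc(self, name, n=1):
        # Counter: e.g. failed history loads, connectivity failures.
        self._counters[name] = self._counters.get(name, 0) + n

    def gauge(self, name, fn):
        # Gauge: a callable sampled at snapshot time.
        self._gauges[name] = fn

    def snapshot(self):
        # What a metrics servlet could serve as JSON.
        return json.dumps({
            "counters": dict(self._counters),
            "gauges": {k: fn() for k, fn in self._gauges.items()},
        }, sort_keys=True)

registry = MetricsRegistry()
registry.inc("history.provider.load.failures")          # hypothetical name
registry.gauge("history.provider.app.count", lambda: 3)  # hypothetical name
print(registry.snapshot())
```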
[jira] [Created] (SPARK-11374) skip.header.line.count is ignored in HiveContext
Daniel Haviv created SPARK-11374: Summary: skip.header.line.count is ignored in HiveContext Key: SPARK-11374 URL: https://issues.apache.org/jira/browse/SPARK-11374 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Daniel Haviv A CSV table in Hive is configured to skip the header row using TBLPROPERTIES("skip.header.line.count"="1"). When querying it from Hive, the header row is not included in the data, but when running the same query via HiveContext the header row is returned. "show create table " via the HiveContext confirms that it is aware of the setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
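What the table property is supposed to do can be shown with a small, non-Hive Python sketch: discard the first N physical lines of each backing file before parsing rows. The bug is that Hive honors this on read while HiveContext does not.

```python
import csv
import io

def read_rows(file_obj, skip_header_lines=1):
    # Discard the first N physical lines (the header), then parse the rest
    # as CSV rows -- the behavior skip.header.line.count=1 requests.
    for _ in range(skip_header_lines):
        file_obj.readline()
    return list(csv.reader(file_obj))

table_file = io.StringIO("name,age\nalice,30\nbob,25\n")
print(read_rows(table_file))  # [['alice', '30'], ['bob', '25']]
```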
[jira] [Created] (SPARK-11375) History Server "no histories" message to be dynamically generated by ApplicationHistoryProviders
Steve Loughran created SPARK-11375: -- Summary: History Server "no histories" message to be dynamically generated by ApplicationHistoryProviders Key: SPARK-11375 URL: https://issues.apache.org/jira/browse/SPARK-11375 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.5.1 Reporter: Steve Loughran Priority: Minor When there are no histories, the {{HistoryPage}} displays an error text which assumes that the provider is the {{FsHistoryProvider}}, and its sole failure mode is "directory not found" {code} Did you specify the correct logging directory? Please verify your setting of spark.history.fs.logDirectory {code} Different providers have different failure modes, and even the filesystem provider has some, such as an access control exception, or the specified directly path actually being a file. If the {{ApplicationHistoryProvider}} was itself asked to provide an error message, then it could * be dynamically generated to show the current state of the history provider * potentially include any exceptions to list * display the actual values of settings such as the log directory property. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11375) History Server "no histories" message to be dynamically generated by ApplicationHistoryProviders
[ https://issues.apache.org/jira/browse/SPARK-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978329#comment-14978329 ] Steve Loughran commented on SPARK-11375: This could be implemented with a new method on {{ApplicationHistoryProvider}}; something like {code} getDiagnosticsInfo(): (String, Option[String]) = { ... } {code} which would return two strings: one simple text, and one formatted text for insertion into a {{}} section. That would allow stack traces to be displayed readably. A default implementation would simply return the current message. Note that the HTML would have to be sanitized before display, with angle brackets escaped. > History Server "no histories" message to be dynamically generated by > ApplicationHistoryProviders > > > Key: SPARK-11375 > URL: https://issues.apache.org/jira/browse/SPARK-11375 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Steve Loughran >Priority: Minor > > When there are no histories, the {{HistoryPage}} displays an error text which > assumes that the provider is the {{FsHistoryProvider}}, and its sole failure > mode is "directory not found" > {code} > Did you specify the correct logging directory? > Please verify your setting of spark.history.fs.logDirectory > {code} > Different providers have different failure modes, and even the filesystem > provider has some, such as an access control exception, or the specified > directly path actually being a file. > If the {{ApplicationHistoryProvider}} was itself asked to provide an error > message, then it could > * be dynamically generated to show the current state of the history provider > * potentially include any exceptions to list > * display the actual values of settings such as the log directory property. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
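The sanitization step the comment calls for can be sketched as follows; the function name and markup layout are illustrative, not the actual HistoryPage code. The summary goes into plain markup and the optional detail (e.g. a stack trace) into a preformatted block, with both escaped so provider-supplied text cannot inject HTML.

```python
from html import escape

def render_diagnostics(message, detail=None):
    # Escape angle brackets and other HTML-special characters before
    # inserting provider-supplied diagnostics into the page.
    parts = ["<p>%s</p>" % escape(message)]
    if detail is not None:
        parts.append("<pre>%s</pre>" % escape(detail))
    return "\n".join(parts)

html = render_diagnostics(
    "No event logs found",
    "java.io.FileNotFoundException: <configured log directory>")
print(html)
```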
[jira] [Resolved] (SPARK-11313) Implement cogroup
[ https://issues.apache.org/jira/browse/SPARK-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11313. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9324 [https://github.com/apache/spark/pull/9324] > Implement cogroup > - > > Key: SPARK-11313 > URL: https://issues.apache.org/jira/browse/SPARK-11313 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11313) Implement cogroup
[ https://issues.apache.org/jira/browse/SPARK-11313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11313: - Assignee: Wenchen Fan > Implement cogroup > - > > Key: SPARK-11313 > URL: https://issues.apache.org/jira/browse/SPARK-11313 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978406#comment-14978406 ] Michael Armbrust commented on SPARK-11303: -- I picked it into branch-1.5, but I'm not sure if it made the cutoff. [~rxin]? > sample (without replacement) + filter returns wrong results in DataFrame > > > Key: SPARK-11303 > URL: https://issues.apache.org/jira/browse/SPARK-11303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: pyspark local mode, linux. >Reporter: Yuval Tanny > Fix For: 1.6.0 > > > When sampling and then filtering a DataFrame from Python, we get inconsistent > results when not caching the sampled DataFrame. This bug doesn't appear in > Spark 1.4.1. > {code} > d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t']) > d_sampled = d.sample(False, 0.1, 1) > print d_sampled.count() > print d_sampled.filter('t = 1').count() > print d_sampled.filter('t != 1').count() > d_sampled.cache() > print d_sampled.count() > print d_sampled.filter('t = 1').count() > print d_sampled.filter('t != 1').count() > {code} > output: > {code} > 14 > 7 > 8 > 14 > 7 > 7 > {code}
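The reported numbers (14 total, but 7 + 8 by filter before caching) break a basic invariant: the two filter counts must partition the sample. A small pure-Python sketch, not Spark's internals, shows the invariant a seeded sample-without-replacement must satisfy when lazily re-evaluated; caching sidesteps the bug because the sample is materialized once:

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Deterministic Bernoulli sample: reseeding with the same seed on
    every evaluation must select the same rows, so that
    count() == filter(p).count() + filter(not p).count()."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

rows = [1] * 50 + [2] * 50
s1 = bernoulli_sample(rows, 0.1, seed=1)
s2 = bernoulli_sample(rows, 0.1, seed=1)  # a second "action" re-evaluates
assert s1 == s2  # identical membership on every evaluation
assert len(s1) == sum(r == 1 for r in s1) + sum(r != 1 for r in s1)
```

If the two evaluations used different effective seeds, the membership (and hence the filtered counts) could disagree, which is exactly the symptom reported.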
[jira] [Comment Edited] (SPARK-11368) Spark scan all partitions when using Python UDF and filter over partitioned column is given
[ https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978412#comment-14978412 ] Maciej Bryński edited comment on SPARK-11368 at 10/28/15 1:27 PM: -- Problem exists only when using Pyspark. When I did the same in Scala console everything is OK. {code} scala> sqlContext.udf.register("multiply2", (x: Long) => x * 2) scala> time(sqlContext.sql("select * from df where id > 990 and multiply2(value) > 20").count()) 353633 microseconds == Physical Plan == TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#32L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#35L]) Project Filter (UDF(value#0L) > 20) Scan ParquetRelation[file:/mnt/mfs/udf_test][value#0L] {code} was (Author: maver1ck): Problem exists only when using Pyspark. When I did the same in Scala console everything is OK. {code} scala> sqlContext.udf.register("multiply2", (x: Long) => x * 2 scala> time(sqlContext.sql("select * from df where id > 990 and multiply2(value) > 20").count()) 353633 microseconds == Physical Plan == TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#32L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#35L]) Project Filter (UDF(value#0L) > 20) Scan ParquetRelation[file:/mnt/mfs/udf_test][value#0L] {code} > Spark scan all partitions when using Python UDF and filter over partitioned > column is given > --- > > Key: SPARK-11368 > URL: https://issues.apache.org/jira/browse/SPARK-11368 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Maciej Bryński >Priority: Critical > > Hi, > I think this is huge performance bug. > I created parquet file partitioned by column. > Then I make query with filter over partition column and filter with UDF. 
> Result is that all partition are scanned. > Sample data: > {code} > rdd = sc.parallelize(range(0,1000)).map(lambda x: > Row(id=x%1000,value=x)).repartition(1) > df = sqlCtx.createDataFrame(rdd) > df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') > df = sqlCtx.read.parquet('/mnt/mfs/udf_test') > df.registerTempTable('df') > {code} > Then queries: > Without udf - Spark reads only 10 partitions: > {code} > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and value > 10').count() > print(time.time() - start) > 0.9993703365325928 > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#22L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#25L]) >Project > Filter (value#5L > 10) > Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L] > {code} > With udf Spark reads all the partitions: > {code} > sqlCtx.registerFunction('multiply2', lambda x: x *2 ) > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > > 20').count() > print(time.time() - start) > 13.0826096534729 > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#34L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#37L]) >TungstenProject > Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0)) > !BatchPythonEvaluation PythonUDF#multiply2(value#5L), > [value#5L,id#6,pythonUDF#33] > Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6] > {code}
[jira] [Commented] (SPARK-11368) Spark scan all partitions when using Python UDF and filter over partitioned column is given
[ https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978412#comment-14978412 ] Maciej Bryński commented on SPARK-11368: The problem exists only when using PySpark. When I run the same query in the Scala console, everything is OK. {code} scala> sqlContext.udf.register("multiply2", (x: Long) => x * 2) scala> time(sqlContext.sql("select * from df where id > 990 and multiply2(value) > 20").count()) 353633 microseconds == Physical Plan == TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#32L]) TungstenExchange SinglePartition TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[currentCount#35L]) Project Filter (UDF(value#0L) > 20) Scan ParquetRelation[file:/mnt/mfs/udf_test][value#0L] {code} > Spark scan all partitions when using Python UDF and filter over partitioned > column is given > --- > > Key: SPARK-11368 > URL: https://issues.apache.org/jira/browse/SPARK-11368 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Maciej Bryński >Priority: Critical > > Hi, > I think this is huge performance bug. > I created parquet file partitioned by column. > Then I make query with filter over partition column and filter with UDF. > Result is that all partition are scanned. 
> Sample data: > {code} > rdd = sc.parallelize(range(0,1000)).map(lambda x: > Row(id=x%1000,value=x)).repartition(1) > df = sqlCtx.createDataFrame(rdd) > df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') > df = sqlCtx.read.parquet('/mnt/mfs/udf_test') > df.registerTempTable('df') > {code} > Then queries: > Without udf - Spark reads only 10 partitions: > {code} > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and value > 10').count() > print(time.time() - start) > 0.9993703365325928 > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#22L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#25L]) >Project > Filter (value#5L > 10) > Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L] > {code} > With udf Spark reads all the partitions: > {code} > sqlCtx.registerFunction('multiply2', lambda x: x *2 ) > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > > 20').count() > print(time.time() - start) > 13.0826096534729 > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#34L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#37L]) >TungstenProject > Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0)) > !BatchPythonEvaluation PythonUDF#multiply2(value#5L), > [value#5L,id#6,pythonUDF#33] > Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6] > {code}
[jira] [Updated] (SPARK-11368) Spark shouldn't scan all partitions when using Python UDF and filter over partitioned column is given
[ https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-11368: --- Summary: Spark shouldn't scan all partitions when using Python UDF and filter over partitioned column is given (was: Spark scan all partitions when using Python UDF and filter over partitioned column is given) > Spark shouldn't scan all partitions when using Python UDF and filter over > partitioned column is given > - > > Key: SPARK-11368 > URL: https://issues.apache.org/jira/browse/SPARK-11368 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Maciej Bryński >Priority: Critical > > Hi, > I think this is huge performance bug. > I created parquet file partitioned by column. > Then I make query with filter over partition column and filter with UDF. > Result is that all partition are scanned. > Sample data: > {code} > rdd = sc.parallelize(range(0,1000)).map(lambda x: > Row(id=x%1000,value=x)).repartition(1) > df = sqlCtx.createDataFrame(rdd) > df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') > df = sqlCtx.read.parquet('/mnt/mfs/udf_test') > df.registerTempTable('df') > {code} > Then queries: > Without udf - Spark reads only 10 partitions: > {code} > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and value > 10').count() > print(time.time() - start) > 0.9993703365325928 > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#22L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#25L]) >Project > Filter (value#5L > 10) > Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L] > {code} > With udf Spark reads all the partitions: > {code} > sqlCtx.registerFunction('multiply2', lambda x: x *2 ) > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > > 20').count() 
> print(time.time() - start) > 13.0826096534729 > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#34L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#37L]) >TungstenProject > Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0)) > !BatchPythonEvaluation PythonUDF#multiply2(value#5L), > [value#5L,id#6,pythonUDF#33] > Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6] > {code}
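The behavior the ticket asks for amounts to splitting the filter's conjuncts: predicates that reference only partition columns and contain no opaque Python UDF can still prune partitions, while the UDF conjunct is evaluated after the scan. A hedged Python sketch with an illustrative predicate model; this is not Spark's Catalyst code:

```python
def split_conjuncts(predicates, partition_cols):
    """Split filter conjuncts into those usable for partition pruning
    (reference only partition columns, no opaque UDF) and the residual
    ones evaluated after the scan. Predicates are modeled as
    (columns, is_udf, fn) triples; the representation is hypothetical."""
    pushable, residual = [], []
    for cols, is_udf, fn in predicates:
        if not is_udf and set(cols) <= set(partition_cols):
            pushable.append((cols, is_udf, fn))
        else:
            residual.append((cols, is_udf, fn))
    return pushable, residual

preds = [
    (["id"], False, lambda r: r["id"] >= 990),         # prunable conjunct
    (["value"], True, lambda r: r["value"] * 2 > 20),  # Python UDF conjunct
]
pushable, residual = split_conjuncts(preds, partition_cols=["id"])
assert len(pushable) == 1 and len(residual) == 1
```

Under this split, `id >= 990` would still prune to 10 partitions even though `multiply2(value) > 20` must run row-by-row after the scan; the reported bug is that the presence of the UDF conjunct disables the prunable one as well.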
[jira] [Assigned] (SPARK-11371) Make "mean" an alias for "avg" operator
[ https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11371: Assignee: (was: Apache Spark) > Make "mean" an alias for "avg" operator > --- > > Key: SPARK-11371 > URL: https://issues.apache.org/jira/browse/SPARK-11371 > Project: Spark > Issue Type: Improvement >Reporter: Ted Yu >Priority: Minor > Attachments: spark-11371-v1.patch > > > From Reynold in the thread 'Exception when using some aggregate operators' > (http://search-hadoop.com/m/q3RTt0xFr22nXB4/): > I don't think these are bugs. The SQL standard for average is "avg", not > "mean". Similarly, a distinct count is supposed to be written as > "count(distinct col)", not "countDistinct(col)". > We can, however, make "mean" an alias for "avg" to improve compatibility > between DataFrame and SQL.
[jira] [Commented] (SPARK-11371) Make "mean" an alias for "avg" operator
[ https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978482#comment-14978482 ] Apache Spark commented on SPARK-11371: -- User 'ted-yu' has created a pull request for this issue: https://github.com/apache/spark/pull/9332 > Make "mean" an alias for "avg" operator > --- > > Key: SPARK-11371 > URL: https://issues.apache.org/jira/browse/SPARK-11371 > Project: Spark > Issue Type: Improvement >Reporter: Ted Yu >Priority: Minor > Attachments: spark-11371-v1.patch > > > From Reynold in the thread 'Exception when using some aggregate operators' > (http://search-hadoop.com/m/q3RTt0xFr22nXB4/): > I don't think these are bugs. The SQL standard for average is "avg", not > "mean". Similarly, a distinct count is supposed to be written as > "count(distinct col)", not "countDistinct(col)". > We can, however, make "mean" an alias for "avg" to improve compatibility > between DataFrame and SQL.
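Aliasing "mean" to "avg" can be sketched as registering two names against one implementation, so both resolve to the same function. This is only an illustration of the idea, not Spark's actual FunctionRegistry API:

```python
# Minimal function registry with aliasing (names illustrative only):
# both "avg" and "mean" resolve to the same implementation, so SQL
# written with either name behaves identically.
registry = {}

def register(name, fn, aliases=()):
    registry[name] = fn
    for alias in aliases:
        registry[alias] = fn

register("avg", lambda xs: sum(xs) / len(xs), aliases=("mean",))
assert registry["mean"]([1, 2, 3]) == registry["avg"]([1, 2, 3]) == 2.0
```

Because the alias points at the identical object, the two names can never drift apart in behavior.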
[jira] [Created] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor
Cheng Lian created SPARK-11376: -- Summary: Invalid generated Java code in GenerateColumnAccessor Key: SPARK-11376 URL: https://issues.apache.org/jira/browse/SPARK-11376 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor There are two {{mutableRow}} fields in the generated code within the {{GenerateColumnAccessor.create()}} method. However, Janino happily accepts this code and it operates normally. (That's why this bug is marked as minor.) After some experiments, it turned out that Janino doesn't complain as long as all the fields with the same name are null.
[jira] [Created] (SPARK-11377) withNewChildren should not convert StructType to Seq
Michael Armbrust created SPARK-11377: Summary: withNewChildren should not convert StructType to Seq Key: SPARK-11377 URL: https://issues.apache.org/jira/browse/SPARK-11377 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Reporter: Michael Armbrust Assignee: Michael Armbrust
[jira] [Assigned] (SPARK-11371) Make "mean" an alias for "avg" operator
[ https://issues.apache.org/jira/browse/SPARK-11371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11371: Assignee: Apache Spark > Make "mean" an alias for "avg" operator > --- > > Key: SPARK-11371 > URL: https://issues.apache.org/jira/browse/SPARK-11371 > Project: Spark > Issue Type: Improvement >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Minor > Attachments: spark-11371-v1.patch > > > From Reynold in the thread 'Exception when using some aggregate operators' > (http://search-hadoop.com/m/q3RTt0xFr22nXB4/): > I don't think these are bugs. The SQL standard for average is "avg", not > "mean". Similarly, a distinct count is supposed to be written as > "count(distinct col)", not "countDistinct(col)". > We can, however, make "mean" an alias for "avg" to improve compatibility > between DataFrame and SQL.
[jira] [Assigned] (SPARK-11377) withNewChildren should not convert StructType to Seq
[ https://issues.apache.org/jira/browse/SPARK-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11377: Assignee: Michael Armbrust (was: Apache Spark) > withNewChildren should not convert StructType to Seq > > > Key: SPARK-11377 > URL: https://issues.apache.org/jira/browse/SPARK-11377 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >
[jira] [Commented] (SPARK-11377) withNewChildren should not convert StructType to Seq
[ https://issues.apache.org/jira/browse/SPARK-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978505#comment-14978505 ] Apache Spark commented on SPARK-11377: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/9334 > withNewChildren should not convert StructType to Seq > > > Key: SPARK-11377 > URL: https://issues.apache.org/jira/browse/SPARK-11377 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >
[jira] [Assigned] (SPARK-11377) withNewChildren should not convert StructType to Seq
[ https://issues.apache.org/jira/browse/SPARK-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11377: Assignee: Apache Spark (was: Michael Armbrust) > withNewChildren should not convert StructType to Seq > > > Key: SPARK-11377 > URL: https://issues.apache.org/jira/browse/SPARK-11377 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Michael Armbrust >Assignee: Apache Spark >
[jira] [Commented] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978509#comment-14978509 ] Apache Spark commented on SPARK-11376: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/9335 > Invalid generated Java code in GenerateColumnAccessor > - > > Key: SPARK-11376 > URL: https://issues.apache.org/jira/browse/SPARK-11376 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > There are two {{mutableRow}} fields in the generated code within > {{GenerateColumnAccessor.create()}} method. However, Janino happily accepted > this code and operates normally. (That's why this bug is marked as minor.) > After some experiments, it turned out that Janino doesn't complain as long as > all the fields with the same name are null.
[jira] [Assigned] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11376: Assignee: Apache Spark (was: Cheng Lian) > Invalid generated Java code in GenerateColumnAccessor > - > > Key: SPARK-11376 > URL: https://issues.apache.org/jira/browse/SPARK-11376 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Apache Spark >Priority: Minor > > There are two {{mutableRow}} fields in the generated code within > {{GenerateColumnAccessor.create()}} method. However, Janino happily accepted > this code and operates normally. (That's why this bug is marked as minor.) > After some experiments, it turned out that Janino doesn't complain as long as > all the fields with the same name are null.
[jira] [Assigned] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11376: Assignee: Cheng Lian (was: Apache Spark) > Invalid generated Java code in GenerateColumnAccessor > - > > Key: SPARK-11376 > URL: https://issues.apache.org/jira/browse/SPARK-11376 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > > There are two {{mutableRow}} fields in the generated code within > {{GenerateColumnAccessor.create()}} method. However, Janino happily accepted > this code and operates normally. (That's why this bug is marked as minor.) > After some experiments, it turned out that Janino doesn't complain as long as > all the fields with the same name are null.
[jira] [Commented] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results
[ https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978528#comment-14978528 ] Saif Addin Ellafi commented on SPARK-11330: --- Hello Cheng Hao, and thank you very much for trying to reproduce. You are right, this does not happen on very small data. I started reproducing it with a slightly more complex +100 rows of data. This is the smallest I can do, I am not sure I understand the issue completely. 1. Put in json file {"mdl_loan_state":"PPD00 ","filemonth_dt":"01NOV2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2014"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"} {"mdl_loan_state":"CF0 ","filemonth_dt":"01OCT2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2010"} {"mdl_loan_state":"PPD56 ","filemonth_dt":"01NOV2009"} {"mdl_loan_state":"PPD3 ","filemonth_dt":"01SEP2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"} 
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2014"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2014"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2014"} {"mdl_loan_state":"CF0 ","filemonth_dt":"01NOV2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2014"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01NOV2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2011"} 
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2008"} {"mdl_loan_state":"PPD00 ","filem
[jira] [Updated] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results
[ https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saif Addin Ellafi updated SPARK-11330: -- Attachment: bug_reproduce.zip Json dataset folder > Filter operation on StringType after groupBy PERSISTED brings no results > > > Key: SPARK-11330 > URL: https://issues.apache.org/jira/browse/SPARK-11330 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 > Environment: Stand alone Cluster of five servers (happens as well in > local mode). sqlContext instance of HiveContext (happens as well with > SQLContext) > No special options other than driver memory and executor memory. > Parquet partitions are 512 where there are 160 cores. Happens as well with > other partitioning > Data is nearly 2 billion rows. >Reporter: Saif Addin Ellafi >Priority: Blocker > Attachments: bug_reproduce.zip, bug_reproduce.zip > > > ONLY HAPPENS WHEN PERSIST() IS CALLED > val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt") > data.groupBy("vintages").count.select("vintages").filter("vintages = > '2007-01-01'").first > >>> res9: org.apache.spark.sql.Row = [2007-01-01] > data.groupBy("vintages").count.persist.select("vintages").filter("vintages = > '2007-01-01'").first > >>> Exception on empty iterator stuff > This does not happen if using another type of field, eg IntType > data.groupBy("mm").count.persist.select("mm").filter("mm = > 200805").first >>> res13: org.apache.spark.sql.Row = [200805] > NOTE: I have no idea whether I used KRYO serializer when stored this parquet. > NOTE2: If setting the persist after the filtering, it works fine. But this is > not a good enough workaround since any filter operation afterwards will break > results. > NOTE3: I have reproduced the issue with several different datasets. > NOTE4: The only real workaround is to store the groupBy dataframe in database > and reload it as a new dataframe. 
> Query to raw-data works fine: > data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: > org.apache.spark.sql.Row = [2007-01-01] > Originally, the issue happened with a larger aggregation operation, the > result was that data was inconsistent bringing different results every call. > Reducing the operation step by step, I got into this issue. > In any case, the original operation was: > val data = sqlContext.read.parquet("/var/Saif/data_pqt") > val res = data.groupBy("product", "band", "age", "vint", "mb", > "mm").agg(count($"account_id").as("N"), > sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), > sum($"spend").as("spend"), sum($"payment").as("payment"), > sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" > === 1).as("newacct")).persist() > val z = res.select("vint", "mm").filter("vint = > '2007-01-01'").select("mm").distinct.collect > z.length > >>> res0: Int = 102 > res.unpersist() > val z = res.select("vint", "mm").filter("vint = > '2007-01-01'").select("mm").distinct.collect > z.length > >>> res1: Int = 103
[jira] [Created] (SPARK-11378) StreamingContext.awaitTerminationOrTimeout does not return
Nick Evans created SPARK-11378: -- Summary: StreamingContext.awaitTerminationOrTimeout does not return Key: SPARK-11378 URL: https://issues.apache.org/jira/browse/SPARK-11378 Project: Spark Issue Type: Bug Components: PySpark, Streaming Affects Versions: 1.5.1 Reporter: Nick Evans Priority: Minor The docs for {{StreamingContext.awaitTerminationOrTimeout}} state it will "Return `true` if it's stopped; (...) or `false` if the waiting time elapsed before returning from the method." This is currently not the case - the function does not return and thus any logic built on awaitTerminationOrTimeout will not work.
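The documented contract (return true if stopped, false if the waiting time elapsed) maps directly onto an event wait. A minimal Python sketch of the expected semantics, using a hypothetical class rather than PySpark's actual implementation:

```python
import threading

class StoppableContext:
    """Sketch of the documented contract: wait up to `timeout`
    seconds and return True if stopped, False if the timeout elapsed.
    threading.Event.wait provides exactly this semantics."""
    def __init__(self):
        self._stopped = threading.Event()

    def stop(self):
        self._stopped.set()

    def await_termination_or_timeout(self, timeout):
        return self._stopped.wait(timeout)

ctx = StoppableContext()
assert ctx.await_termination_or_timeout(0.05) is False  # timed out
ctx.stop()
assert ctx.await_termination_or_timeout(0.05) is True   # stopped
```

The bug report says the PySpark method never returns at all, so any polling loop written against this contract hangs instead of seeing the boolean.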
[jira] [Updated] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results
[ https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saif Addin Ellafi updated SPARK-11330: -- Attachment: bug_reproduce.zip > Filter operation on StringType after groupBy PERSISTED brings no results > > > Key: SPARK-11330 > URL: https://issues.apache.org/jira/browse/SPARK-11330 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 > Environment: Stand alone Cluster of five servers (happens as well in > local mode). sqlContext instance of HiveContext (happens as well with > SQLContext) > No special options other than driver memory and executor memory. > Parquet partitions are 512 where there are 160 cores. Happens as well with > other partitioning > Data is nearly 2 billion rows. >Reporter: Saif Addin Ellafi >Priority: Blocker > Attachments: bug_reproduce.zip > > > ONLY HAPPENS WHEN PERSIST() IS CALLED > val data = sqlContext.read.parquet("/var/data/Saif/ppnr_pqt") > data.groupBy("vintages").count.select("vintages").filter("vintages = > '2007-01-01'").first > >>> res9: org.apache.spark.sql.Row = [2007-01-01] > data.groupBy("vintages").count.persist.select("vintages").filter("vintages = > '2007-01-01'").first > >>> Exception on empty iterator stuff > This does not happen if using another type of field, eg IntType > data.groupBy("mm").count.persist.select("mm").filter("mm = > 200805").first >>> res13: org.apache.spark.sql.Row = [200805] > NOTE: I have no idea whether I used KRYO serializer when stored this parquet. > NOTE2: If setting the persist after the filtering, it works fine. But this is > not a good enough workaround since any filter operation afterwards will break > results. > NOTE3: I have reproduced the issue with several different datasets. > NOTE4: The only real workaround is to store the groupBy dataframe in database > and reload it as a new dataframe. 
> Query to raw-data works fine: > data.select("vintages").filter("vintages = '2007-01-01'").first >>> res4: > org.apache.spark.sql.Row = [2007-01-01] > Originally, the issue happened with a larger aggregation operation, the > result was that data was inconsistent bringing different results every call. > Reducing the operation step by step, I got into this issue. > In any case, the original operation was: > val data = sqlContext.read.parquet("/var/Saif/data_pqt") > val res = data.groupBy("product", "band", "age", "vint", "mb", > "mm").agg(count($"account_id").as("N"), > sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), > sum($"spend").as("spend"), sum($"payment").as("payment"), > sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct" > === 1).as("newacct")).persist() > val z = res.select("vint", "mm").filter("vint = > '2007-01-01'").select("mm").distinct.collect > z.length > >>> res0: Int = 102 > res.unpersist() > val z = res.select("vint", "mm").filter("vint = > '2007-01-01'").select("mm").distinct.collect > z.length > >>> res1: Int = 103
[jira] [Assigned] (SPARK-11378) StreamingContext.awaitTerminationOrTimeout does not return
[ https://issues.apache.org/jira/browse/SPARK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11378: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11376: --- Priority: Major (was: Minor) > Invalid generated Java code in GenerateColumnAccessor > - > > Key: SPARK-11376 > URL: https://issues.apache.org/jira/browse/SPARK-11376 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > There are two {{mutableRow}} fields in the generated code within the > {{GenerateColumnAccessor.create()}} method. However, Janino happily accepts > this code and operates normally. (That's why this bug was originally marked as minor.) > After some experiments, it turned out that Janino doesn't complain as long as > all the fields with the same name are null.
[jira] [Comment Edited] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results
[ https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978528#comment-14978528 ] Saif Addin Ellafi edited comment on SPARK-11330 at 10/28/15 2:46 PM: - Hello Cheng Hao, and thank you very much for trying to reproduce. You are right, this does not happen on very small data. I started reproducing it with a slightly more complex +100 rows of data. This is the smallest I can do, I am not sure I understand the issue completely. 1. Put in json file (I have also attached a zip file with the dataset in case copy paste broke something) {"mdl_loan_state":"PPD00 ","filemonth_dt":"01NOV2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2014"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"} {"mdl_loan_state":"CF0 ","filemonth_dt":"01OCT2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2010"} {"mdl_loan_state":"PPD56 ","filemonth_dt":"01NOV2009"} {"mdl_loan_state":"PPD3 ","filemonth_dt":"01SEP2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"} 
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUL2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2014"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2014"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2014"} {"mdl_loan_state":"CF0 ","filemonth_dt":"01NOV2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2014"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JUN2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01DEC2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01NOV2010"} 
{"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01AUG2011"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01OCT2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01SEP2013"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01APR2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2012"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2010"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAR2008"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01JAN2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01MAY2009"} {"mdl_loan_state":"PPD00 ","filemonth_dt":"01FEB2008"} {"md
[jira] [Assigned] (SPARK-11378) StreamingContext.awaitTerminationOrTimeout does not return
[ https://issues.apache.org/jira/browse/SPARK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11378: Assignee: Apache Spark
[jira] [Updated] (SPARK-11376) Invalid generated Java code in GenerateColumnAccessor
[ https://issues.apache.org/jira/browse/SPARK-11376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11376: --- Description: There are two {{mutableRow}} fields in the generated code within {{GenerateColumnAccessor.create()}} method. However, Janino happily accepted this code and accidentally operates normally. After some experiments, it turned out that Janino doesn't complain as long as all the fields with the same name are null. was: There are two {{mutableRow}} fields in the generated code within {{GenerateColumnAccessor.create()}} method. However, Janino happily accepted this code and operates normally. (That's why this bug is marked as minor.) After some experiments, it turned out that Janino doesn't complain as long as all the fields with the same name are null.
[jira] [Commented] (SPARK-11378) StreamingContext.awaitTerminationOrTimeout does not return
[ https://issues.apache.org/jira/browse/SPARK-11378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978541#comment-14978541 ] Apache Spark commented on SPARK-11378: -- User 'manygrams' has created a pull request for this issue: https://github.com/apache/spark/pull/9336
[jira] [Updated] (SPARK-11330) Filter operation on StringType after groupBy PERSISTED brings no results
[ https://issues.apache.org/jira/browse/SPARK-11330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saif Addin Ellafi updated SPARK-11330: -- Attachment: (was: bug_reproduce.zip)
[jira] [Updated] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9836: - Assignee: Yanbo Liang > Provide R-like summary statistics for ordinary least squares via normal > equation solver > --- > > Key: SPARK-9836 > URL: https://issues.apache.org/jira/browse/SPARK-9836 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > In R, model fitting comes with summary statistics. We can provide most of > those via the normal equation solver (SPARK-9834). If some statistics require > additional passes over the dataset, we can expose an option to let users select > the desired statistics before model fitting. > {code} > > summary(model) > Call: > glm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris) > Deviance Residuals: > Min 1Q Median 3Q Max > -1.30711 -0.25713 -0.05325 0.19542 1.41253 > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 2.2514 0.3698 6.089 9.57e-09 *** > Sepal.Width 0.8036 0.1063 7.557 4.19e-12 *** > Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 *** > Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 *** > --- > Signif. codes: > 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > (Dispersion parameter for gaussian family taken to be 0.1918059) > Null deviance: 102.168 on 149 degrees of freedom > Residual deviance: 28.004 on 146 degrees of freedom > AIC: 183.94 > Number of Fisher Scoring iterations: 2 > {code}
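As a rough illustration of why the basic statistics need no extra passes: once the normal equations are solved, coefficient standard errors and t-values fall out of the residual sum of squares. A hedged, plain-Python sketch for the single-predictor case (illustrative only, not the proposed Spark ML implementation):

```python
import math

def ols_summary(x, y):
    """Single-predictor OLS (y = b0 + b1*x) via the normal equations,
    plus the R-like statistics derivable from the same solve."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                    # slope from the normal equations
    b0 = my - b1 * mx                 # intercept
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)            # dispersion (residual variance)
    se_b1 = math.sqrt(sigma2 / sxx)   # std. error of the slope
    return {"estimate": (b0, b1), "se_slope": se_b1,
            "t_slope": b1 / se_b1, "residual_df": n - 2}

s = ols_summary([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(s["estimate"])  # intercept near 0.15, slope near 1.94
```

P-values would additionally need a t-distribution CDF, which is where an R-comparison test (lm/glm) becomes useful.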
[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978600#comment-14978600 ] Xiangrui Meng commented on SPARK-9836: -- [~yanboliang] Note that the feature freeze deadline for 1.6 is the end of the month. You can check the implementation in https://github.com/AlteryxLabs/sparkGLM, and the unit tests should be verified against R lm/glm. cc [~cafreeman] and Dan Putler.
[jira] [Commented] (SPARK-11337) Make example code in user guide testable
[ https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978603#comment-14978603 ] Xiangrui Meng commented on SPARK-11337: --- How about per markdown file? I don't want to create too many sub-tasks. This is documentation work. We can make changes during the QA month for 1.6. > Make example code in user guide testable > > > Key: SPARK-11337 > URL: https://issues.apache.org/jira/browse/SPARK-11337 > Project: Spark > Issue Type: Umbrella > Components: Documentation >Reporter: Xiangrui Meng >Assignee: Xusen Yin >Priority: Critical > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > One option I propose is to move actual example code to spark/examples and > test compilation in Jenkins builds. Then in the markdown, we can reference > part of the code to show in the user guide. This requires adding a Jekyll tag > that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code} > {% include_example scala ml.KMeansExample guide %} > {code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` > and pick code blocks marked "example" and put them under `{% highlight %}` in > the markdown. We can discuss the syntax for marker comments. > Sub-tasks are created to move example code from user guide to `examples/`.
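To make the proposal concrete, the extraction step could look roughly like the following. This is a hedged sketch in Python rather than the eventual Jekyll/Ruby tag, and the marker-comment strings ("// example on" / "// example off") are hypothetical, since the ticket explicitly leaves the marker syntax open for discussion.

```python
def extract_example(lines, on="// example on", off="// example off"):
    """Keep only the lines between hypothetical marker comments,
    mimicking what an include_example tag would embed in the guide."""
    keep, out = False, []
    for line in lines:
        stripped = line.strip()
        if stripped == on:
            keep = True
        elif stripped == off:
            keep = False
        elif keep:
            out.append(line)
    return out

src = [
    "import org.apache.spark.ml.clustering.KMeans",
    "// example on",
    "val kmeans = new KMeans().setK(2)",
    "val model = kmeans.fit(dataset)",
    "// example off",
    "// test harness boilerplate, not shown in the guide",
]
assert extract_example(src) == [
    "val kmeans = new KMeans().setK(2)",
    "val model = kmeans.fit(dataset)",
]
```

The payoff is that the extracted lines come from a file Jenkins compiles, so the guide can never drift out of sync with code that actually builds.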
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978607#comment-14978607 ] swetha k commented on SPARK-3655: - [~koert] The final output for this RDD is RDD[(String, List[(Long, String)])]. But I call updateStateByKey on this RDD. Inside updateStateByKey, I process this list and put all the data in a single object, which gets merged with the old state for this session. After the updateStateByKey, I return objects for the session that represent the current batch and the merged batch. > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful.
[jira] [Created] (SPARK-11379) ExpressionEncoder can't handle top level primitive type correctly
Wenchen Fan created SPARK-11379: --- Summary: ExpressionEncoder can't handle top level primitive type correctly Key: SPARK-11379 URL: https://issues.apache.org/jira/browse/SPARK-11379 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978610#comment-14978610 ] swetha k commented on SPARK-3655: - [~koert] If I don't put the list as a materialized view in memory, what is the appropriate way to use Spark-Sorted to just group and sort the batch of JSONs based on the key (sessionId)?
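For context, the core idea behind secondary sort can be sketched in a few lines of plain Python. This mirrors the sort-then-group pattern the feature relies on, not the actual spark-sorted API: sort once by a composite (key, secondary) ordering, then group, so each key's values come back already sorted instead of being collected into a list and sorted in memory per key.

```python
from itertools import groupby
from operator import itemgetter

# Toy session events: (sessionId, (timestamp, action)).
events = [
    ("session-2", (1446000300, "click")),
    ("session-1", (1446000100, "login")),
    ("session-1", (1446000050, "open")),
    ("session-2", (1446000200, "open")),
]

# Sort by (sessionId, timestamp); groupby then yields each key's
# values as an already-sorted run, no per-key in-memory sort needed.
events.sort(key=lambda kv: (kv[0], kv[1][0]))
grouped = {k: [v for _, v in run]
           for k, run in groupby(events, key=itemgetter(0))}

assert grouped["session-1"] == [(1446000050, "open"), (1446000100, "login")]
assert grouped["session-2"] == [(1446000200, "open"), (1446000300, "click")]
```

In Spark the analogous trick pushes the secondary field into the shuffle key (repartition-and-sort), so values stream out of the shuffle sorted rather than being materialized as a list per key.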
[jira] [Created] (SPARK-11380) Replace example code in mllib-frequent-pattern-mining.md using include_example
Xiangrui Meng created SPARK-11380: - Summary: Replace example code in mllib-frequent-pattern-mining.md using include_example Key: SPARK-11380 URL: https://issues.apache.org/jira/browse/SPARK-11380 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng This is similar to SPARK-11289 but for the example code in mllib-frequent-pattern-mining.md.
[jira] [Commented] (SPARK-11337) Make example code in user guide testable
[ https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978614#comment-14978614 ] Xiangrui Meng commented on SPARK-11337: --- [~yinxusen] I created one sub-task (SPARK-11380) as a template. You can create more and remember to label them `starter`.
[jira] [Assigned] (SPARK-11379) ExpressionEncoder can't handle top level primitive type correctly
[ https://issues.apache.org/jira/browse/SPARK-11379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11379: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-11379) ExpressionEncoder can't handle top level primitive type correctly
[ https://issues.apache.org/jira/browse/SPARK-11379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978616#comment-14978616 ] Apache Spark commented on SPARK-11379: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/9337
[jira] [Commented] (SPARK-9836) Provide R-like summary statistics for ordinary least squares via normal equation solver
[ https://issues.apache.org/jira/browse/SPARK-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978615#comment-14978615 ] Xiangrui Meng commented on SPARK-9836: -- Sorry for the late response! There are more starter tasks coming out under SPARK-11337. Are you interested?
[jira] [Assigned] (SPARK-11379) ExpressionEncoder can't handle top level primitive type correctly
[ https://issues.apache.org/jira/browse/SPARK-11379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11379: Assignee: Apache Spark
[jira] [Updated] (SPARK-11369) SparkR glm should support setting standardize
[ https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11369: -- Target Version/s: 1.6.0 > SparkR glm should support setting standardize > - > > Key: SPARK-11369 > URL: https://issues.apache.org/jira/browse/SPARK-11369 > Project: Spark > Issue Type: Improvement > Components: ML, R >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 1.6.0 > > > SparkR glm currently supports: > formula, family = c(“gaussian”, “binomial”), data, lambda = 0, alpha = 0 > We should also support setting standardize, which has been defined in the [design > documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]
[jira] [Updated] (SPARK-11369) SparkR glm should support setting standardize
[ https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-11369:
----------------------------------

    Assignee: Yanbo Liang

> SparkR glm should support setting standardize
> ---------------------------------------------
>
>                 Key: SPARK-11369
>                 URL: https://issues.apache.org/jira/browse/SPARK-11369
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, R
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> SparkR glm currently supports:
> formula, family = c(“gaussian”, “binomial”), data, lambda = 0, alpha = 0
> We should also support setting standardize, which has been defined in the
> [design documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]
[jira] [Resolved] (SPARK-11369) SparkR glm should support setting standardize
[ https://issues.apache.org/jira/browse/SPARK-11369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-11369.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

Issue resolved by pull request 9331
[https://github.com/apache/spark/pull/9331]

> SparkR glm should support setting standardize
> ---------------------------------------------
>
>                 Key: SPARK-11369
>                 URL: https://issues.apache.org/jira/browse/SPARK-11369
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, R
>            Reporter: Yanbo Liang
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> SparkR glm currently supports:
> formula, family = c(“gaussian”, “binomial”), data, lambda = 0, alpha = 0
> We should also support setting standardize, which has been defined in the
> [design documentation|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit#heading=h.b2igi6cx7yf]
[jira] [Resolved] (SPARK-11367) Python LinearRegression should support setting solver
[ https://issues.apache.org/jira/browse/SPARK-11367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-11367.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

Issue resolved by pull request 9328
[https://github.com/apache/spark/pull/9328]

> Python LinearRegression should support setting solver
> -----------------------------------------------------
>
>                 Key: SPARK-11367
>                 URL: https://issues.apache.org/jira/browse/SPARK-11367
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>            Reporter: Yanbo Liang
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> SPARK-10668 provided the WeightedLeastSquares solver ("normal") for
> LinearRegression with L2 regularization in Scala and R; Python ML
> LinearRegression should also support setting solver ("auto", "normal",
> "l-bfgs").
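[Editor's note: a toy contrast of the two solver families in plain Python, not the PySpark API. "normal" solves the least-squares problem in closed form from the normal equations; "l-bfgs" stands in for iterative optimization, illustrated here with simple gradient descent on a one-feature, no-intercept problem. Both converge to the same coefficient; the trade-off is one pass over compact sufficient statistics versus repeated passes with no matrix solve.]

```python
# Closed-form normal-equation solve: b = (x'y) / (x'x) for one feature
# with no intercept.
def solve_normal(x, y):
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Iterative stand-in for the quasi-Newton path: gradient descent on the
# mean squared error of the same problem.
def solve_iterative(x, y, lr=0.01, steps=2000):
    b = 0.0
    for _ in range(steps):
        grad = sum(2 * xi * (b * xi - yi) for xi, yi in zip(x, y)) / len(x)
        b -= lr * grad
    return b

x, y = [1.0, 2.0, 3.0], [2.0, 4.1, 5.9]
b_normal = solve_normal(x, y)
b_iter = solve_iterative(x, y)
print(b_normal, b_iter)
```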