[jira] [Comment Edited] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412379#comment-16412379 ]

MIN-FU YANG edited comment on SPARK-22876 at 3/24/18 3:00 AM:
--------------------------------------------------------------

I found that the current YARN implementation doesn't expose the number of failed application attempts in its API; maybe this feature should be put on hold until that number is exposed.

was (Author: lucasmf):
I found that current Yarn implementation doesn't expose number of failed app attempts number in their API, maybe this feature should be pending until they expose this number.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> ----------------------------------------------------------------------
>
>                 Key: SPARK-22876
>                 URL: https://issues.apache.org/jira/browse/SPARK-22876
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.2.0
>        Environment: hadoop version 2.7.3
>            Reporter: Jinhan Zhong
>            Priority: Minor
>
> I assume we can use spark.yarn.maxAppAttempts together with
> spark.yarn.am.attemptFailuresValidityInterval to make a long-running
> application avoid stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n
> times (n is the minimum of spark.yarn.maxAppAttempts and
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup should allow the application master to fail
> 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found in ApplicationMaster.scala
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> that there is a ShutdownHook which checks the attempt id against
> maxAppAttempts: if attempt id >= maxAppAttempts, it will try to unregister
> the application and the application will finish.
> Is this an expected design or a bug?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
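The behaviour the reporter describes reduces to a min() across the places an attempt limit can be set. A minimal Python sketch of the effective limit as described in the report (the function name and structure are illustrative assumptions, not Spark's actual Scala code):

```python
# Hedged sketch of the reported behaviour: the effective AM attempt limit
# is the YARN resource manager's yarn.resourcemanager.am.max-attempts,
# capped further by spark.yarn.maxAppAttempts when that is set. Purely
# illustrative; Spark's real logic lives in ApplicationMaster.scala.
def effective_max_attempts(spark_max_app_attempts, rm_max_attempts):
    """Return the number of AM failures after which the app stops."""
    if spark_max_app_attempts is None:
        return rm_max_attempts
    return min(spark_max_app_attempts, rm_max_attempts)

# With the reporter's example settings, the resource manager's limit (3)
# wins over the client-side 20, so the application stops after 3 failures.
print(effective_max_attempts(20, 3))  # 3
```

Under this model, spark.yarn.am.attemptFailuresValidityInterval never enters the calculation, which matches the reported symptom.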
[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412379#comment-16412379 ]

MIN-FU YANG commented on SPARK-22876:
-------------------------------------

I found that the current YARN implementation doesn't expose the number of failed application attempts in its API; maybe this feature should be put on hold until that number is exposed.
[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly
[ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412228#comment-16412228 ]

MIN-FU YANG commented on SPARK-22876:
-------------------------------------

Hi, I have also encountered this problem. Please assign it to me; I can fix this issue.
[jira] [Commented] (SPARK-16280) Implement histogram_numeric SQL function
[ https://issues.apache.org/jira/browse/SPARK-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386787#comment-15386787 ]

MIN-FU YANG commented on SPARK-16280:
-------------------------------------

I implemented histogram_numeric according to Algorithms I & II in the paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849-872. After some benchmarking, I arrived at an optimized solution that is ready for review: https://github.com/apache/spark/pull/14129

> Implement histogram_numeric SQL function
> -----------------------------------------
>
>                 Key: SPARK-16280
>                 URL: https://issues.apache.org/jira/browse/SPARK-16280
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>
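For context, the core of the cited Ben-Haim & Tom-Tov streaming histogram is: keep at most B (centroid, count) bins, and whenever a new point pushes the bin count over B, merge the two bins whose centroids are closest. A toy Python sketch of that update step (illustrative only; Spark's histogram_numeric implementation is Scala and differs in detail):

```python
# Toy sketch of the Ben-Haim & Tom-Tov (2010) streaming histogram update:
# add each point as a singleton bin, then merge the two closest adjacent
# bins whenever the bin count exceeds the budget. Names are illustrative.
class StreamingHistogram:
    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of (centroid, count)

    def update(self, x):
        self.bins.append((x, 1))
        self.bins.sort(key=lambda b: b[0])
        if len(self.bins) > self.max_bins:
            # Find the adjacent pair with the smallest centroid gap.
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            # Merge into a count-weighted centroid, preserving total count.
            merged = ((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)
            self.bins[i:i + 2] = [merged]

h = StreamingHistogram(max_bins=5)
for v in [1, 2, 2, 3, 10, 11, 11, 12, 50, 51]:
    h.update(v)
print(len(h.bins))  # 5
```

Because merges are count-weighted, the total point count is preserved, which is what makes the sketch mergeable across partitions in a parallel aggregate.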
[jira] [Commented] (SPARK-16280) Implement histogram_numeric SQL function
[ https://issues.apache.org/jira/browse/SPARK-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369417#comment-15369417 ]

MIN-FU YANG commented on SPARK-16280:
-------------------------------------

I'll submit a PR in a few days.
[jira] [Commented] (SPARK-16282) Implement percentile SQL function
[ https://issues.apache.org/jira/browse/SPARK-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355228#comment-15355228 ]

MIN-FU YANG commented on SPARK-16282:
-------------------------------------

I can work on this, too.

> Implement percentile SQL function
> ----------------------------------
>
>                 Key: SPARK-16282
>                 URL: https://issues.apache.org/jira/browse/SPARK-16282
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>
[jira] [Commented] (SPARK-16280) Implement histogram_numeric SQL function
[ https://issues.apache.org/jira/browse/SPARK-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355220#comment-15355220 ]

MIN-FU YANG commented on SPARK-16280:
-------------------------------------

I can work on this.
[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349855#comment-15349855 ]

MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 11:23 PM:
---------------------------------------------------------------

Hi [~holdenk], [~rowan.chatta...@googlemail.com], I found a test case in DataTypeSuite.scala indicating that merging keys with different types is banned. I think it's intended behaviour; is that right? In my opinion, though, keys of int and long type could be merged.

was (Author: lucasmf):
Hi, [~holdenk], [~rowan.chatta...@googlemail.com] I found a test case in DataTypeSuite.scala indicating the banning on merge on key with different types. I think it's designated behaviour. Is that right?

> Schema merging in driver fails for parquet when merging LongType and
> IntegerType
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15516
>                 URL: https://issues.apache.org/jira/browse/SPARK-15516
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>        Environment: Databricks
>            Reporter: Hossein Falaki
>
> I tried to create a table from partitioned parquet directories that requires
> schema merging. I get the following error:
> {code}
> at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to merge incompatible data
> types LongType and IntegerType
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462)
> at org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420)
> at org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418)
> at org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415)
> at org.apache.spark.sql.types.StructType.merge(StructType.scala:333)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829)
> {code}
> cc @rxin and [~mengxr]
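The suggestion that int and long keys could be merged amounts to a widening rule for integral types. A toy Python sketch of such a rule (hypothetical names throughout; Spark's StructType.merge is Scala and, per this report, rejects LongType vs IntegerType instead of widening):

```python
# Hypothetical widening rule for merging two integral Spark SQL types:
# pick the one with the larger default size. Illustrative only; the real
# StructType.merge throws "Failed to merge incompatible data types" here.
WIDTHS = {"ByteType": 1, "ShortType": 2, "IntegerType": 4, "LongType": 8}

def merge_integral(left, right):
    """Return the wider of two integral Spark SQL type names."""
    if left not in WIDTHS or right not in WIDTHS:
        raise ValueError(f"cannot merge {left} and {right}")
    return left if WIDTHS[left] >= WIDTHS[right] else right

print(merge_integral("LongType", "IntegerType"))  # LongType
```

Widening resolves the schema-level conflict, but as the later comments in this thread show, the vectorized Parquet reader must also cope with a column whose merged type is wider than its on-disk type.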
[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349855#comment-15349855 ]

MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 11:07 PM:
---------------------------------------------------------------

Hi [~holdenk], [~rowan.chatta...@googlemail.com], I found a test case in DataTypeSuite.scala indicating that merging keys with different types is banned. I think it's intended behaviour. Is that right?

was (Author: lucasmf):
[~holdenk], I found a test case in DataTypeSuite.scala indicating the banning on merge on key with different types. I think it's designated behaviour. Is that right?
[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349855#comment-15349855 ]

MIN-FU YANG commented on SPARK-15516:
-------------------------------------

[~holdenk], I found a test case in DataTypeSuite.scala indicating that merging keys with different types is banned. I think it's intended behaviour. Is that right?
[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349520#comment-15349520 ]

MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 9:11 AM:
--------------------------------------------------------------

[~holdenk] I am wondering whether banning schema merges on keys with different types is intended behaviour. I added the following code to the StructType.merge function:

{code:title=StructType.scala#456-462}
case (leftNumeric: NumericType, rightNumeric: NumericType) =>
  (leftNumeric, rightNumeric) match {
    case (leftIntegral: IntegralType, rightIntegral: IntegralType) =>
      Seq(leftIntegral, rightIntegral).maxBy(_.defaultSize)
    case (leftFractional: FractionalType, rightFractional: FractionalType) =>
      Seq(leftFractional, rightFractional).maxBy(_.defaultSize)
  }
{code}

With this change the tempView can be created successfully, but querying records in the tempView throws:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:260)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:273)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:252)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:251)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:251)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:257)
{code}

Modifying the getLong function in OnHeapColumnVector as follows solves the problem:

{code}
@Override
public long getLong(int rowId) {
  if (dictionary == null) {
    if (longData != null) return longData[rowId];
    else return intData[rowId];
  } else {
    return dictionary.decodeToLong(dictionaryIds.getInt(rowId));
  }
}
{code}

But I am afraid that is not the intended behaviour.
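The NullPointerException above comes from the on-heap column vector allocating only the backing array that matches the column's physical type, so a generated getLong call on an int-backed vector dereferences a null longData. A toy Python model of that situation and the proposed fallback (illustrative only; the real class is the Java OnHeapColumnVector, and attribute names here are simplifications):

```python
# Toy model of a column vector that holds either int or long data.
# Illustrates why getLong NPEs when only the int array was allocated,
# and the widening fallback proposed in the comment above. Purely
# illustrative; not Spark's actual implementation.
class ToyColumnVector:
    def __init__(self, int_data=None, long_data=None):
        self.int_data = int_data    # backing array for an IntegerType column
        self.long_data = long_data  # backing array for a LongType column

    def get_long(self, row_id):
        if self.long_data is not None:
            return self.long_data[row_id]
        # Fallback: widen from the int backing array. The un-patched code
        # would fail here (null longData) instead of falling back.
        if self.int_data is not None:
            return self.int_data[row_id]
        raise RuntimeError("no backing data allocated")

v = ToyColumnVector(int_data=[1, 2, 3])
print(v.get_long(2))  # 3
```

The fallback works for reads, but as the commenter notes, patching the accessor this way may paper over the real issue: the merged schema says LongType while the physical Parquet column is still 32-bit.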
[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349520#comment-15349520 ]

MIN-FU YANG commented on SPARK-15516:
-------------------------------------

[~holdenk] I am wondering whether banning schema merges on keys with different types is intended behaviour. I added the following code to the StructType.merge function:

{code:title=StructType.scala#456-462}
case (leftNumeric: NumericType, rightNumeric: NumericType) =>
  (leftNumeric, rightNumeric) match {
    case (leftIntegral: IntegralType, rightIntegral: IntegralType) =>
      Seq(leftIntegral, rightIntegral).maxBy(_.defaultSize)
    case (leftFractional: FractionalType, rightFractional: FractionalType) =>
      Seq(leftFractional, rightFractional).maxBy(_.defaultSize)
  }
{code}

With this change the tempView can be created successfully, but querying records in the tempView throws:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:260)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:273)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:252)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:251)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:251)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:257)
{code}

Modifying the getLong function in OnHeapColumnVector as follows solves the problem:

{code}
@Override
public long getLong(int rowId) {
  if (dictionary == null) {
    if (longData != null) return longData[rowId];
    else return intData[rowId];
  } else {
    return dictionary.decodeToLong(dictionaryIds.getInt(rowId));
  }
}
{code}

But I am afraid that is not the intended behaviour.
[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347465#comment-15347465 ]

MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 8:35 AM:
--------------------------------------------------------------

OK, I've reproduced the problem with the following code:

{code}
import org.apache.spark.sql.SparkSession

object SPARK_15516 {
  def main(args: Array[String]) {
    val sqlContext = SparkSession.builder().master("local").appName("Spark-15516").getOrCreate()
    val sc = sqlContext.sparkContext
    import sqlContext.implicits._

    val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
    df1.write.parquet("data/test_table/key=1")

    val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000L, i)).toDF("single", "triple")
    df2.write.parquet("data/test_table/key=2")

    val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
    df3.createOrReplaceTempView("df3")
    println(sqlContext.sql("SELECT * FROM df3").collect().mkString(","))
  }
}
{code}

I'll look into it.
> Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types 
LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at
[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347465#comment-15347465 ] MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 8:35 AM: -- OK, I've reproduced the problem with the following code: {code:scala} import org.apache.spark.sql.SparkSession object SPARK_15516 { def main(args: Array[String]) { val sqlContext = SparkSession.builder().master("local").appName("Spark-15516").getOrCreate() val sc = sqlContext.sparkContext import sqlContext.implicits._ val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double") df1.write.parquet("data/test_table/key=1") val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000L, i)). toDF("single", "triple") df2.write.parquet("data/test_table/key=2") val df3 = sqlContext.read.option("mergeSchema", "true"). parquet("data/test_table") df3.createOrReplaceTempView("df3") println(sqlContext.sql("SELECT * FROM df3"). collect().mkString(",")) } } {code} I'll look into it. was (Author: lucasmf): OK, I've reproduced the problem with the following code: ``` import sqlContext.implicits._ val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double") df1.write.parquet("data/test_table/key=1") val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000l, i)).toDF("single", "triple") df2.write.parquet("data/test_table/key=2") val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table") df3.registerTempTable("df3") sqlContext.sql("CREATE TABLE new_key_value_store LOCATION '/Users/lucasmf/data/new_key_value_store' AS select * from df3"); sqlContext.sql("SELECT * FROM new_key_value_store") ``` I'll look into it. 
> Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types 
LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at >
[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347465#comment-15347465 ] MIN-FU YANG commented on SPARK-15516: - OK, I've reproduced the problem with the following code: ``` import sqlContext.implicits._ val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double") df1.write.parquet("data/test_table/key=1") val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000l, i)).toDF("single", "triple") df2.write.parquet("data/test_table/key=2") val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table") df3.registerTempTable("df3") sqlContext.sql("CREATE TABLE new_key_value_store LOCATION '/Users/lucasmf/data/new_key_value_store' AS select * from df3"); sqlContext.sql("SELECT * FROM new_key_value_store") ``` I'll look into it. > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. 
I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > 
org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
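The failure in this thread comes from schema merging refusing to reconcile the same column written as IntegerType in one partition and LongType in another. As a simplified, hypothetical model of that conflict (plain Python dictionaries, not Spark's actual StructType.merge implementation), using the column names from the repro:

```python
# Each partition directory carries its own Parquet schema; mergeSchema tries
# to combine them. This toy merge fails on any type conflict, mirroring the
# "Failed to merge incompatible data types" error above. Illustrative only.

def merge_schemas(left, right):
    """Merge two {column: type} schemas, failing on conflicting types."""
    merged = dict(left)
    for col, dtype in right.items():
        if col in merged and merged[col] != dtype:
            raise ValueError(
                f"Failed to merge incompatible data types {merged[col]} and {dtype}")
        merged[col] = dtype
    return merged

schema1 = {"single": "IntegerType", "double": "IntegerType"}  # from key=1
schema2 = {"single": "LongType", "triple": "IntegerType"}     # from key=2

try:
    merge_schemas(schema1, schema2)
except ValueError as e:
    print(e)  # prints: Failed to merge incompatible data types IntegerType and LongType
```

A write-side workaround (not from this thread, just a common practice) is to cast columns to the wider type before writing each partition, e.g. `df1.select(col("single").cast("long"), ...)`, so all partitions agree on LongType.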
[jira] [Issue Comment Deleted] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MIN-FU YANG updated SPARK-15516: Comment: was deleted (was: I would like to look into this issue) > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > 
at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347428#comment-15347428 ] MIN-FU YANG commented on SPARK-15516: - Do you have any sample code? > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > 
at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343791#comment-15343791 ] MIN-FU YANG commented on SPARK-15516: - I would like to look into this issue > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`
MIN-FU YANG created SPARK-16124: --- Summary: Throws exception when executing query on `build/sbt hive/console` Key: SPARK-16124 URL: https://issues.apache.org/jira/browse/SPARK-16124 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: MIN-FU YANG Priority: Minor When I execute `val query = sql("SELECT * FROM src WHERE key = 92 ")` on the Hive console launched from `build/sbt hive/console`, it throws an exception: org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt; at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128) at org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192) at org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) at org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376) at org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) at org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462) at scala.collection.immutable.List.foreach(List.scala:381) at 
org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462) at org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682) ... 42 elided -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343121#comment-15343121 ] MIN-FU YANG commented on SPARK-14172: - I cannot reproduce it on 1.6.1 either. Could you give a more detailed description? > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the Hive SQL contains nondeterministic fields, the Spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider the following query, which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The Spark plan will not push down the partition predicate to the HiveTableScan, > which ends up scanning all partition data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343121#comment-15343121 ] MIN-FU YANG edited comment on SPARK-14172 at 6/22/16 12:48 AM: --- I cannot reproduce it on 1.6.1 either. Could you give a more detailed description or verify it? was (Author: lucasmf): I cannot reproduce it on 1.6.1 either. Could you give a more detailed description? > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the Hive SQL contains nondeterministic fields, the Spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider the following query, which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The Spark plan will not push down the partition predicate to the HiveTableScan, > which ends up scanning all partition data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342917#comment-15342917 ] MIN-FU YANG commented on SPARK-14172: - Hi, I cannot reproduce the problem on the master branch. Maybe it has been resolved in a newer version. I'll try to reproduce it on 1.6.1 > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the Hive SQL contains nondeterministic fields, the Spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider the following query, which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The Spark plan will not push down the partition predicate to the HiveTableScan, > which ends up scanning all partition data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
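The pushdown behavior discussed in this thread hinges on splitting WHERE conjuncts: a conjunct containing a nondeterministic function such as rand() cannot be pushed below the scan without changing how often it is evaluated, while deterministic partition predicates can. A toy sketch of that split (plain substring matching, not Spark's Catalyst expression analysis; all names here are illustrative):

```python
# Only deterministic conjuncts of the WHERE clause are safe to push down to
# the partition scan; conjuncts referencing nondeterministic functions must
# stay above it. Simplified model of the optimizer decision, not Spark code.

NONDETERMINISTIC = ("rand(", "randn(", "uuid(")  # assumed function list

def split_conjuncts(conjuncts):
    """Split WHERE conjuncts into scan-pushable and retained-above-scan."""
    pushable = [c for c in conjuncts if not any(f in c for f in NONDETERMINISTIC)]
    retained = [c for c in conjuncts if any(f in c for f in NONDETERMINISTIC)]
    return pushable, retained

where = ["partition_col = 'some_value'", "rand() < 0.01"]
pushable, retained = split_conjuncts(where)
print(pushable)  # the partition predicate -- safe for the partition scan
print(retained)  # the rand() sampler -- evaluated after the scan
```

Until the optimizer performs this split itself, one workaround users could try is nesting the deterministic predicate in a subquery, e.g. `SELECT * FROM (SELECT * FROM table_a WHERE partition_col = 'some_value') t WHERE rand() < 0.01`, so partition pruning should happen before sampling.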
[jira] [Commented] (SPARK-15326) Doing multiple unions on a Dataframe will result in a very inefficient query plan
[ https://issues.apache.org/jira/browse/SPARK-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342850#comment-15342850 ] MIN-FU YANG commented on SPARK-15326: - I'll look into it. > Doing multiple unions on a Dataframe will result in a very inefficient query > plan > - > > Key: SPARK-15326 > URL: https://issues.apache.org/jira/browse/SPARK-15326 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Jurriaan Pruis > Attachments: Query Plan.pdf, skewed_join.py, skewed_join_plan.txt > > > While working with a very skewed dataset I noticed that repeated unions on a > dataframe will result in a query plan with 2^(union) - 1 unions. With large > datasets this will be very inefficient. > I tried to replicate this behaviour using a PySpark example (I've attached > the output of the explain() to this JIRA): > {code} > df = sqlCtx.range(1000) > def r(name, max_val=100): > return F.round(F.lit(max_val) * F.pow(F.rand(), > 4)).cast('integer').alias(name) > # Create a skewed dataset > skewed = df.select('id', r('a'), r('b'), r('c'), r('d'), r('e'), r('f')) > # Find the skewed values in the dataset > top_10_percent = skewed.freqItems(['a', 'b', 'c', 'd', 'e', 'f'], > 0.10).collect()[0] > def skewjoin(skewed, right, column, freqItems): > freqItems = freqItems[column + '_freqItems'] > skewed = skewed.alias('skewed') > cond = F.col(column).isin(freqItems) > # First broadcast join the frequent (skewed) values > filtered = skewed.filter(cond).join(F.broadcast(right.filter(cond)), > column, 'left_outer') > # Use a regular join for the non skewed values (with big tables this will > use a SortMergeJoin) > non_skewed = skewed.filter(cond == False).join(right.filter(cond == > False), column, 'left_outer') > # join them together and replace the column with the column found in the > right DataFrame > return filtered.unionAll(non_skewed).select('skewed.*', > right['id'].alias(column + '_key')).drop(column) > # Create the 
dataframes that will be joined to the skewed dataframe > right_size = 100 > df_a = sqlCtx.range(right_size).select('id', F.col('id').alias('a')) > df_b = sqlCtx.range(right_size).select('id', F.col('id').alias('b')) > df_c = sqlCtx.range(right_size).select('id', F.col('id').alias('c')) > df_d = sqlCtx.range(right_size).select('id', F.col('id').alias('d')) > df_e = sqlCtx.range(right_size).select('id', F.col('id').alias('e')) > df_f = sqlCtx.range(right_size).select('id', F.col('id').alias('f')) > # Join everything together > df = skewed > df = skewjoin(df, df_a, 'a', top_10_percent) > df = skewjoin(df, df_b, 'b', top_10_percent) > df = skewjoin(df, df_c, 'c', top_10_percent) > df = skewjoin(df, df_d, 'd', top_10_percent) > df = skewjoin(df, df_e, 'e', top_10_percent) > df = skewjoin(df, df_f, 'f', top_10_percent) > # df.explain() shows the plan where it does 63 unions > (2^(number_of_skewjoins) - 1) > # which will be very inefficient and slow > df.explain(True) > # Evaluate the plan > # You'd expect this to return 1000, but it does not, it returned 1140 > on my system > # (probably because it will recalculate the random columns? Not sure though) > print(df.count()) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
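The 63-union plan reported above follows directly from each skewjoin() referencing its input DataFrame twice (once filtered, once non-filtered) and then unioning the two branches: the union count satisfies u(n) = 2·u(n−1) + 1, i.e. 2^n − 1 for n skewjoins. A tiny plan-tree model (hypothetical classes, not Spark's Catalyst nodes) reproduces the count:

```python
# Each skewjoin builds unionAll(filter(df), filter(df)), embedding the
# previous plan twice, so union nodes double-plus-one at every step.

class Scan:
    children = ()

class Union:
    def __init__(self, left, right):
        self.children = (left, right)

def count_unions(node):
    """Count Union nodes, re-walking shared subtrees as the planner would."""
    return isinstance(node, Union) + sum(count_unions(c) for c in node.children)

plan = Scan()
for _ in range(6):           # six skewjoin() calls, as in the repro
    plan = Union(plan, plan)  # current plan is referenced twice

print(count_unions(plan))    # prints: 63  (== 2**6 - 1)
```

Persisting or checkpointing the intermediate DataFrame between skewjoins would break the re-expansion at each step, which is one plausible mitigation for the inefficiency described here.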
[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MIN-FU YANG updated SPARK-15906:
--------------------------------
    Description: 
Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2: http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Mahout & WEKA both have Complementary Naive Bayes implementations:
https://mahout.apache.org/users/classification/bayesian.html
http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html

Besides, this paper is referenced by other papers & books 600+ times, so I think its results are solid.
https://scholar.google.com.tw/scholar?rlz=1C5CHFA_enTW567TW567=high=1=UTF-8=1197073324019480518

  was:
Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2: http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Mahout & WEKA both have Complementary Naive Bayes implementations:
https://mahout.apache.org/users/classification/bayesian.html
http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html

Besides, this paper is referenced by other papers & books 600+ times, so I think its results are solid.

> Complementary Naive Bayes Algorithm Implementation
> --------------------------------------------------
>
>                 Key: SPARK-15906
>                 URL: https://issues.apache.org/jira/browse/SPARK-15906
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: MIN-FU YANG
>            Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor
> Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
> Mahout & WEKA both have Complementary Naive Bayes implementations:
> https://mahout.apache.org/users/classification/bayesian.html
> http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html
> Besides, this paper is referenced by other papers & books 600+ times, so I think
> its results are solid.
> https://scholar.google.com.tw/scholar?rlz=1C5CHFA_enTW567TW567=high=1=UTF-8=1197073324019480518
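[Editor's note: for readers comparing the existing implementations, the core of Complement Naive Bayes from the referenced paper is to estimate each class's term distribution from the *complement* of that class (all documents NOT in it), which is much better populated for rare classes in skewed data, then assign a document to the class whose complement fits it worst. A minimal NumPy sketch of that idea, ignoring the class-prior term and the paper's extra weight normalization; the function names (`cnb_weights`, `cnb_predict`) and the smoothing default are illustrative assumptions, not Spark MLlib or WEKA API:]

```python
import numpy as np


def cnb_weights(X, y, alpha=1.0):
    """Complement Naive Bayes log-weights (Rennie et al. 2003, section 3.2).

    X: (n_docs, n_features) term-count matrix; y: (n_docs,) integer labels.
    Returns a (n_classes, n_features) array of log complement probabilities.
    """
    classes = np.unique(y)
    w = np.empty((len(classes), X.shape[1]))
    for k, c in enumerate(classes):
        comp = X[y != c]                   # all documents NOT in class c
        counts = comp.sum(axis=0) + alpha  # smoothed complement term counts
        w[k] = np.log(counts / counts.sum())
    return w


def cnb_predict(X, w):
    # assign each document to the class whose COMPLEMENT explains it worst
    return np.argmin(X @ w.T, axis=1)
```

Usage on a toy skewed-vocabulary example: fit `cnb_weights` on a small count matrix where feature 0 marks class 0 and feature 1 marks class 1, and `cnb_predict` recovers the expected labels.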
[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MIN-FU YANG updated SPARK-15906:
--------------------------------
    Description: 
Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2: http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Mahout & WEKA both have Complementary Naive Bayes implementations:
https://mahout.apache.org/users/classification/bayesian.html
http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html

Besides, this paper is referenced by other papers & books 600+ times, so I think its results are solid.

  was:
Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2: http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

> Complementary Naive Bayes Algorithm Implementation
> --------------------------------------------------
>
>                 Key: SPARK-15906
>                 URL: https://issues.apache.org/jira/browse/SPARK-15906
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: MIN-FU YANG
>            Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor
> Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
> Mahout & WEKA both have Complementary Naive Bayes implementations:
> https://mahout.apache.org/users/classification/bayesian.html
> http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html
> Besides, this paper is referenced by other papers & books 600+ times, so I think
> its results are solid.
[jira] [Commented] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333098#comment-15333098 ]

MIN-FU YANG commented on SPARK-15906:
-------------------------------------

OK, the description is updated.

> Complementary Naive Bayes Algorithm Implementation
> --------------------------------------------------
>
>                 Key: SPARK-15906
>                 URL: https://issues.apache.org/jira/browse/SPARK-15906
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: MIN-FU YANG
>            Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor
> Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
> Mahout & WEKA both have Complementary Naive Bayes implementations:
> https://mahout.apache.org/users/classification/bayesian.html
> http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html
> Besides, this paper is referenced by other papers & books 600+ times, so I think
> its results are solid.
[jira] [Commented] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326685#comment-15326685 ]

MIN-FU YANG commented on SPARK-15906:
-------------------------------------

tilumi has a PR for this issue: https://github.com/apache/spark/pull/13627

> Complementary Naive Bayes Algorithm Implementation
> --------------------------------------------------
>
>                 Key: SPARK-15906
>                 URL: https://issues.apache.org/jira/browse/SPARK-15906
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: MIN-FU YANG
>
> Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor
> Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
[jira] [Issue Comment Deleted] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
[ https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

MIN-FU YANG updated SPARK-15906:
--------------------------------
    Comment: was deleted

(was: tilumi has pr: https://github.com/apache/spark/pull/13627 for this issue)

> Complementary Naive Bayes Algorithm Implementation
> --------------------------------------------------
>
>                 Key: SPARK-15906
>                 URL: https://issues.apache.org/jira/browse/SPARK-15906
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: MIN-FU YANG
>
> Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor
> Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
[jira] [Created] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation
MIN-FU YANG created SPARK-15906:
-----------------------------------

             Summary: Complementary Naive Bayes Algorithm Implementation
                 Key: SPARK-15906
                 URL: https://issues.apache.org/jira/browse/SPARK-15906
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
            Reporter: MIN-FU YANG

Improve the Naive Bayes algorithm on skewed data according to "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2: http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf