[jira] [Comment Edited] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2018-03-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412379#comment-16412379
 ] 

MIN-FU YANG edited comment on SPARK-22876 at 3/24/18 3:00 AM:
--

I found that the current YARN implementation doesn't expose the number of failed 
application attempts in its API, so this feature may have to wait until YARN 
exposes that number.
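
For what it's worth, a minimal sketch of the check a fix would need. The 
failedAttemptsWithin function below is an invented stand-in, since YARN does not 
currently expose a per-window failure count:
{code}
// Hypothetical sketch only: failedAttemptsWithin() is invented for illustration;
// YARN's API does not provide it today. The idea is that only failures inside
// the validity interval should count against spark.yarn.maxAppAttempts.
def shouldGiveUp(
    maxAppAttempts: Int,
    validityIntervalMs: Long,
    failedAttemptsWithin: Long => Int): Boolean = {
  val recentFailures = failedAttemptsWithin(validityIntervalMs)
  recentFailures >= maxAppAttempts
}
{code}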


was (Author: lucasmf):
I found that current Yarn implementation doesn't expose number of failed app 
attempts number in their API, maybe this feature should be pending until they 
expose this number.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to keep a long-running 
> application from stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (where n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client's yarn-site.xml).
> For example, the following setup should allow the application master to fail 
> 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found in ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> that there is a ShutdownHook that checks the attempt id against maxAppAttempts: 
> if attempt id >= maxAppAttempts, it tries to unregister the application, and 
> the application finishes.
> Is this the expected design, or a bug?
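
For readers following along, a minimal sketch of the check described above 
(names simplified; this is not the literal Spark source):
{code}
// Simplified sketch of the shutdown-hook check described in the report. The
// attempt id is compared directly against the resolved maximum, so
// spark.yarn.am.attemptFailuresValidityInterval never enters into it.
def shouldUnregisterOnShutdown(attemptId: Int, maxAppAttempts: Int): Boolean =
  attemptId >= maxAppAttempts
{code}
Because the attempt id only ever grows, old failures are never "forgiven", which 
matches the observed behaviour.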






[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2018-03-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412379#comment-16412379
 ] 

MIN-FU YANG commented on SPARK-22876:
-

I found that the current YARN implementation doesn't expose the number of failed 
application attempts in its API, so this feature may have to wait until YARN 
exposes that number.







[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2018-03-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412228#comment-16412228
 ] 

MIN-FU YANG commented on SPARK-22876:
-

Hi, I have also encountered this problem.

Please assign it to me; I can fix this issue.







[jira] [Commented] (SPARK-16280) Implement histogram_numeric SQL function

2016-07-20 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386787#comment-15386787
 ] 

MIN-FU YANG commented on SPARK-16280:
-

I implemented histogram_numeric according to Algorithms I and II in the paper: 
Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", 
J. Machine Learning Research 11 (2010), pp. 849-872.

After some benchmarking, I arrived at an optimized solution that is ready for 
review: https://github.com/apache/spark/pull/14129
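
For context, a compact, illustrative sketch of the paper's core update step 
(Algorithm I): keep at most maxBins (centroid, count) bins, and when the budget 
is exceeded, merge the two bins with the closest centroids. This is only a 
sketch of the algorithm; the PR above contains the actual implementation:
{code}
// Illustrative sketch of the streaming histogram update (Ben-Haim & Tom-Tov,
// Algorithm I); not the code in the PR.
case class Bin(centroid: Double, count: Long)

def update(bins: Vector[Bin], x: Double, maxBins: Int): Vector[Bin] = {
  // Insert the new point as a singleton bin, keeping bins sorted by centroid.
  val withX = (bins :+ Bin(x, 1L)).sortBy(_.centroid)
  if (withX.size <= maxBins) withX
  else {
    // Find the pair of adjacent bins with the smallest centroid gap...
    val i = withX.indices.dropRight(1)
      .minBy(j => withX(j + 1).centroid - withX(j).centroid)
    val (a, b) = (withX(i), withX(i + 1))
    // ...and replace them with their count-weighted average.
    val merged = Bin(
      (a.centroid * a.count + b.centroid * b.count) / (a.count + b.count),
      a.count + b.count)
    withX.patch(i, Seq(merged), 2)
  }
}
{code}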

> Implement histogram_numeric SQL function
> 
>
> Key: SPARK-16280
> URL: https://issues.apache.org/jira/browse/SPARK-16280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-16280) Implement histogram_numeric SQL function

2016-07-09 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15369417#comment-15369417
 ] 

MIN-FU YANG commented on SPARK-16280:
-

I'll submit a PR in a few days.








[jira] [Commented] (SPARK-16282) Implement percentile SQL function

2016-06-29 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355228#comment-15355228
 ] 

MIN-FU YANG commented on SPARK-16282:
-

I can work on this, too.

> Implement percentile SQL function
> -
>
> Key: SPARK-16282
> URL: https://issues.apache.org/jira/browse/SPARK-16282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-16280) Implement histogram_numeric SQL function

2016-06-29 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355220#comment-15355220
 ] 

MIN-FU YANG commented on SPARK-16280:
-

I can work on this.








[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-25 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349855#comment-15349855
 ] 

MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 11:23 PM:
---

Hi [~holdenk], [~rowan.chatta...@googlemail.com],
I found a test case in DataTypeSuite.scala indicating that merging keys with 
different types is deliberately banned.
I think this is intended behaviour. Is that right? 
But in my opinion, keys with int and long types could be merged (sketched below).
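
To make the suggestion concrete, an illustrative sketch of what I would expect 
(my expectation, not current behaviour):
{code}
import org.apache.spark.sql.types._

// Today, merging these two schemas throws:
//   SparkException: Failed to merge incompatible data types LongType and IntegerType
val left  = StructType(Seq(StructField("single", IntegerType)))
val right = StructType(Seq(StructField("single", LongType)))

// Hypothetical expectation: widen to the larger integral type instead.
val expected = StructType(Seq(StructField("single", LongType)))
{code}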



was (Author: lucasmf):
Hi, [~holdenk], [~rowan.chatta...@googlemail.com]
I found a test case in DataTypeSuite.scala indicating the banning on merge on 
key with  different types.
I think it's designated behaviour. Is that right?

> Schema merging in driver fails for parquet when merging LongType and 
> IntegerType
> 
>
> Key: SPARK-15516
> URL: https://issues.apache.org/jira/browse/SPARK-15516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Databricks
>Reporter: Hossein Falaki
>
> I tried to create a table from partitioned Parquet directories that requires 
> schema merging. I get the following error:
> {code}
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to merge incompatible data 
> types LongType and IntegerType
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415)
> at org.apache.spark.sql.types.StructType.merge(StructType.scala:333)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829)
> {code}
> cc @rxin and [~mengxr]






[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-25 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349855#comment-15349855
 ] 

MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 11:07 PM:
---

Hi [~holdenk], [~rowan.chatta...@googlemail.com],
I found a test case in DataTypeSuite.scala indicating that merging keys with 
different types is deliberately banned.
I think this is intended behaviour. Is that right?


was (Author: lucasmf):
[~holdenk], I found a test case in DataTypeSuite.scala indicating the banning 
on merge on key with  different types.
I think it's designated behaviour. Is that right?







[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-25 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349855#comment-15349855
 ] 

MIN-FU YANG commented on SPARK-15516:
-

[~holdenk], I found a test case in DataTypeSuite.scala indicating that merging 
keys with different types is deliberately banned.
I think this is intended behaviour. Is that right?







[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-25 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349520#comment-15349520
 ] 

MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 9:11 AM:
--

[~holdenk] I am wondering whether banning schema merges on keys of different 
types is an intended feature. 
I added the following code
{code:title=StructType.scala#456-462}
case (leftNumeric: NumericType, rightNumeric: NumericType) =>
  (leftNumeric, rightNumeric) match {
    case (leftIntegral: IntegralType, rightIntegral: IntegralType) =>
      Seq(leftIntegral, rightIntegral).maxBy(_.defaultSize)
    case (leftFractional: FractionalType, rightFractional: FractionalType) =>
      Seq(leftFractional, rightFractional).maxBy(_.defaultSize)
  }
{code}
to the StructType.merge function, and then the temp view can be created 
successfully.
But when I query records in the temp view, it throws:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 
3, localhost): org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:260)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:273)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:251)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:251)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:257)
{code}

That said, modifying the getLong function in OnHeapColumnVector to
{code}
  @Override
  public long getLong(int rowId) {
    if (dictionary == null) {
      // When the merged schema widened this column from IntegerType to
      // LongType, longData may never have been allocated; fall back to the
      // int-backed storage instead of hitting a NullPointerException.
      if (longData != null) {
        return longData[rowId];
      } else {
        return intData[rowId];
      }
    } else {
      return dictionary.decodeToLong(dictionaryIds.getInt(rowId));
    }
  }
{code}
solves the problem.

But I am afraid this is not the intended behaviour.


was (Author: lucasmf):
[~holdenk] I am wondering the banning on schema merge on different type key is 
a designated feature. 
I added following code
{code:title=StructType.scala#456-462}
case (leftNumeric: NumericType, rightNumeric: NumericType) =>
(leftNumeric, rightNumeric) match {
  case (leftIntegral: IntegralType, rightIntegral: IntegralType) =>
Seq(leftIntegral, rightIntegral).maxBy(_.defaultSize)
  case (leftFractional: FractionalType, rightFractional: 
FractionalType) =>
Seq(leftFractional, rightFractional).maxBy(_.defaultSize)
 }
{code}
to StructType.merge function and the tempView can be created successfully.
Then I query records in the tempView, it throws exceptions
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 
3, localhost): org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:260)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)

[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-25 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15349520#comment-15349520
 ] 

MIN-FU YANG commented on SPARK-15516:
-

[~holdenk] I am wondering whether banning schema merges on keys of different 
types is an intended feature. 
I added the following code
{code:title=StructType.scala#456-462}
case (leftNumeric: NumericType, rightNumeric: NumericType) =>
  (leftNumeric, rightNumeric) match {
    case (leftIntegral: IntegralType, rightIntegral: IntegralType) =>
      Seq(leftIntegral, rightIntegral).maxBy(_.defaultSize)
    case (leftFractional: FractionalType, rightFractional: FractionalType) =>
      Seq(leftFractional, rightFractional).maxBy(_.defaultSize)
  }
{code}
to the StructType.merge function, and then the temp view can be created 
successfully.
But when I query records in the temp view, it throws:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 
3, localhost): org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:260)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:273)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:252)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:251)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:251)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:257)
{code}

That said, modifying the getLong function in OnHeapColumnVector to
{code}
  @Override
  public long getLong(int rowId) {
    if (dictionary == null) {
      // When the merged schema widened this column from IntegerType to
      // LongType, longData may never have been allocated; fall back to the
      // int-backed storage instead of hitting a NullPointerException.
      if (longData != null) {
        return longData[rowId];
      } else {
        return intData[rowId];
      }
    } else {
      return dictionary.decodeToLong(dictionaryIds.getInt(rowId));
    }
  }
{code}
solves the problem.

But I am afraid this is not the intended behaviour.


[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-25 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347465#comment-15347465
 ] 

MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 8:35 AM:
--

OK, I've reproduced the problem with the following code:
{code}
import org.apache.spark.sql.SparkSession

object SPARK_15516 {

  def main(args: Array[String]) {
val sqlContext = 
SparkSession.builder().master("local").appName("Spark-15516").getOrCreate()
val sc = sqlContext.sparkContext
import sqlContext.implicits._
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")
val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000L, i)).
  toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")
val df3 = sqlContext.read.option("mergeSchema", "true").
  parquet("data/test_table")
df3.createOrReplaceTempView("df3")
println(sqlContext.sql("SELECT * FROM df3").
  collect().mkString(","))
  }
}
{code}

I'll look into it.


was (Author: lucasmf):
OK, I've reproduced the problem with following code:
{code:scala}
import org.apache.spark.sql.SparkSession

object SPARK_15516 {

  def main(args: Array[String]) {
val sqlContext = 
SparkSession.builder().master("local").appName("Spark-15516").getOrCreate()
val sc = sqlContext.sparkContext
import sqlContext.implicits._
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")
val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000L, i)).
  toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")
val df3 = sqlContext.read.option("mergeSchema", "true").
  parquet("data/test_table")
df3.createOrReplaceTempView("df3")
println(sqlContext.sql("SELECT * FROM df3").
  collect().mkString(","))
  }
}
{code}

I'll looking into it.


[jira] [Comment Edited] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-25 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347465#comment-15347465
 ] 

MIN-FU YANG edited comment on SPARK-15516 at 6/25/16 8:35 AM:
--

OK, I've reproduced the problem with the following code:
{code:scala}
import org.apache.spark.sql.SparkSession

object SPARK_15516 {

  def main(args: Array[String]) {
val sqlContext = 
SparkSession.builder().master("local").appName("Spark-15516").getOrCreate()
val sc = sqlContext.sparkContext
import sqlContext.implicits._
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")
val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000L, i)).
  toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")
val df3 = sqlContext.read.option("mergeSchema", "true").
  parquet("data/test_table")
df3.createOrReplaceTempView("df3")
println(sqlContext.sql("SELECT * FROM df3").
  collect().mkString(","))
  }
}
{code}

I'll look into it.


was (Author: lucasmf):
OK, I've reproduced the problem with following code:
```
import sqlContext.implicits._
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")
val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000l, i)).toDF("single", 
"triple")
df2.write.parquet("data/test_table/key=2")
val df3 = sqlContext.read.option("mergeSchema", 
"true").parquet("data/test_table")
df3.registerTempTable("df3")
sqlContext.sql("CREATE TABLE new_key_value_store LOCATION 
'/Users/lucasmf/data/new_key_value_store' AS select * from df3");
sqlContext.sql("SELECT * FROM new_key_value_store")
```

I'll looking into it.


[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347465#comment-15347465
 ] 

MIN-FU YANG commented on SPARK-15516:
-

OK, I've reproduced the problem with the following code:
```
import sqlContext.implicits._
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")
val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000L, i)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")
val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
df3.registerTempTable("df3")
sqlContext.sql("CREATE TABLE new_key_value_store LOCATION '/Users/lucasmf/data/new_key_value_store' AS select * from df3");
sqlContext.sql("SELECT * FROM new_key_value_store")
```

I'll look into it.







[jira] [Issue Comment Deleted] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-23 Thread MIN-FU YANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MIN-FU YANG updated SPARK-15516:

Comment: was deleted

(was: I would like to look into this issue)







[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-23 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347428#comment-15347428
 ] 

MIN-FU YANG commented on SPARK-15516:
-

Do you have any sample code?







[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType

2016-06-22 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343791#comment-15343791
 ] 

MIN-FU YANG commented on SPARK-15516:
-

I would like to look into this issue







[jira] [Created] (SPARK-16124) Throws exception when executing query on `build/sbt hive/console`

2016-06-21 Thread MIN-FU YANG (JIRA)
MIN-FU YANG created SPARK-16124:
---

 Summary: Throws exception when executing query on `build/sbt 
hive/console`
 Key: SPARK-16124
 URL: https://issues.apache.org/jira/browse/SPARK-16124
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: MIN-FU YANG
Priority: Minor


When I execute `val query = sql("SELECT * FROM src WHERE key = 92")` on the Hive 
console started via `build/sbt hive/console`, it throws an exception:

org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: file:/Users/xxx/git/spark/sql/hive/target/scala-2.11/spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt;
  at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:242)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
  at org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:128)
  at org.apache.spark.sql.hive.test.TestHiveSparkSession$SqlCmd$$anonfun$cmd$1.apply$mcV$sp(TestHive.scala:192)
  at org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376)
  at org.apache.spark.sql.hive.test.TestHiveSparkSession$$anonfun$loadTestTable$2.apply(TestHive.scala:376)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at org.apache.spark.sql.hive.test.TestHiveSparkSession.loadTestTable(TestHive.scala:376)
  at org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462)
  at org.apache.spark.sql.hive.test.TestHiveQueryExecution$$anonfun$analyzed$2.apply(TestHive.scala:462)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed$lzycompute(TestHive.scala:462)
  at org.apache.spark.sql.hive.test.TestHiveQueryExecution.analyzed(TestHive.scala:450)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
  ... 42 elided
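Note that the failing path ends in `...jar!/data/files/kv1.txt`: the test data is being 
resolved to a resource inside the assembled hive jar rather than to a file on disk. A 
hedged illustration of why LOAD DATA cannot consume such a path (run from the same 
console; assumes the resource is on its classpath):

{code}
// A "jar:...!/" URL names an entry inside a jar, not a local file, so Hive's
// LoadDataCommand cannot open it with ordinary file I/O.
val url = getClass.getResource("/data/files/kv1.txt")
println(url)  // e.g. jar:file:/.../spark-hive_2.11-2.0.0-SNAPSHOT.jar!/data/files/kv1.txt
println(new java.io.File(url.getPath).exists())  // false: the "!/" suffix is not a filesystem path
{code}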






[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-21 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343121#comment-15343121
 ] 

MIN-FU YANG commented on SPARK-14172:
-

I cannot reproduce it on 1.6.1 either. Could you give a more detailed description?

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the Hive SQL contains nondeterministic fields, the Spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan will not push down the partition predicate to the HiveTableScan, 
> which ends up scanning all partition data from the table.
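
A hedged workaround sketch (not from the original report): keep the deterministic 
partition predicate in an inner query so it can still be pushed into the 
HiveTableScan, and apply the nondeterministic sampling predicate on top. Table and 
column names are the ones from the example above:

{code}
// Partition pruning applies to the inner, deterministic filter;
// the rand() filter stays in the outer query and no longer blocks it.
val sampled = sqlContext.sql("""
  SELECT * FROM (
    SELECT * FROM table_a WHERE partition_col = 'some_value'
  ) t
  WHERE rand() < 0.01
""")
sampled.explain(true)  // the partition filter should now sit next to the table scan
{code}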






[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-21 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343121#comment-15343121
 ] 

MIN-FU YANG edited comment on SPARK-14172 at 6/22/16 12:48 AM:
---

I cannot reproduce it on 1.6.1 either. Could you give a more detailed description, 
or verify that it still occurs?


was (Author: lucasmf):
I cannot reproduce it on 1.6.1 either. Could you give more detailed description?

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the Hive SQL contains nondeterministic fields, the Spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan will not push down the partition predicate to the HiveTableScan, 
> which ends up scanning all partition data from the table.






[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2016-06-21 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342917#comment-15342917
 ] 

MIN-FU YANG commented on SPARK-14172:
-

Hi, I cannot reproduce the problem on the master branch. Maybe it has been resolved 
in a newer version. I'll try to reproduce it on 1.6.1.

> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the Hive SQL contains nondeterministic fields, the Spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan will not push down the partition predicate to the HiveTableScan, 
> which ends up scanning all partition data from the table.






[jira] [Commented] (SPARK-15326) Doing multiple unions on a Dataframe will result in a very inefficient query plan

2016-06-21 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342850#comment-15342850
 ] 

MIN-FU YANG commented on SPARK-15326:
-

I'll look into it.

> Doing multiple unions on a Dataframe will result in a very inefficient query 
> plan
> -
>
> Key: SPARK-15326
> URL: https://issues.apache.org/jira/browse/SPARK-15326
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Jurriaan Pruis
> Attachments: Query Plan.pdf, skewed_join.py, skewed_join_plan.txt
>
>
> While working with a very skewed dataset I noticed that repeated unions on a 
> DataFrame will result in a query plan with 2^(number of unions) - 1 unions. With 
> large datasets this will be very inefficient.
> I tried to replicate this behaviour using a PySpark example (I've attached 
> the output of explain() to this JIRA):
> {code}
> # Import added for completeness; the original snippet assumed F was already bound
> from pyspark.sql import functions as F
>
> df = sqlCtx.range(1000)
>
> def r(name, max_val=100):
>     return F.round(F.lit(max_val) * F.pow(F.rand(), 4)).cast('integer').alias(name)
>
> # Create a skewed dataset
> skewed = df.select('id', r('a'), r('b'), r('c'), r('d'), r('e'), r('f'))
>
> # Find the skewed values in the dataset
> top_10_percent = skewed.freqItems(['a', 'b', 'c', 'd', 'e', 'f'], 0.10).collect()[0]
>
> def skewjoin(skewed, right, column, freqItems):
>     freqItems = freqItems[column + '_freqItems']
>     skewed = skewed.alias('skewed')
>     cond = F.col(column).isin(freqItems)
>     # First broadcast join the frequent (skewed) values
>     filtered = skewed.filter(cond).join(F.broadcast(right.filter(cond)), column, 'left_outer')
>     # Use a regular join for the non-skewed values (with big tables this will use a SortMergeJoin)
>     non_skewed = skewed.filter(cond == False).join(right.filter(cond == False), column, 'left_outer')
>     # Join them together and replace the column with the column found in the right DataFrame
>     return filtered.unionAll(non_skewed).select('skewed.*', right['id'].alias(column + '_key')).drop(column)
>
> # Create the dataframes that will be joined to the skewed dataframe
> right_size = 100
> df_a = sqlCtx.range(right_size).select('id', F.col('id').alias('a'))
> df_b = sqlCtx.range(right_size).select('id', F.col('id').alias('b'))
> df_c = sqlCtx.range(right_size).select('id', F.col('id').alias('c'))
> df_d = sqlCtx.range(right_size).select('id', F.col('id').alias('d'))
> df_e = sqlCtx.range(right_size).select('id', F.col('id').alias('e'))
> df_f = sqlCtx.range(right_size).select('id', F.col('id').alias('f'))
>
> # Join everything together
> df = skewed
> df = skewjoin(df, df_a, 'a', top_10_percent)
> df = skewjoin(df, df_b, 'b', top_10_percent)
> df = skewjoin(df, df_c, 'c', top_10_percent)
> df = skewjoin(df, df_d, 'd', top_10_percent)
> df = skewjoin(df, df_e, 'e', top_10_percent)
> df = skewjoin(df, df_f, 'f', top_10_percent)
>
> # df.explain() shows a plan with 63 unions (2^(number_of_skewjoins) - 1),
> # which will be very inefficient and slow
> df.explain(True)
>
> # Evaluate the plan
> # You'd expect this to return 1000, but it does not; it returned 1140 on my system
> # (probably because it will recalculate the random columns? Not sure though)
> print(df.count())
> {code}
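
Since each skewjoin call reads its input DataFrame twice (once for the broadcast 
branch, once for the regular branch), every call doubles the number of union leaves 
in the plan. A hedged mitigation sketch in Scala (assumes Spark 2.1+, where 
Dataset.checkpoint exists; skewJoin, skewed, dfA..dfF and top10Percent are 
hypothetical Scala counterparts of the Python objects above):

{code}
// Truncate the lineage after each skew join so the logical plan stays linear
// instead of growing to 2^n - 1 unions.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val joined = Seq(("a", dfA), ("b", dfB), ("c", dfC), ("d", dfD), ("e", dfE), ("f", dfF))
  .foldLeft(skewed) { case (acc, (column, right)) =>
    skewJoin(acc, right, column, top10Percent).checkpoint()  // cuts the accumulated plan
  }

joined.explain(true)  // one join step at a time, no exponential union tree
{code}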






[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-15 Thread MIN-FU YANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MIN-FU YANG updated SPARK-15906:

Description: 
Improve the Naive Bayes algorithm on skewed data according to 
"Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Mahout & WEKA both have Complementary Naive Bayes implementations.

https://mahout.apache.org/users/classification/bayesian.html
http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html

Besides, this paper has been cited by other papers & books 600+ times, so I think 
its results are solid.
https://scholar.google.com.tw/scholar?rlz=1C5CHFA_enTW567TW567=high=1=UTF-8=1197073324019480518

  was:
Improve the Naive Bayes algorithm on skew data according to 
"Tackling the Poor Assumptions of Naive Bayes Text Classifers" chapter 3.2
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Mahout & WEKA both have Complementary Naive Bayes implementations.

https://mahout.apache.org/users/classification/bayesian.html
http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html

Besides, this paper is referenced by other papers & books 600+ times, I think 
it's result is solid.


> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: MIN-FU YANG
>Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to 
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
> Mahout & WEKA both have Complementary Naive Bayes implementations.
> https://mahout.apache.org/users/classification/bayesian.html
> http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html
> Besides, this paper has been cited by other papers & books 600+ times, so I think 
> its results are solid.
> https://scholar.google.com.tw/scholar?rlz=1C5CHFA_enTW567TW567=high=1=UTF-8=1197073324019480518
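
For context, the core of chapter 3 of the paper is the complement estimate: per-class 
word weights are computed from the counts of all other classes, which is more stable 
when class sizes are skewed. A hedged sketch of that estimate (not from any Spark 
patch; the names and the plain Map/Array representation are illustrative only):

{code}
// Complement Naive Bayes weights, per Rennie et al. (2003):
//   theta_cw  = (N_~cw + alpha) / (N_~c + alpha * |V|)  // counts from classes other than c
//   weight_cw = -log(theta_cw)                          // predict argmin_c of sum_w f_w * weight_cw
def complementWeights(countsPerClass: Map[String, Array[Double]],
                      alpha: Double = 1.0): Map[String, Array[Double]] = {
  val vocabSize = countsPerClass.values.head.length
  // Total count of each word across all classes
  val totals = Array.tabulate(vocabSize)(w => countsPerClass.values.map(_(w)).sum)
  countsPerClass.map { case (label, counts) =>
    val complement = totals.zip(counts).map { case (t, c) => t - c } // N_~cw
    val denom = complement.sum + alpha * vocabSize                   // N_~c + alpha * |V|
    label -> complement.map(n => -math.log((n + alpha) / denom))
  }
}
{code}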






[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-15 Thread MIN-FU YANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MIN-FU YANG updated SPARK-15906:

Description: 
Improve the Naive Bayes algorithm on skewed data according to 
"Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Mahout & WEKA both have Complementary Naive Bayes implementations.

https://mahout.apache.org/users/classification/bayesian.html
http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html

Besides, this paper has been cited by other papers & books 600+ times, so I think 
its results are solid.

  was:
Improve the Naive Bayes algorithm on skew data according to 
"Tackling the Poor Assumptions of Naive Bayes Text Classifers" chapter 3.2
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf


> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: MIN-FU YANG
>Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to 
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
> Mahout & WEKA both have Complementary Naive Bayes implementations.
> https://mahout.apache.org/users/classification/bayesian.html
> http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html
> Besides, this paper has been cited by other papers & books 600+ times, so I think 
> its results are solid.






[jira] [Commented] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-15 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333098#comment-15333098
 ] 

MIN-FU YANG commented on SPARK-15906:
-

OK, the description is updated.

> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: MIN-FU YANG
>Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to 
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
> Mahout & WEKA both have Complementary Naive Bayes implementations.
> https://mahout.apache.org/users/classification/bayesian.html
> http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html
> Besides, this paper has been cited by other papers & books 600+ times, so I think 
> its results are solid.






[jira] [Commented] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-12 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326685#comment-15326685
 ] 

MIN-FU YANG commented on SPARK-15906:
-

tilumi has opened a PR for this issue: https://github.com/apache/spark/pull/13627

> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: MIN-FU YANG
>
> Improve the Naive Bayes algorithm on skewed data according to 
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf






[jira] [Issue Comment Deleted] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-12 Thread MIN-FU YANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MIN-FU YANG updated SPARK-15906:

Comment: was deleted

(was: tilumi has pr: https://github.com/apache/spark/pull/13627 for this issue)

> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: MIN-FU YANG
>
> Improve the Naive Bayes algorithm on skewed data according to 
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf






[jira] [Created] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-12 Thread MIN-FU YANG (JIRA)
MIN-FU YANG created SPARK-15906:
---

 Summary: Complementary Naive Bayes Algorithm Implementation
 Key: SPARK-15906
 URL: https://issues.apache.org/jira/browse/SPARK-15906
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: MIN-FU YANG


Improve the Naive Bayes algorithm on skewed data according to 
"Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf


