[jira] [Comment Edited] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337 ]

zenglinxi edited comment on SPARK-24809 at 7/18/18 4:10 AM:
---

[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

{code:java}
// code in HashedRelation.scala
private def write(
    writeBoolean: (Boolean) => Unit,
    writeLong: (Long) => Unit,
    writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)
  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}

This write function in HashedRelation.scala is called when the executor doesn't have enough memory for the LongToUnsafeRowMap that holds the broadcast table's data. However, the value of cursor on the executor may never change after it is initialized by

{code:java}
// code in HashedRelation.scala
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}

which makes the value of "used" computed in write zero when the map is written to disk. Deserializing that on-disk data later produces a wrong pointer, so we may finally get wrong data from the broadcast join.

was (Author: gostop_zlx):
[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

{code:java}
// code in HashedRelation.scala
private def write(
    writeBoolean: (Boolean) => Unit,
    writeLong: (Long) => Unit,
    writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)
  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}

This write function in HashedRelation.scala is called when the executor doesn't have enough memory for the LongToUnsafeRowMap that holds the broadcast table's data. However, the value of cursor on the executor may never change after it is initialized by

{code:java}
// code in HashedRelation.scala
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}

which makes the value of "used" computed in write zero when the map is written to disk. Deserializing that on-disk data later produces a wrong pointer, so we may finally get wrong data from the broadcast join.

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
> Reporter: Lijia Liu
> Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
> When the join key is a long or an int in a broadcast join, Spark uses
> LongHashedRelation as the broadcast value. For details, see SPARK-14419.
> But if the broadcast value is abnormally big, the executor serializes it to
> disk, and data can be lost during serialization.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
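To make the failure mode concrete, here is a minimal sketch of the "used" arithmetic from the write function above. This is plain Java, not Spark code; the constant 16 is a stand-in for Platform.LONG_ARRAY_OFFSET, whose real value is JVM-specific.

```java
// Sketch of how LongToUnsafeRowMap's write() derives "used" from cursor.
// If cursor never advanced past its initial value, zero page words are
// serialized even though the page holds data, so the offsets stored in
// `array` dangle after deserialization.
public class UsedComputationSketch {
    // Stand-in for Platform.LONG_ARRAY_OFFSET (hypothetical value).
    static final long LONG_ARRAY_OFFSET = 16;

    static int usedWords(long cursor) {
        return (int) ((cursor - LONG_ARRAY_OFFSET) / 8);
    }

    public static void main(String[] args) {
        // Healthy map: cursor was advanced by 3 appended long words.
        System.out.println(usedWords(LONG_ARRAY_OFFSET + 3 * 8)); // 3

        // The buggy state described in this issue: cursor still at its
        // initial value, so the whole page is silently dropped on write.
        System.out.println(usedWords(LONG_ARRAY_OFFSET)); // 0
    }
}
```

This is why the symptom is silent: writing a "used" of zero is a perfectly valid serialized form, and the error only surfaces as wrong join results after deserialization.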
[jira] [Comment Edited] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337 ]

zenglinxi edited comment on SPARK-24809 at 7/18/18 4:09 AM:
---

[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

{code:java}
// code in HashedRelation.scala
private def write(
    writeBoolean: (Boolean) => Unit,
    writeLong: (Long) => Unit,
    writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)
  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}

This write function in HashedRelation.scala is called when the executor doesn't have enough memory for the LongToUnsafeRowMap that holds the broadcast table's data. However, the value of cursor on the executor may never change after it is initialized by

{code:java}
// code in HashedRelation.scala
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}

which makes the value of "used" computed in write zero when the map is written to disk. Deserializing that on-disk data later produces a wrong pointer, so we may finally get wrong data from the broadcast join.

was (Author: gostop_zlx):
[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
> Reporter: Lijia Liu
> Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
> When the join key is a long or an int in a broadcast join, Spark uses
> LongHashedRelation as the broadcast value. For details, see SPARK-14419.
> But if the broadcast value is abnormally big, the executor serializes it to
> disk, and data can be lost during serialization.
[jira] [Commented] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337 ]

zenglinxi commented on SPARK-24809:
---

[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data errors.

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
> Reporter: Lijia Liu
> Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
> When the join key is a long or an int in a broadcast join, Spark uses
> LongHashedRelation as the broadcast value. For details, see SPARK-14419.
> But if the broadcast value is abnormally big, the executor serializes it to
> disk, and data can be lost during serialization.
[jira] [Updated] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error
[ https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zenglinxi updated SPARK-24809:
--
Attachment: Spark LongHashedRelation serialization.svg

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
> Reporter: Lijia Liu
> Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
> When the join key is a long or an int in a broadcast join, Spark uses
> LongHashedRelation as the broadcast value. For details, see SPARK-14419.
> But if the broadcast value is abnormally big, the executor serializes it to
> disk, and data can be lost during serialization.
[jira] [Commented] (SPARK-24607) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518230#comment-16518230 ]

zenglinxi commented on SPARK-24607:
---

[~viirya] I ran some tests; it seems the rand() function in Spark SQL doesn't use a fixed seed by default, and it's possible that the input rows arrive in a different order when the data comes from different HDFS files.

{panel:title=Spark rand expression test}
spark-sql> select cityid, rand() from dim.city limit 10;
0 0.39429614142689984
1 0.5574738843620611
10 0.42368551279674027
20 0.883557920875731
30 0.0742883603842649
40 0.2457896850449358
42 0.9471294566318744
44 0.8039840817521141
45 0.17735580891100933
50 0.48644157965642754
Time taken: 2.302 seconds, Fetched 10 row(s)
spark-sql> select cityid, rand() from dim.city limit 10;
0 0.2557189370118823
1 0.6363701777528299
10 0.9959774332130664
20 0.07797999671586286
30 0.9242241907537617
40 0.6444701255846349
42 0.047668721209778164
44 0.1870254876331776
45 0.4832468933696129
50 0.36770710958257924
Time taken: 1.293 seconds, Fetched 10 row(s)
spark-sql> select cityid, rand(1) from dim.city limit 10;
0 0.13385709732307427
1 0.5897562959687032
10 0.01540012100242305
20 0.22569943461197162
30 0.9207602095112212
40 0.6222816020094926
42 0.1029837279488438
44 0.6678762139023474
45 0.06748208566157787
50 0.5215181769983375
Time taken: 1.012 seconds, Fetched 10 row(s)
spark-sql> select cityid, rand(1) from dim.city limit 10;
0 0.13385709732307427
1 0.5897562959687032
10 0.01540012100242305
20 0.22569943461197162
30 0.9207602095112212
40 0.6222816020094926
42 0.1029837279488438
44 0.6678762139023474
45 0.06748208566157787
50 0.5215181769983375
Time taken: 2.216 seconds, Fetched 10 row(s)
{panel}

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: SPARK-24607
> URL: https://issues.apache.org/jira/browse/SPARK-24607
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0, 2.3.1
> Reporter: zenglinxi
> Priority: Major
>
> Noticed that the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build cubes with
> Hive SQL that includes distribute by rand; data inconsistency may happen
> during fault-tolerance operations. Since Spark has a similar fault-tolerance
> mechanism, I think it is also a hidden but serious problem in Spark SQL.
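The failure mechanism can be sketched outside Spark. This is plain Java with hypothetical names, not Spark's shuffle code: an unseeded random partition key assigns the same row to different partitions on each attempt, so when a failed task is retried, downstream stages that already consumed part of the old output can double-count or miss rows; a fixed seed makes the assignment reproducible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of why "distribute by rand()" breaks consistency under task retry:
// the partition assignment is a function of random state, not of row content.
public class RandRepartitionSketch {
    static Map<Integer, List<Integer>> partitionByRand(List<Integer> rows, int numPartitions, Random rnd) {
        Map<Integer, List<Integer>> parts = new HashMap<>();
        for (int row : rows) {
            int p = rnd.nextInt(numPartitions); // non-deterministic partition key
            parts.computeIfAbsent(p, k -> new ArrayList<>()).add(row);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5);

        // An original attempt and a retry use different random state, so the
        // same row can land in a different partition across attempts.
        Map<Integer, List<Integer>> attempt1 = partitionByRand(rows, 3, new Random());
        Map<Integer, List<Integer>> retry = partitionByRand(rows, 3, new Random());

        // With a fixed seed, the assignment is reproducible across attempts.
        Map<Integer, List<Integer>> seededA = partitionByRand(rows, 3, new Random(1));
        Map<Integer, List<Integer>> seededB = partitionByRand(rows, 3, new Random(1));
        System.out.println(seededA.equals(seededB)); // true
    }
}
```

Note that even a seeded rand() is only deterministic if the input row order is deterministic, which ties back to the observation above about rows arriving in a different order from different HDFS files.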
[jira] [Updated] (SPARK-24607) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zenglinxi updated SPARK-24607:
--
Description:
Noticed that the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL that includes distribute by rand; data inconsistency may happen during fault-tolerance operations. Since Spark has a similar fault-tolerance mechanism, I think it is also a hidden but serious problem in Spark SQL.

was:
Noticed that the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL that includes distribute by rand; it may happen during fault-tolerance operations. I think it is also a hidden but serious problem in Spark SQL.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: SPARK-24607
> URL: https://issues.apache.org/jira/browse/SPARK-24607
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0, 2.3.1
> Reporter: zenglinxi
> Priority: Major
>
> Noticed that the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build cubes with
> Hive SQL that includes distribute by rand; data inconsistency may happen
> during fault-tolerance operations. Since Spark has a similar fault-tolerance
> mechanism, I think it is also a hidden but serious problem in Spark SQL.
[jira] [Updated] (SPARK-24607) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zenglinxi updated SPARK-24607:
--
Description:
Noticed that the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL that includes distribute by rand; it may happen during fault-tolerance operations. I think it is also a hidden but serious problem in Spark SQL.

was:
Noticed that the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL that includes distribute by rand; I think it is also a hidden but serious problem in Spark SQL.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: SPARK-24607
> URL: https://issues.apache.org/jira/browse/SPARK-24607
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0, 2.3.1
> Reporter: zenglinxi
> Priority: Major
>
> Noticed that the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build cubes with
> Hive SQL that includes distribute by rand; it may happen during
> fault-tolerance operations. I think it is also a hidden but serious problem in Spark SQL.
[jira] [Created] (SPARK-24607) Distribute by rand() can lead to data inconsistency
zenglinxi created SPARK-24607:
-

Summary: Distribute by rand() can lead to data inconsistency
Key: SPARK-24607
URL: https://issues.apache.org/jira/browse/SPARK-24607
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.1, 2.2.0
Reporter: zenglinxi

Noticed that the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL that includes distribute by rand; I think it is also a hidden but serious problem in Spark SQL.
[jira] [Updated] (SPARK-21223) Thread-safety issue in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zenglinxi updated SPARK-21223:
--
Attachment: historyserver_jstack.txt

BTW, this caused an infinite-loop problem when we restarted the HistoryServer and replayed the event logs of Spark apps.

> Thread-safety issue in FsHistoryProvider
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.1
> Reporter: zenglinxi
> Attachments: historyserver_jstack.txt
>
> Currently, Spark HistoryServer uses a HashMap named fileToAppInfo in class
> FsHistoryProvider to store the mapping from event-log path to attemptInfo.
> When a thread pool is used to replay the log files in the list and merge the
> list of old applications with new ones, multiple threads may update
> fileToAppInfo at the same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}
[jira] [Comment Edited] (SPARK-21223) Thread-safety issue in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16066319#comment-16066319 ]

zenglinxi edited comment on SPARK-21223 at 6/28/17 10:42 AM:
-

[~sowen] OK, I will check SPARK-21078 first.

was (Author: gostop_zlx):
OK, I will check SPARK-21078 first.

> Thread-safety issue in FsHistoryProvider
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.1
> Reporter: zenglinxi
>
> Currently, Spark HistoryServer uses a HashMap named fileToAppInfo in class
> FsHistoryProvider to store the mapping from event-log path to attemptInfo.
> When a thread pool is used to replay the log files in the list and merge the
> list of old applications with new ones, multiple threads may update
> fileToAppInfo at the same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}
[jira] [Commented] (SPARK-21223) Thread-safety issue in FsHistoryProvider
[ https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16066319#comment-16066319 ]

zenglinxi commented on SPARK-21223:
---

OK, I will check SPARK-21078 first.

> Thread-safety issue in FsHistoryProvider
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.1
> Reporter: zenglinxi
>
> Currently, Spark HistoryServer uses a HashMap named fileToAppInfo in class
> FsHistoryProvider to store the mapping from event-log path to attemptInfo.
> When a thread pool is used to replay the log files in the list and merge the
> list of old applications with new ones, multiple threads may update
> fileToAppInfo at the same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}
[jira] [Created] (SPARK-21223) Thread-safety issue in FsHistoryProvider
zenglinxi created SPARK-21223:
-

Summary: Thread-safety issue in FsHistoryProvider
Key: SPARK-21223
URL: https://issues.apache.org/jira/browse/SPARK-21223
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.1
Reporter: zenglinxi

Currently, Spark HistoryServer uses a HashMap named fileToAppInfo in class FsHistoryProvider to store the mapping from event-log path to attemptInfo. When a thread pool is used to replay the log files in the list and merge the list of old applications with new ones, multiple threads may update fileToAppInfo at the same time, which can cause thread-safety issues.
{code:java}
for (file <- logInfos) {
  tasks += replayExecutor.submit(new Runnable {
    override def run(): Unit = mergeApplicationListing(file)
  })
}
{code}
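The safe version of the pattern can be sketched as follows. This is plain Java with hypothetical names, not the actual FsHistoryProvider code: many replay tasks each publish one entry into a shared map, which is only safe if that map is a ConcurrentHashMap; a plain HashMap in the same position can lose updates during a concurrent resize (and, on some JDKs, corrupt its internal structure).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the replay loop above with a thread-safe map: each submitted
// task records one (event-log path -> attempt info) entry concurrently.
public class FileToAppInfoSketch {
    static Map<String, String> fill(int numLogs) {
        // ConcurrentHashMap where FsHistoryProvider used a plain HashMap.
        Map<String, String> fileToAppInfo = new ConcurrentHashMap<>();
        ExecutorService replayExecutor = Executors.newFixedThreadPool(8);
        for (int i = 0; i < numLogs; i++) {
            String logPath = "eventlog-" + i;
            replayExecutor.submit(() -> fileToAppInfo.put(logPath, "attemptInfo"));
        }
        replayExecutor.shutdown();
        try {
            replayExecutor.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fileToAppInfo;
    }

    public static void main(String[] args) {
        // With the concurrent map, every entry survives the concurrent puts.
        System.out.println(fill(10_000).size()); // 10000
    }
}
```

With a plain HashMap, the final size is not guaranteed to equal the number of submitted puts, which matches the lost/looping behavior described in this issue.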
[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zenglinxi updated SPARK-20240:
--
Affects Version/s: (was: 2.1.0)

> SparkSQL support limitations of max dynamic partitions when inserting hive
> table
> ---
>
> Key: SPARK-20240
> URL: https://issues.apache.org/jira/browse/SPARK-20240
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.2, 1.6.3
> Reporter: zenglinxi
>
> We found that an HDFS problem sometimes occurs when a user has a typo in
> their code while using Spark SQL to insert data into a partitioned table.
> For example:
> create table:
> {quote}
> create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal SQL for inserting into the table:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select price, day_partition, hour_partition from other_table;
> {quote}
> SQL with the typo:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select hour_partition, day_partition, price from other_table;
> {quote}
> This typo makes Spark SQL take the column "price" as "hour_partition", which
> may create millions of HDFS files in a short time if "other_table" holds a
> lot of data with a wide range of "price" values, giving rise to awful
> NameNode RPC performance.
> We think it's a good idea to limit the maximum number of files each task is
> allowed to create, to protect the HDFS NameNode from unconscious errors.
[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zenglinxi updated SPARK-20240:
--
Affects Version/s: 1.6.3

> SparkSQL support limitations of max dynamic partitions when inserting hive
> table
> ---
>
> Key: SPARK-20240
> URL: https://issues.apache.org/jira/browse/SPARK-20240
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.2, 1.6.3, 2.1.0
> Reporter: zenglinxi
>
> We found that an HDFS problem sometimes occurs when a user has a typo in
> their code while using Spark SQL to insert data into a partitioned table.
> For example:
> create table:
> {quote}
> create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal SQL for inserting into the table:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select price, day_partition, hour_partition from other_table;
> {quote}
> SQL with the typo:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select hour_partition, day_partition, price from other_table;
> {quote}
> This typo makes Spark SQL take the column "price" as "hour_partition", which
> may create millions of HDFS files in a short time if "other_table" holds a
> lot of data with a wide range of "price" values, giving rise to awful
> NameNode RPC performance.
> We think it's a good idea to limit the maximum number of files each task is
> allowed to create, to protect the HDFS NameNode from unconscious errors.
[jira] [Commented] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
[ https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958573#comment-15958573 ]

zenglinxi commented on SPARK-20240:
---

Hive has a configuration parameter, hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of dynamic partitions allowed to be created by each mapper/reducer; I think the same limit could also work in Spark SQL.

> SparkSQL support limitations of max dynamic partitions when inserting hive
> table
> ---
>
> Key: SPARK-20240
> URL: https://issues.apache.org/jira/browse/SPARK-20240
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.2, 2.1.0
> Reporter: zenglinxi
>
> We found that an HDFS problem sometimes occurs when a user has a typo in
> their code while using Spark SQL to insert data into a partitioned table.
> For example:
> create table:
> {quote}
> create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal SQL for inserting into the table:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select price, day_partition, hour_partition from other_table;
> {quote}
> SQL with the typo:
> {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition)
> select hour_partition, day_partition, price from other_table;
> {quote}
> This typo makes Spark SQL take the column "price" as "hour_partition", which
> may create millions of HDFS files in a short time if "other_table" holds a
> lot of data with a wide range of "price" values, giving rise to awful
> NameNode RPC performance.
> We think it's a good idea to limit the maximum number of files each task is
> allowed to create, to protect the HDFS NameNode from unconscious errors.
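The guard this issue proposes can be sketched as follows. This is plain Java with hypothetical names, not Spark or Hive code: a task tracks how many distinct dynamic-partition values it has seen and fails fast once a configured limit is exceeded, instead of flooding the NameNode with file creations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a per-task dynamic-partition limit, in the spirit of Hive's
// hive.exec.max.dynamic.partitions.pernode.
public class DynamicPartitionLimitSketch {
    static int countPartitionsOrFail(List<String> partitionValues, int maxPerTask) {
        Set<String> seen = new HashSet<>();
        for (String value : partitionValues) {
            seen.add(value);
            if (seen.size() > maxPerTask) {
                throw new IllegalStateException(
                    "Task tried to create " + seen.size()
                    + " dynamic partitions; the limit is " + maxPerTask);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Normal case: a handful of hour partitions.
        System.out.println(countPartitionsOrFail(Arrays.asList("00", "01", "01", "02"), 100)); // 3

        // The typo case: a high-cardinality column (like "price") used as the
        // partition value blows past the limit almost immediately.
        try {
            String[] prices = new String[1000];
            for (int i = 0; i < prices.length; i++) prices[i] = String.valueOf(i * 0.01);
            countPartitionsOrFail(Arrays.asList(prices), 100);
        } catch (IllegalStateException e) {
            System.out.println("rejected"); // rejected
        }
    }
}
```

Failing the task after ~100 bogus partitions is far cheaper than letting it create millions of small HDFS files and then cleaning up.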
[jira] [Created] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table
zenglinxi created SPARK-20240:
-

Summary: SparkSQL support limitations of max dynamic partitions when inserting hive table
Key: SPARK-20240
URL: https://issues.apache.org/jira/browse/SPARK-20240
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.1.0, 1.6.2
Reporter: zenglinxi

We found that an HDFS problem sometimes occurs when a user has a typo in their code while using Spark SQL to insert data into a partitioned table.
For example:
create table:
{quote}
create table test_tb (
  price double
) PARTITIONED BY (day_partition string, hour_partition string)
{quote}
normal SQL for inserting into the table:
{quote}
insert overwrite table test_tb partition(day_partition, hour_partition)
select price, day_partition, hour_partition from other_table;
{quote}
SQL with the typo:
{quote}
insert overwrite table test_tb partition(day_partition, hour_partition)
select hour_partition, day_partition, price from other_table;
{quote}
This typo makes Spark SQL take the column "price" as "hour_partition", which may create millions of HDFS files in a short time if "other_table" holds a lot of data with a wide range of "price" values, giving rise to awful NameNode RPC performance.
We think it's a good idea to limit the maximum number of files each task is allowed to create, to protect the HDFS NameNode from unconscious errors.
[jira] [Commented] (SPARK-13819) using a regexp_replace in a group by clause raises a nullpointerexception
[ https://issues.apache.org/jira/browse/SPARK-13819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611589#comment-15611589 ]

zenglinxi commented on SPARK-13819:
---

We encountered the same problem; is there any progress?

> using a regexp_replace in a group by clause raises a nullpointerexception
> -
>
> Key: SPARK-13819
> URL: https://issues.apache.org/jira/browse/SPARK-13819
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Javier Pérez
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Perform the following query over a table:
> SELECT t0.textsample
> FROM test t0
> ORDER BY regexp_replace(
>   t0.code,
>   concat('\\Q', 'a', '\\E'),
>   regexp_replace(
>     regexp_replace('zz', '', ''),
>     '\\$',
>     '\\$')) DESC;
> Problem: NullPointerException
> Trace:
> java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.RegExpReplace.nullSafeEval(regexpExpressions.scala:224)
> at org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:458)
> at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:36)
> at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:27)
> at scala.math.Ordering$class.gt(Ordering.scala:97)
> at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.gt(ordering.scala:27)
> at org.apache.spark.RangePartitioner.getPartition(Partitioner.scala:168)
> at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
[jira] [Commented] (SPARK-16253) make spark sql compatible with hive sql that using python script transform like using 'xxx.py'
[ https://issues.apache.org/jira/browse/SPARK-16253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15424388#comment-15424388 ]

zenglinxi commented on SPARK-16253:
---

Please wait a moment; I'm working on a PR for this.

> make spark sql compatible with hive sql that using python script transform
> like using 'xxx.py'
> --
>
> Key: SPARK-16253
> URL: https://issues.apache.org/jira/browse/SPARK-16253
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 1.6.2
> Reporter: zenglinxi
>
> Some Hive SQL like:
> {quote}
> add file /tmp/spark_sql_test/test.py;
> select transform(cityname) using 'test.py' as (new_cityname) from
> test.spark2_orc where dt='20160622' limit 5;
> {quote}
> can't be executed by Spark SQL directly, since it returns an error like:
> {quote}
> 16/06/26 11:01:28 INFO codegen.GenerateUnsafeProjection: Code generated in 19.054534 ms
> 16/06/26 11:01:28 ERROR execution.ScriptTransformationWriterThread: /bin/bash: test.py: command not found
> {quote}
> while the same SQL works fine in Hive with MR.
> Lots of ETL jobs can't be moved from Hive to Spark SQL because of this problem.
[jira] [Updated] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on
[ https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-16408: -- Description: when using Spark-sql to execute sql like: {noformat} add file hdfs://xxx/user/test; {noformat} if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an exception like: {noformat} org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory and recursive is not turned on. at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) {noformat} was: when use Spark-sql to execute sql like: {noformat} add file hdfs://xxx/user/test; {noformat} if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an exception like: {noformat} org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory and recursive is not turned on. 
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) {noformat} > SparkSQL Added file get Exception: is a directory and recursive is not turned > on > > > Key: SPARK-16408 > URL: https://issues.apache.org/jira/browse/SPARK-16408 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.6.2 >Reporter: zenglinxi > > when using Spark-sql to execute sql like: > {noformat} > add file hdfs://xxx/user/test; > {noformat} > if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an > exception like: > {noformat} > org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a > directory and recursive is not turned on. >at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) >at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) >at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) >at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on
[ https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365668#comment-15365668 ] zenglinxi commented on SPARK-16408: --- I think we should add a parameter (spark.input.dir.recursive) to control the value of recursive, and make this parameter work by modifying some code, like: {noformat} diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala index 6b16d59..3be8553 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala @@ -113,8 +113,9 @@ case class AddFile(path: String) extends RunnableCommand { override def run(sqlContext: SQLContext): Seq[Row] = { val hiveContext = sqlContext.asInstanceOf[HiveContext] +val recursive = sqlContext.sparkContext.getConf.getBoolean("spark.input.dir.recursive", false) hiveContext.runSqlHive(s"ADD FILE $path") -hiveContext.sparkContext.addFile(path) +hiveContext.sparkContext.addFile(path, recursive) Seq.empty[Row] } } {noformat} > SparkSQL Added file get Exception: is a directory and recursive is not turned > on > > > Key: SPARK-16408 > URL: https://issues.apache.org/jira/browse/SPARK-16408 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.6.2 >Reporter: zenglinxi > > when use Spark-sql to execute sql like: > {quote} > add file hdfs://xxx/user/test; > {quote} > if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an > exception like: > {quote} > org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a > directory and recursive is not turned on. 
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) >at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) >at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) >at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
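The guard the patch above works around lives in SparkContext.addFile, which rejects directories unless recursive is enabled. A minimal sketch of that behavior and of the proposed conf (Python stand-in, not Spark's API; `add_file` is hypothetical, and `spark.input.dir.recursive` is the name proposed in the comment, not an existing Spark setting):

```python
import os
import tempfile

def add_file(path, recursive=False):
    # Mirrors SparkContext.addFile's check: a directory is rejected
    # unless recursive collection is explicitly turned on.
    if os.path.isdir(path):
        if not recursive:
            raise RuntimeError(
                "Added file %s is a directory and recursive is not turned on." % path)
        # With recursive on, every file under the directory is added.
        return [os.path.join(root, f)
                for root, _, files in os.walk(path) for f in files]
    return [path]

# The proposed conf would flip the hard-coded default to a user choice.
conf = {"spark.input.dir.recursive": "true"}
recursive = conf.get("spark.input.dir.recursive", "false") == "true"

tmp = tempfile.mkdtemp()
open(os.path.join(tmp, "part-00000"), "w").close()
added = add_file(tmp, recursive=recursive)   # succeeds when the conf is true

try:
    add_file(tmp)            # default recursive=False, as ADD FILE hits today
    rejected = False
except RuntimeError:
    rejected = True
```

With the conf left at its default the sketch reproduces the reported exception; setting it to true makes `ADD FILE` on a directory succeed, which is exactly what the patch above wires through.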
[jira] [Updated] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on
[ https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-16408: -- Description: when use Spark-sql to execute sql like: {noformat} add file hdfs://xxx/user/test; {noformat} if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an exception like: {noformat} org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory and recursive is not turned on. at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) {noformat} was: when use Spark-sql to execute sql like: {quote} add file hdfs://xxx/user/test; {quote} if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an exception like: {quote} org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory and recursive is not turned on. 
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) {quote} > SparkSQL Added file get Exception: is a directory and recursive is not turned > on > > > Key: SPARK-16408 > URL: https://issues.apache.org/jira/browse/SPARK-16408 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.6.2 >Reporter: zenglinxi > > when use Spark-sql to execute sql like: > {noformat} > add file hdfs://xxx/user/test; > {noformat} > if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an > exception like: > {noformat} > org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a > directory and recursive is not turned on. >at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) >at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) >at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) >at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on
[ https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365658#comment-15365658 ] zenglinxi edited comment on SPARK-16408 at 7/7/16 6:09 AM: --- As shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two functions in SparkContext.scala: {noformat} def addFile(path: String): Unit = { addFile(path, false) } def addFile(path: String, recursive: Boolean): Unit = { ... } {noformat} But there is no config to turn recursive on or off, and Spark always calls addFile(path) by default, which means the value of recursive is false; this is why we get the exception. was (Author: gostop_zlx): as shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two functions in SparkContext.scala: {panel} def addFile(path: String): Unit = { addFile(path, false) } def addFile(path: String, recursive: Boolean): Unit = { ... } {panel} But there are no config to turn on or off recursive, and spark always call addFile(path) in default, which means the value of recursive is false, this is why we get the exceptions. > SparkSQL Added file get Exception: is a directory and recursive is not turned > on > > > Key: SPARK-16408 > URL: https://issues.apache.org/jira/browse/SPARK-16408 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.6.2 >Reporter: zenglinxi > > when use Spark-sql to execute sql like: > {quote} > add file hdfs://xxx/user/test; > {quote} > if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an > exception like: > {quote} > org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a > directory and recursive is not turned on. 
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) >at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) >at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) >at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on
[ https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365658#comment-15365658 ] zenglinxi commented on SPARK-16408: --- As shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two functions in SparkContext.scala: {quote} def addFile(path: String): Unit = { addFile(path, false) } def addFile(path: String, recursive: Boolean): Unit = { ... } {quote} But there is no config to turn recursive on or off, and Spark always calls addFile(path) by default, which means the value of recursive is false; this is why we get the exception. > SparkSQL Added file get Exception: is a directory and recursive is not turned > on > > > Key: SPARK-16408 > URL: https://issues.apache.org/jira/browse/SPARK-16408 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.6.2 >Reporter: zenglinxi > > when use Spark-sql to execute sql like: > {quote} > add file hdfs://xxx/user/test; > {quote} > if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an > exception like: > {quote} > org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a > directory and recursive is not turned on. >at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) >at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) >at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) >at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) >at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > {quote}
[jira] [Created] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on
zenglinxi created SPARK-16408: - Summary: SparkSQL Added file get Exception: is a directory and recursive is not turned on Key: SPARK-16408 URL: https://issues.apache.org/jira/browse/SPARK-16408 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.6.2 Reporter: zenglinxi when use Spark-sql to execute sql like: {quote} add file hdfs://xxx/user/test; {quote} if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an exception like: {quote} org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory and recursive is not turned on. at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372) at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16253) make spark sql compatible with hive sql that using python script transform like using 'xxx.py'
[ https://issues.apache.org/jira/browse/SPARK-16253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352917#comment-15352917 ] zenglinxi commented on SPARK-16253: --- I'm working on this issue... > make spark sql compatible with hive sql that using python script transform > like using 'xxx.py' > -- > > Key: SPARK-16253 > URL: https://issues.apache.org/jira/browse/SPARK-16253 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.6.2 >Reporter: zenglinxi > > Some hive sql like: > {quote} > add file /tmp/spark_sql_test/test.py; > select transform(cityname) using 'test.py' as (new_cityname) from > test.spark2_orc where dt='20160622' limit 5 ; > {quote} > can't be executed by spark sql directly, since it will return error like: > {quote} > 16/06/26 11:01:28 INFO codegen.GenerateUnsafeProjection: Code generated in > 19.054534 ms > 16/06/26 11:01:28 ERROR execution.ScriptTransformationWriterThread: > /bin/bash: test.py: command not found > {quote} > and the sql works fine in hive with MR. > Lots of ETL can't be moved from hive to spark sql because of this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16253) make spark sql compatible with hive sql that using python script transform like using 'xxx.py'
zenglinxi created SPARK-16253: - Summary: make spark sql compatible with hive sql that using python script transform like using 'xxx.py' Key: SPARK-16253 URL: https://issues.apache.org/jira/browse/SPARK-16253 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.6.2 Reporter: zenglinxi Some hive sql like: {quote} add file /tmp/spark_sql_test/test.py; select transform(cityname) using 'test.py' as (new_cityname) from test.spark2_orc where dt='20160622' limit 5 ; {quote} can't be executed by spark sql directly, since it will return error like: {quote} 16/06/26 11:01:28 INFO codegen.GenerateUnsafeProjection: Code generated in 19.054534 ms 16/06/26 11:01:28 ERROR execution.ScriptTransformationWriterThread: /bin/bash: test.py: command not found {quote} and the sql works fine in hive with MR. Lots of ETL can't be moved from hive to spark sql because of this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
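For context on the failure above: TRANSFORM ... USING 'cmd' streams input rows to an external process over stdin and reads output rows back from stdout, so the error is simply the shell failing to resolve the bare script name test.py; spelling out an interpreter (e.g. USING 'python test.py') is the kind of invocation that avoids the PATH lookup. A toy model of the pipe (not Spark's ScriptTransformation; `transform` is hypothetical, and a one-line uppercasing script stands in for test.py):

```python
import subprocess
import sys

def transform(rows, command):
    # Toy TRANSFORM ... USING: feed tab-separated rows to the command's
    # stdin, read the rows it emits on stdout back as the result.
    proc = subprocess.run(command, shell=True, text=True,
                          input="\n".join("\t".join(r) for r in rows),
                          capture_output=True)
    if proc.returncode != 0:
        # e.g. "/bin/bash: test.py: command not found"
        raise RuntimeError(proc.stderr.strip())
    return [line.split("\t") for line in proc.stdout.splitlines()]

# An uppercasing one-liner plays the role of test.py; invoking it through
# an explicit interpreter sidesteps the PATH lookup that fails above.
script = "import sys; [sys.stdout.write(line.upper()) for line in sys.stdin]"
out = transform([["beijing"], ["shanghai"]], f'{sys.executable} -c "{script}"')

try:
    transform([["x"]], "test.py")   # bare script name, as in the report
    failed = False
except RuntimeError:
    failed = True
```

The bare command name fails exactly as in the reported log, while the interpreter-prefixed form runs the same script successfully.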
[jira] [Commented] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table
[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262031#comment-15262031 ] zenglinxi commented on SPARK-14974: --- hi [~sowen]: Thanks for your reply. "200w" means 2,000,000. And about "Are you just saying this problem occurs when you have a typo in your code?": yes, it occurs when there is a problem in the code, but we can't keep every Spark user from submitting jobs with 'dangerous' code. The only thing we can do is find and kill this type of job as soon as possible. And since Hive (on MapReduce) can limit the max files created by a MapReduce job with Hive configuration parameters (ex.: hive.exec.max.dynamic.partitions.pernode, hive.exec.max.created.files), I think Spark can (and should) also do this. > spark sql job create too many files in HDFS when doing insert overwrite hive > table > -- > > Key: SPARK-14974 > URL: https://issues.apache.org/jira/browse/SPARK-14974 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.5.2 >Reporter: zenglinxi > > Recently, we often encounter problems using spark sql for inserting data into > a partition table (ex.: insert overwrite table $output_table partition(dt) > select xxx from tmp_table). > After the spark job start running on yarn, the app will create too many files > (ex. 2,000,000, or even 10,000,000), which will make HDFS under enormous > pressure. > We found that the num of files created by spark job is depending on the > partition num of hive table that will be inserted and the num of spark sql > partitions. > files_num = hive_table_partions_num * spark_sql_partitions_num. > We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= > 1000, and the hive_table_partions_num is very small under normal > circumstances, but it will turn out to be more than 2000 when we input a > wrong field as the partion field unconsciously, which will make the files_num > >= 1000 * 2000 = 2,000,000. 
> There is a configuration parameter in hive that can limit the maximum number > of dynamic partitions allowed to be created in each mapper/reducer named > hive.exec.max.dynamic.partitions.pernode, but this conf parameter didn't work > when we use hiveContext. > Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the > files_num be smaller, but it will affect the concurrency. > Can we create configuration parameters to limit the maximum number of files > allowed to be created by each task or limit the spark_sql_partitions_num > without affecting the concurrency?
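The estimate in the description, files_num = hive_table_partions_num * spark_sql_partitions_num, and the kind of pre-flight guard this comment asks for can be sketched as follows (the limit value and the check itself are illustrative; hive.exec.max.created.files is a real Hive parameter, but no equivalent Spark conf exists in the versions discussed here):

```python
def estimate_files(hive_table_partitions_num, spark_sql_partitions_num):
    # Worst case: every shuffle partition writes one file into every
    # dynamic partition of the target table.
    return hive_table_partitions_num * spark_sql_partitions_num

# Normal case: a handful of dynamic partitions, 1000 shuffle partitions.
normal = estimate_files(20, 1000)        # 20,000 files

# A wrong partition column inflates the partition count, as in the report:
files_num = estimate_files(2000, 1000)   # 2,000,000 files

# The guard the comment proposes, in the spirit of Hive's
# hive.exec.max.created.files (the threshold below is an assumed value):
MAX_CREATED_FILES = 100_000
too_many = files_num > MAX_CREATED_FILES
```

A job failing this check up front could be rejected before it floods the NameNode, which is the "find and kill as soon as possible" behavior the comment describes, done automatically.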
[jira] [Updated] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table
[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-14974: -- Description: Recently, we often encounter problems using spark sql for inserting data into a partition table (ex.: insert overwrite table $output_table partition(dt) select xxx from tmp_table). After the spark job starts running on yarn, the app will create too many files (ex. 2,000,000, or even 10,000,000), which will put HDFS under enormous pressure. We found that the number of files created by the spark job depends on the partition count of the hive table being inserted into and the number of spark sql partitions. files_num = hive_table_partions_num * spark_sql_partitions_num. We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= 1000, and the hive_table_partions_num is very small under normal circumstances, but it will turn out to be more than 2000 when we input a wrong field as the partition field unconsciously, which will make the files_num >= 1000 * 2000 = 2,000,000. There is a configuration parameter in hive that can limit the maximum number of dynamic partitions allowed to be created in each mapper/reducer, named hive.exec.max.dynamic.partitions.pernode, but this conf parameter didn't work when we use hiveContext. Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the files_num smaller, but it will affect the concurrency. Can we create configuration parameters to limit the maximum number of files allowed to be created by each task, or limit the spark_sql_partitions_num without affecting the concurrency? was: Recently, we often encounter problems using spark sql for inserting data into a partition table (ex.: insert overwrite table $output_table partition(dt) select xxx from tmp_table). After the spark job start running on yarn, the app will create too many files (ex. 2,000,000, or even 10,000,000), which will make HDFS under enormous pressure. 
We found that the num of files created by spark job is depending on the partition num of hive table that will be inserted and the num of spark sql partitions. files_num = hive_table_partions_num * spark_sql_partitions_num. We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= 1000, and the hive_table_partions_num is very small under normal circumstances, but it will turn out to be more than 2000 when we input a wrong field as the partion field unconsciously, which will make the files_num >= 1000 * 2000 = 2,000,000. There is a configuration parameter in hive that can limit the maximum number of dynamic partitions allowed to be created in each mapper/reducer named hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work when we use hiveContext. Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the files_num be smaller, but it will affect the concurrency. > spark sql job create too many files in HDFS when doing insert overwrite hive > table > -- > > Key: SPARK-14974 > URL: https://issues.apache.org/jira/browse/SPARK-14974 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.5.2 >Reporter: zenglinxi > > Recently, we often encounter problems using spark sql for inserting data into > a partition table (ex.: insert overwrite table $output_table partition(dt) > select xxx from tmp_table). > After the spark job start running on yarn, the app will create too many files > (ex. 2,000,000, or even 10,000,000), which will make HDFS under enormous > pressure. > We found that the num of files created by spark job is depending on the > partition num of hive table that will be inserted and the num of spark sql > partitions. > files_num = hive_table_partions_num * spark_sql_partitions_num. 
> We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= > 1000, and the hive_table_partions_num is very small under normal > circumstances, but it will turn out to be more than 2000 when we input a > wrong field as the partion field unconsciously, which will make the files_num > >= 1000 * 2000 = 2,000,000. > There is a configuration parameter in hive that can limit the maximum number > of dynamic partitions allowed to be created in each mapper/reducer named > hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work > when we use hiveContext. > Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the > files_num be smaller, but it will affect the concurrency. > Can we create configuration parameters to limit the maximum number of files > allowed to be create by each task or limit the spark_sql_partitions_num > without affect the concurrency? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table
[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-14974: -- Description: Recently, we often encounter problems using spark sql for inserting data into a partition table (ex.: insert overwrite table $output_table partition(dt) select xxx from tmp_table). After the spark job start running on yarn, the app will create too many files (ex. 2,000,000, or even 10,000,000), which will make HDFS under enormous pressure. We found that the num of files created by spark job is depending on the partition num of hive table that will be inserted and the num of spark sql partitions. files_num = hive_table_partions_num * spark_sql_partitions_num. We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= 1000, and the hive_table_partions_num is very small under normal circumstances, but it will turn out to be more than 2000 when we input a wrong field as the partion field unconsciously, which will make the files_num >= 1000 * 2000 = 2,000,000. There is a configuration parameter in hive that can limit the maximum number of dynamic partitions allowed to be created in each mapper/reducer named hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work when we use hiveContext. Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the files_num be smaller, but it will affect the concurrency. was: Recently, we often encounter problems using spark sql for inserting data into a partition table (ex.: insert overwrite table $output_table partition(dt) select xxx from tmp_table). After the spark job start running on yarn, the app will create too many files (ex. 200w+, or even 1000w+), which will make HDFS under enormous pressure. We found that the num of files created by spark job is depending on the partition num of hive table that will be inserted and the num of spark sql partitions. 
files_num = hive_table_partions_num * spark_sql_partitions_num. We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= 1000, and the hive_table_partions_num is very small under normal circumstances, but it will turn out to be more than 2000 when we input a wrong field as the partion field unconsciously, which will make the files_num >= 1000 * 2000 = 200w. There is a configuration parameter in hive that can limit the maximum number of dynamic partitions allowed to be created in each mapper/reducer named hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work when we use hiveContext. Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the files_num be smaller, but it will affect the concurrency. > spark sql job create too many files in HDFS when doing insert overwrite hive > table > -- > > Key: SPARK-14974 > URL: https://issues.apache.org/jira/browse/SPARK-14974 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 1.5.2 >Reporter: zenglinxi > > Recently, we often encounter problems using spark sql for inserting data into > a partition table (ex.: insert overwrite table $output_table partition(dt) > select xxx from tmp_table). > After the spark job start running on yarn, the app will create too many files > (ex. 2,000,000, or even 10,000,000), which will make HDFS under enormous > pressure. > We found that the num of files created by spark job is depending on the > partition num of hive table that will be inserted and the num of spark sql > partitions. > files_num = hive_table_partions_num * spark_sql_partitions_num. > We often make the spark_sql_partitions_num(spark.sql.shuffle.partitions) >= > 1000, and the hive_table_partions_num is very small under normal > circumstances, but it will turn out to be more than 2000 when we input a > wrong field as the partion field unconsciously, which will make the files_num > >= 1000 * 2000 = 2,000,000. 
> There is a configuration parameter in hive that can limit the maximum number > of dynamic partitions allowed to be created in each mapper/reducer named > hive.exec.max.dynamic.partitions.pernode, but this conf parameter did't work > when we use hiveContext. > Reducing spark_sql_partitions_num(spark.sql.shuffle.partitions) can make the > files_num be smaller, but it will affect the concurrency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table
[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-14974: -- Summary: spark sql job create too many files in HDFS when doing insert overwrite hive table (was: spark sql insert hive table write too many files)
[jira] [Updated] (SPARK-14974) spark sql insert hive table write too many files
[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-14974: -- Description: expanded with notes on hive.exec.max.dynamic.partitions.pernode not working under hiveContext and on the concurrency trade-off of reducing spark.sql.shuffle.partitions
[jira] [Updated] (SPARK-14974) spark sql insert hive table write too many files
[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-14974: -- Description: added the initial issue description
[jira] [Updated] (SPARK-14974) spark sql insert hive table write too many files
[ https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zenglinxi updated SPARK-14974: -- Summary: spark sql insert hive table write too many files (was: spark sql insert hive table write too much files)
[jira] [Created] (SPARK-14974) spark sql insert hive table write too much files
zenglinxi created SPARK-14974: - Summary: spark sql insert hive table write too much files Key: SPARK-14974 URL: https://issues.apache.org/jira/browse/SPARK-14974 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.5.2 Reporter: zenglinxi