[jira] [Comment Edited] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-17 Thread zenglinxi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337
 ] 

zenglinxi edited comment on SPARK-24809 at 7/18/18 4:10 AM:


[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data error.

 
{code:java}
// code in HashedRelation.scala
private def write(
writeBoolean: (Boolean) => Unit,
writeLong: (Long) => Unit,
writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)

  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}
This write function in HashedRelation.scala is called when the executor doesn't 
have enough memory for the LongToUnsafeRowMap that holds the broadcast table's 
data. However, the value of cursor on the executor may never change after its 
initialization by
{code:java}
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}
which makes the value of "used" in the write function zero when the map is written 
to disk. Deserializing that on-disk data then yields wrong pointers, and we may 
finally get wrong data from the broadcast join.
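To make the failure mode concrete, here is a minimal, self-contained sketch (illustrative only: the offset constant and field names are stand-ins for the ones in HashedRelation.scala, not the real Spark internals) of how a cursor that is never restored after deserialization makes the computed "used" count zero, so the page data is silently dropped on the next write:
{code:java}
object StaleCursorSketch {
  // Stand-in for Platform.LONG_ARRAY_OFFSET.
  val LONG_ARRAY_OFFSET = 16L

  def main(args: Array[String]): Unit = {
    val page = Array.fill[Long](8)(42L)  // 8 longs of row data in the page

    // After a normal build, the cursor has been advanced past the data,
    // so the write path computes the full page size.
    val cursorAfterBuild = LONG_ARRAY_OFFSET + page.length * 8
    val usedOk = ((cursorAfterBuild - LONG_ARRAY_OFFSET) / 8).toInt
    println(s"used with a maintained cursor: $usedOk")  // 8 -> whole page written

    // On an executor that deserialized the map but never restored the cursor,
    // the cursor still equals the initial offset, so "used" is computed as 0
    // and none of the page is written out.
    val staleCursor = LONG_ARRAY_OFFSET
    val usedBroken = ((staleCursor - LONG_ARRAY_OFFSET) / 8).toInt
    println(s"used with a stale cursor: $usedBroken")   // 0 -> page data is lost
  }
}
{code}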

 

 

 


was (Author: gostop_zlx):
[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data error.

 
{code:java}
// code in HashedRelation.scala
private def write(
writeBoolean: (Boolean) => Unit,
writeLong: (Long) => Unit,
writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)

  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}
This write function in HashedRelation.scala is called when the executor doesn't 
have enough memory for the LongToUnsafeRowMap that holds the broadcast table's 
data. However, the value of cursor on the executor may never change after its 
initialization by
{code:java}
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}
which makes the value of "used" in the write function zero when the map is written 
to disk. Deserializing that on-disk data then yields wrong pointers, and we may 
finally get wrong data from the broadcast join.

 

 

 

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is a long or int in a broadcast join, Spark will use 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). But if 
> the broadcast value is abnormally big, the executor will serialize it to disk, 
> and data will be lost when serializing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-17 Thread zenglinxi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337
 ] 

zenglinxi edited comment on SPARK-24809 at 7/18/18 4:09 AM:


[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data error.

 
{code:java}
// code in HashedRelation.scala
private def write(
writeBoolean: (Boolean) => Unit,
writeLong: (Long) => Unit,
writeBuffer: (Array[Byte], Int, Int) => Unit): Unit = {
  writeBoolean(isDense)
  writeLong(minKey)
  writeLong(maxKey)
  writeLong(numKeys)
  writeLong(numValues)
  writeLong(numKeyLookups)
  writeLong(numProbes)

  writeLong(array.length)
  writeLongArray(writeBuffer, array, array.length)
  val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
  writeLong(used)
  writeLongArray(writeBuffer, page, used)
}
{code}
This write function in HashedRelation.scala is called when the executor doesn't 
have enough memory for the LongToUnsafeRowMap that holds the broadcast table's 
data. However, the value of cursor on the executor may never change after its 
initialization by
{code:java}
private var cursor: Long = Platform.LONG_ARRAY_OFFSET
{code}
which makes the value of "used" in the write function zero when the map is written 
to disk. Deserializing that on-disk data then yields wrong pointers, and we may 
finally get wrong data from the broadcast join.

 

 

 


was (Author: gostop_zlx):
[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data error.

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is a long or int in a broadcast join, Spark will use 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). But if 
> the broadcast value is abnormally big, the executor will serialize it to disk, 
> and data will be lost when serializing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-17 Thread zenglinxi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547337#comment-16547337
 ] 

zenglinxi commented on SPARK-24809:
---

[^Spark LongHashedRelation serialization.svg]

I think it's a hidden but critical bug that may cause data error.

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is a long or int in a broadcast join, Spark will use 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). But if 
> the broadcast value is abnormally big, the executor will serialize it to disk, 
> and data will be lost when serializing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24809) Serializing LongHashedRelation in executor may result in data error

2018-07-17 Thread zenglinxi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-24809:
--
Attachment: Spark LongHashedRelation serialization.svg

> Serializing LongHashedRelation in executor may result in data error
> ---
>
> Key: SPARK-24809
> URL: https://issues.apache.org/jira/browse/SPARK-24809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
> Environment: Spark 2.2.1
> hadoop 2.7.1
>Reporter: Lijia Liu
>Priority: Critical
> Attachments: Spark LongHashedRelation serialization.svg
>
>
> When the join key is a long or int in a broadcast join, Spark will use 
> LongHashedRelation as the broadcast value (see SPARK-14419 for details). But if 
> the broadcast value is abnormally big, the executor will serialize it to disk, 
> and data will be lost when serializing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24607) Distribute by rand() can lead to data inconsistency

2018-06-20 Thread zenglinxi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518230#comment-16518230
 ] 

zenglinxi commented on SPARK-24607:
---

[~viirya] 

I ran some tests; it seems the rand() function in Spark SQL doesn't use a fixed 
seed by default, and it's possible that the input rows arrive in a different order 
when the data comes from different HDFS files.
{panel:title=Spark rand expression test}
spark-sql> select cityid, rand() from dim.city limit 10;
0       0.39429614142689984
1       0.5574738843620611
10      0.42368551279674027
20      0.883557920875731
30      0.0742883603842649
40      0.2457896850449358
42      0.9471294566318744
44      0.8039840817521141
45      0.17735580891100933
50      0.48644157965642754
Time taken: 2.302 seconds, Fetched 10 row(s)

spark-sql> select cityid, rand() from dim.city limit 10;
0       0.2557189370118823
1       0.6363701777528299
10      0.9959774332130664
20      0.07797999671586286
30      0.9242241907537617
40      0.6444701255846349
42      0.047668721209778164
44      0.1870254876331776
45      0.4832468933696129
50      0.36770710958257924
Time taken: 1.293 seconds, Fetched 10 row(s)

spark-sql> select cityid, rand(1) from dim.city limit 10;
0       0.13385709732307427
1       0.5897562959687032
10      0.01540012100242305
20      0.22569943461197162
30      0.9207602095112212
40      0.6222816020094926
42      0.1029837279488438
44      0.6678762139023474
45      0.06748208566157787
50      0.5215181769983375
Time taken: 1.012 seconds, Fetched 10 row(s)

spark-sql> select cityid, rand(1) from dim.city limit 10;
0       0.13385709732307427
1       0.5897562959687032
10      0.01540012100242305
20      0.22569943461197162
30      0.9207602095112212
40      0.6222816020094926
42      0.1029837279488438
44      0.6678762139023474
45      0.06748208566157787
50      0.5215181769983375
Time taken: 2.216 seconds, Fetched 10 row(s)
{panel}
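The behaviour above can also be reasoned about without Spark. Here is a minimal, self-contained sketch (purely illustrative; the reducer assignment below is a hypothetical stand-in for what distribute by rand() does in a shuffle) of why re-evaluating an unseeded rand() on a retried task can send rows to different reducers than the original attempt did:
{code:java}
import scala.util.Random

object DistributeByRandSketch {
  // Assign each row to one of numReducers buckets using an unseeded RNG,
  // the way "distribute by rand()" partitions rows by a freshly drawn random value.
  def assign(rows: Seq[Int], numReducers: Int, rng: Random): Map[Int, Set[Int]] =
    rows.groupBy(_ => rng.nextInt(numReducers)).map { case (k, v) => k -> v.toSet }

  def main(args: Array[String]): Unit = {
    val rows = 1 to 10
    val firstAttempt = assign(rows, 3, new Random())  // original map task attempt
    val retryAttempt = assign(rows, 3, new Random())  // retried attempt, new seed
    // Reducers that fetch some outputs from the first attempt and some from the
    // retry can see rows twice or not at all, which is the inconsistency above.
    println(s"reducer 0, first attempt: ${firstAttempt.getOrElse(0, Set.empty[Int])}")
    println(s"reducer 0, retry attempt: ${retryAttempt.getOrElse(0, Set.empty[Int])}")
  }
}
{code}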

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: SPARK-24607
> URL: https://issues.apache.org/jira/browse/SPARK-24607
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: zenglinxi
>Priority: Major
>
> Noticed the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build cubes with 
> Hive SQL that includes distribute by rand; data inconsistency may happen 
> during failure-tolerance (task retry) operations. Since Spark has a similar 
> failure-tolerance mechanism, I think it's also a hidden, serious problem in Spark SQL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24607) Distribute by rand() can lead to data inconsistency

2018-06-20 Thread zenglinxi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-24607:
--
Description: 
Noticed the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL 
that includes distribute by rand; data inconsistency may happen during 
failure-tolerance (task retry) operations. Since Spark has a similar failure-tolerance 
mechanism, I think it's also a hidden, serious problem in Spark SQL.

  was:
Noticed the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL 
that includes distribute by rand; it may happen during failure-tolerance operations. 
I think it's also a hidden, serious problem in Spark SQL.


> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: SPARK-24607
> URL: https://issues.apache.org/jira/browse/SPARK-24607
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: zenglinxi
>Priority: Major
>
> Noticed the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build cubes with 
> Hive SQL that includes distribute by rand; data inconsistency may happen 
> during failure-tolerance (task retry) operations. Since Spark has a similar 
> failure-tolerance mechanism, I think it's also a hidden, serious problem in Spark SQL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24607) Distribute by rand() can lead to data inconsistency

2018-06-20 Thread zenglinxi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-24607:
--
Description: 
Noticed the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL 
that includes distribute by rand; it may happen during failure-tolerance operations. 
I think it's also a hidden, serious problem in Spark SQL.

  was:
Noticed the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL 
that includes distribute by rand. I think it's also a hidden, serious problem in 
Spark SQL.


> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: SPARK-24607
> URL: https://issues.apache.org/jira/browse/SPARK-24607
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: zenglinxi
>Priority: Major
>
> Noticed the following queries can give different results:
> {code:java}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;{code}
> This issue was first reported by someone using Kylin to build cubes with 
> Hive SQL that includes distribute by rand; it may happen during failure-tolerance 
> operations. I think it's also a hidden, serious problem in Spark SQL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24607) Distribute by rand() can lead to data inconsistency

2018-06-20 Thread zenglinxi (JIRA)
zenglinxi created SPARK-24607:
-

 Summary: Distribute by rand() can lead to data inconsistency
 Key: SPARK-24607
 URL: https://issues.apache.org/jira/browse/SPARK-24607
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.2.0
Reporter: zenglinxi


Noticed the following queries can give different results:
{code:java}
select count(*) from tbl;
select count(*) from (select * from tbl distribute by rand()) a;{code}
This issue was first reported by someone using Kylin to build cubes with Hive SQL 
that includes distribute by rand. I think it's also a hidden, serious problem in 
Spark SQL.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-28 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-21223:
--
Attachment: historyserver_jstack.txt

BTW, this causes an infinite-loop problem when we restart the history server and 
replay the event logs of Spark apps.

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
> Attachments: historyserver_jstack.txt
>
>
> Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in the class 
> FsHistoryProvider to store the mapping from event-log path to attemptInfo. 
> When a thread pool is used to replay the log files in the list and merge the list of 
> old applications with the new ones, multiple threads may update fileToAppInfo at the 
> same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-28 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16066319#comment-16066319
 ] 

zenglinxi edited comment on SPARK-21223 at 6/28/17 10:42 AM:
-

[~sowen] OK, I will check SPARK-21078 first.


was (Author: gostop_zlx):
ok, i will check SPARK-21078 first.

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
>
> Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in the class 
> FsHistoryProvider to store the mapping from event-log path to attemptInfo. 
> When a thread pool is used to replay the log files in the list and merge the list of 
> old applications with the new ones, multiple threads may update fileToAppInfo at the 
> same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-28 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16066319#comment-16066319
 ] 

zenglinxi commented on SPARK-21223:
---

OK, I will check SPARK-21078 first.

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
>
> Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in the class 
> FsHistoryProvider to store the mapping from event-log path to attemptInfo. 
> When a thread pool is used to replay the log files in the list and merge the list of 
> old applications with the new ones, multiple threads may update fileToAppInfo at the 
> same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-27 Thread zenglinxi (JIRA)
zenglinxi created SPARK-21223:
-

 Summary: Thread-safety issue in FsHistoryProvider 
 Key: SPARK-21223
 URL: https://issues.apache.org/jira/browse/SPARK-21223
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: zenglinxi


Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in the class 
FsHistoryProvider to store the mapping from event-log path to attemptInfo. 
When a thread pool is used to replay the log files in the list and merge the list of 
old applications with the new ones, multiple threads may update fileToAppInfo at the 
same time, which can cause thread-safety issues.
{code:java}
for (file <- logInfos) {
  tasks += replayExecutor.submit(new Runnable {
    override def run(): Unit = mergeApplicationListing(file)
  })
}
{code}
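A minimal, self-contained sketch of the race (illustrative only; the map, keys, and thread count are stand-ins, not the real FsHistoryProvider code): several "replay" tasks put entries into a shared mutable HashMap concurrently, which is the unsynchronized update pattern described above, while a ConcurrentHashMap stays consistent:
{code:java}
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import scala.collection.mutable

object FileToAppInfoRaceSketch {
  def main(args: Array[String]): Unit = {
    val unsafe = mutable.HashMap[String, Int]()        // plays the role of fileToAppInfo
    val safe   = new ConcurrentHashMap[String, Int]()  // a thread-safe alternative

    val pool = Executors.newFixedThreadPool(8)
    (1 to 10000).foreach { i =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          unsafe.put(s"eventlog-$i", i)  // unsynchronized puts can lose or corrupt entries
          safe.put(s"eventlog-$i", i)    // safe under concurrent access
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    println(s"unsafe map size: ${unsafe.size} (may be < 10000)")
    println(s"safe map size:   ${safe.size} (always 10000)")
  }
}
{code}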




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table

2017-04-06 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-20240:
--
Affects Version/s: (was: 2.1.0)

> SparkSQL support limitations of max dynamic partitions when inserting hive 
> table
> 
>
> Key: SPARK-20240
> URL: https://issues.apache.org/jira/browse/SPARK-20240
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: zenglinxi
>
> We found that an HDFS problem sometimes occurs when a user has a typo in their 
> code while using SparkSQL to insert data into a partitioned table.
> For Example:
> create table:
>  {quote}
>  create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal sql for inserting table:
>  {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition) 
> select price, day_partition, hour_partition from other_table;
>  {quote}
> sql with typo:
>  {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition) 
> select hour_partition, day_partition, price from other_table;
>  {quote}
> This typo makes SparkSQL take the column "price" as "hour_partition", which may 
> create millions of HDFS files in a short time if "other_table" has a lot of data 
> with a wide range of "price" values, giving rise to awful NameNode RPC 
> performance.
> We think it's a good idea to limit the maximum number of files each task is 
> allowed to create, to protect the HDFS NameNode from such unintended errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table

2017-04-06 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-20240:
--
Affects Version/s: 1.6.3

> SparkSQL support limitations of max dynamic partitions when inserting hive 
> table
> 
>
> Key: SPARK-20240
> URL: https://issues.apache.org/jira/browse/SPARK-20240
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3, 2.1.0
>Reporter: zenglinxi
>
> We found that an HDFS problem sometimes occurs when a user has a typo in their 
> code while using SparkSQL to insert data into a partitioned table.
> For Example:
> create table:
>  {quote}
>  create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal sql for inserting table:
>  {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition) 
> select price, day_partition, hour_partition from other_table;
>  {quote}
> sql with typo:
>  {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition) 
> select hour_partition, day_partition, price from other_table;
>  {quote}
> This typo makes SparkSQL take the column "price" as "hour_partition", which may 
> create millions of HDFS files in a short time if "other_table" has a lot of data 
> with a wide range of "price" values, giving rise to awful NameNode RPC 
> performance.
> We think it's a good idea to limit the maximum number of files each task is 
> allowed to create, to protect the HDFS NameNode from such unintended errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table

2017-04-06 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958573#comment-15958573
 ] 

zenglinxi commented on SPARK-20240:
---

There is a configuration parameter in Hive, hive.exec.max.dynamic.partitions.pernode, 
that limits the maximum number of dynamic partitions allowed to be created in each 
mapper/reducer; I think it could also work in SparkSQL.
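Below is a minimal sketch of how that limit would be set from Spark (assuming a Spark 1.6-era HiveContext and the table names from this issue's description); whether Spark's dynamic-partition insert path actually honors the setting is exactly what needs to be verified:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DynamicPartitionLimitSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dyn-partition-limit"))
    val hiveContext = new HiveContext(sc)

    // Try to apply the Hive-side limit before a dynamic-partition insert.
    hiveContext.setConf("hive.exec.max.dynamic.partitions.pernode", "100")
    hiveContext.sql(
      """INSERT OVERWRITE TABLE test_tb PARTITION(day_partition, hour_partition)
        |SELECT price, day_partition, hour_partition FROM other_table""".stripMargin)
  }
}
{code}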

> SparkSQL support limitations of max dynamic partitions when inserting hive 
> table
> 
>
> Key: SPARK-20240
> URL: https://issues.apache.org/jira/browse/SPARK-20240
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.1.0
>Reporter: zenglinxi
>
> We found that an HDFS problem sometimes occurs when a user has a typo in their 
> code while using SparkSQL to insert data into a partitioned table.
> For Example:
> create table:
>  {quote}
>  create table test_tb (
>   price double
> ) PARTITIONED BY (day_partition string, hour_partition string)
> {quote}
> normal sql for inserting table:
>  {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition) 
> select price, day_partition, hour_partition from other_table;
>  {quote}
> sql with typo:
>  {quote}
> insert overwrite table test_tb partition(day_partition, hour_partition) 
> select hour_partition, day_partition, price from other_table;
>  {quote}
> This typo makes SparkSQL take the column "price" as "hour_partition", which may 
> create millions of HDFS files in a short time if "other_table" has a lot of data 
> with a wide range of "price" values, giving rise to awful NameNode RPC 
> performance.
> We think it's a good idea to limit the maximum number of files each task is 
> allowed to create, to protect the HDFS NameNode from such unintended errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20240) SparkSQL support limitations of max dynamic partitions when inserting hive table

2017-04-06 Thread zenglinxi (JIRA)
zenglinxi created SPARK-20240:
-

 Summary: SparkSQL support limitations of max dynamic partitions 
when inserting hive table
 Key: SPARK-20240
 URL: https://issues.apache.org/jira/browse/SPARK-20240
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0, 1.6.2
Reporter: zenglinxi


We found that an HDFS problem sometimes occurs when a user has a typo in their code 
while using SparkSQL to insert data into a partitioned table.
For Example:
create table:
 {quote}
 create table test_tb (
   price double
 ) PARTITIONED BY (day_partition string, hour_partition string)
{quote}

normal sql for inserting table:
 {quote}
insert overwrite table test_tb partition(day_partition, hour_partition) select 
price, day_partition, hour_partition from other_table;
 {quote}

sql with typo:
 {quote}
insert overwrite table test_tb partition(day_partition, hour_partition) select 
hour_partition, day_partition, price from other_table;
 {quote}
This typo makes SparkSQL take the column "price" as "hour_partition", which may 
create millions of HDFS files in a short time if "other_table" has a lot of data 
with a wide range of "price" values, giving rise to awful NameNode RPC performance.
We think it's a good idea to limit the maximum number of files each task is allowed 
to create, to protect the HDFS NameNode from such unintended errors.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13819) using a regexp_replace in a group by clause raises a nullpointerexception

2016-10-27 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611589#comment-15611589
 ] 

zenglinxi commented on SPARK-13819:
---

We encountered the same problem. Is there any progress?

> using a regexp_replace in a group by clause raises a nullpointerexception
> -
>
> Key: SPARK-13819
> URL: https://issues.apache.org/jira/browse/SPARK-13819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Javier Pérez
>
> 1. Start start-thriftserver.sh
> 2. connect with beeline
> 3. Perform the following query over a table:
>   SELECT t0.textsample 
>   FROM test t0 
>   ORDER BY regexp_replace(
> t0.code, 
> concat('\\Q', 'a', '\\E'), 
> regexp_replace(
>regexp_replace('zz', '', ''),
> '\\$', 
> '\\$')) DESC;
> Problem: NullPointerException
> Trace:
>  java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.RegExpReplace.nullSafeEval(regexpExpressions.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:458)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:36)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:27)
>   at scala.math.Ordering$class.gt(Ordering.scala:97)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.gt(ordering.scala:27)
>   at org.apache.spark.RangePartitioner.getPartition(Partitioner.scala:168)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16253) make spark sql compatible with hive sql that using python script transform like using 'xxx.py'

2016-08-17 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15424388#comment-15424388
 ] 

zenglinxi commented on SPARK-16253:
---

wait a minute please, I'm working on this PR

> make spark sql compatible with hive sql that using python script transform 
> like using 'xxx.py'
> --
>
> Key: SPARK-16253
> URL: https://issues.apache.org/jira/browse/SPARK-16253
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: zenglinxi
>
> Some hive sql like:
> {quote}
> add file /tmp/spark_sql_test/test.py;
> select transform(cityname) using 'test.py' as (new_cityname) from 
> test.spark2_orc where dt='20160622' limit 5 ;
> {quote}
> can't be executed by Spark SQL directly, since it returns an error like:
> {quote}
> 16/06/26 11:01:28 INFO codegen.GenerateUnsafeProjection: Code generated in 
> 19.054534 ms
> 16/06/26 11:01:28 ERROR execution.ScriptTransformationWriterThread: 
> /bin/bash: test.py: command not found
> {quote}
> and the sql works fine in hive with MR.
> Lots of ETL can't be moved from hive to spark sql because of this problem. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-07 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-16408:
--
Description: 
when using Spark-sql to execute sql like:
{noformat}
add file hdfs://xxx/user/test;
{noformat}
if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
exception like:
{noformat}
org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory 
and recursive is not turned on.
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
   at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
   at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
{noformat}

  was:
when use Spark-sql to execute sql like:
{noformat}
add file hdfs://xxx/user/test;
{noformat}
if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
exception like:
{noformat}
org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory 
and recursive is not turned on.
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
   at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
   at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
{noformat}


> SparkSQL Added file get Exception: is a directory and recursive is not turned 
> on
> 
>
> Key: SPARK-16408
> URL: https://issues.apache.org/jira/browse/SPARK-16408
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: zenglinxi
>
> when using Spark-sql to execute sql like:
> {noformat}
> add file hdfs://xxx/user/test;
> {noformat}
> if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
> exception like:
> {noformat}
> org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a 
> directory and recursive is not turned on.
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
>at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-07 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365668#comment-15365668
 ] 

zenglinxi commented on SPARK-16408:
---

I think we should add a parameter (spark.input.dir.recursive) to control the 
value of recursive, and make it work by modifying some code, like:
{noformat}
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
index 6b16d59..3be8553 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
@@ -113,8 +113,9 @@ case class AddFile(path: String) extends RunnableCommand {
 
   override def run(sqlContext: SQLContext): Seq[Row] = {
 val hiveContext = sqlContext.asInstanceOf[HiveContext]
+val recursive = 
sqlContext.sparkContext.getConf.getBoolean("spark.input.dir.recursive", false)
 hiveContext.runSqlHive(s"ADD FILE $path")
-hiveContext.sparkContext.addFile(path)
+hiveContext.sparkContext.addFile(path, recursive)
 Seq.empty[Row]
   }
 }
{noformat}
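For completeness, a small usage sketch of the existing public SparkContext API that the proposed config would route to (the paths below are placeholders taken from this issue, not real files):
{code:java}
import org.apache.spark.{SparkConf, SparkContext}

object AddFileRecursiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("add-file-recursive"))

    // A single file works with the one-argument overload:
    sc.addFile("hdfs://xxx/user/test/one_file.txt")

    // A directory needs the recursive overload, which the ADD FILE
    // command path never reaches today:
    sc.addFile("hdfs://xxx/user/test", recursive = true)
  }
}
{code}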


> SparkSQL Added file get Exception: is a directory and recursive is not turned 
> on
> 
>
> Key: SPARK-16408
> URL: https://issues.apache.org/jira/browse/SPARK-16408
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: zenglinxi
>
> when use Spark-sql to execute sql like:
> {quote}
> add file hdfs://xxx/user/test;
> {quote}
> if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
> exception like:
> {quote}
> org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a 
> directory and recursive is not turned on.
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
>at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-07 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-16408:
--
Description: 
when use Spark-sql to execute sql like:
{noformat}
add file hdfs://xxx/user/test;
{noformat}
if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
exception like:
{noformat}
org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory 
and recursive is not turned on.
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
   at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
   at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
{noformat}

  was:
when use Spark-sql to execute sql like:
{quote}
add file hdfs://xxx/user/test;
{quote}
if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
exception like:
{quote}
org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory 
and recursive is not turned on.
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
   at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
   at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
{quote}


> SparkSQL Added file get Exception: is a directory and recursive is not turned 
> on
> 
>
> Key: SPARK-16408
> URL: https://issues.apache.org/jira/browse/SPARK-16408
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: zenglinxi
>
> when use Spark-sql to execute sql like:
> {noformat}
> add file hdfs://xxx/user/test;
> {noformat}
> if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
> exception like:
> {noformat}
> org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a 
> directory and recursive is not turned on.
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
>at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-07 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365658#comment-15365658
 ] 

zenglinxi edited comment on SPARK-16408 at 7/7/16 6:09 AM:
---

as shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two 
functions in SparkContext.scala:
{noformat}
def addFile(path: String): Unit = {
addFile(path, false)
}
def addFile(path: String, recursive: Boolean): Unit = {
...
}
{noformat}
But there is no config to turn recursive on or off, and Spark always calls 
addFile(path) by default, which means the value of recursive is false; this is 
why we get the exception.


was (Author: gostop_zlx):
as shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two 
functions in SparkContext.scala:
{panel}
def addFile(path: String): Unit = {
addFile(path, false)
}
def addFile(path: String, recursive: Boolean): Unit = {
...
}
{panel}
But there are no config to turn on or off recursive, and spark always call 
addFile(path) in default, which means the value of recursive is  false, this is 
why we get the exceptions.

> SparkSQL Added file get Exception: is a directory and recursive is not turned 
> on
> 
>
> Key: SPARK-16408
> URL: https://issues.apache.org/jira/browse/SPARK-16408
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: zenglinxi
>
> when use Spark-sql to execute sql like:
> {quote}
> add file hdfs://xxx/user/test;
> {quote}
> if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
> exception like:
> {quote}
> org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a 
> directory and recursive is not turned on.
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
>at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-07 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365658#comment-15365658
 ] 

zenglinxi edited comment on SPARK-16408 at 7/7/16 6:07 AM:
---

as shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two 
functions in SparkContext.scala:
{panel}
def addFile(path: String): Unit = {
addFile(path, false)
}
def addFile(path: String, recursive: Boolean): Unit = {
...
}
{panel}
But there is no config to turn recursive on or off, and Spark always calls 
addFile(path) by default, which means the value of recursive is false; this is 
why we get the exception.


was (Author: gostop_zlx):
as shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two 
functions in SparkContext.scala:
{quote}
def addFile(path: String): Unit = {
addFile(path, false)
  }
def addFile(path: String, recursive: Boolean): Unit = {
...
}
{quote}
But there are no config to turn on or off recursive, and spark always call 
addFile(path) in default, which means the value of recursive is  false, this is 
why we get the exceptions.

> SparkSQL Added file get Exception: is a directory and recursive is not turned 
> on
> 
>
> Key: SPARK-16408
> URL: https://issues.apache.org/jira/browse/SPARK-16408
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: zenglinxi
>
> when use Spark-sql to execute sql like:
> {quote}
> add file hdfs://xxx/user/test;
> {quote}
> if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
> exception like:
> {quote}
> org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a 
> directory and recursive is not turned on.
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
>at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-06 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365658#comment-15365658
 ] 

zenglinxi commented on SPARK-16408:
---

as shown in https://issues.apache.org/jira/browse/SPARK-4687, we have two 
functions in SparkContext.scala:
{quote}
def addFile(path: String): Unit = {
addFile(path, false)
  }
def addFile(path: String, recursive: Boolean): Unit = {
...
}
{quote}
But there is no config to turn recursive on or off, and Spark always calls 
addFile(path) by default, which means the value of recursive is false; this is 
why we get the exception.

> SparkSQL Added file get Exception: is a directory and recursive is not turned 
> on
> 
>
> Key: SPARK-16408
> URL: https://issues.apache.org/jira/browse/SPARK-16408
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: zenglinxi
>
> when use Spark-sql to execute sql like:
> {quote}
> add file hdfs://xxx/user/test;
> {quote}
> if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
> exception like:
> {quote}
> org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a 
> directory and recursive is not turned on.
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
>at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
>at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
>at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16408) SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-06 Thread zenglinxi (JIRA)
zenglinxi created SPARK-16408:
-

 Summary: SparkSQL Added file get Exception: is a directory and 
recursive is not turned on
 Key: SPARK-16408
 URL: https://issues.apache.org/jira/browse/SPARK-16408
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 1.6.2
Reporter: zenglinxi


when use Spark-sql to execute sql like:
{quote}
add file hdfs://xxx/user/test;
{quote}
if the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an 
exception like:
{quote}
org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a directory 
and recursive is not turned on.
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
   at org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
   at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
   at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16253) make spark sql compatible with hive sql that using python script transform like using 'xxx.py'

2016-06-28 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352917#comment-15352917
 ] 

zenglinxi commented on SPARK-16253:
---

I'm working on this issue...

> make spark sql compatible with hive sql that using python script transform 
> like using 'xxx.py'
> --
>
> Key: SPARK-16253
> URL: https://issues.apache.org/jira/browse/SPARK-16253
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: zenglinxi
>
> Some hive sql like:
> {quote}
> add file /tmp/spark_sql_test/test.py;
> select transform(cityname) using 'test.py' as (new_cityname) from 
> test.spark2_orc where dt='20160622' limit 5 ;
> {quote}
> can't be executed by Spark SQL directly, since it returns an error like:
> {quote}
> 16/06/26 11:01:28 INFO codegen.GenerateUnsafeProjection: Code generated in 
> 19.054534 ms
> 16/06/26 11:01:28 ERROR execution.ScriptTransformationWriterThread: 
> /bin/bash: test.py: command not found
> {quote}
> and the sql works fine in hive with MR.
> Lots of ETL can't be moved from hive to spark sql because of this problem. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16253) make spark sql compatible with hive sql that using python script transform like using 'xxx.py'

2016-06-28 Thread zenglinxi (JIRA)
zenglinxi created SPARK-16253:
-

 Summary: make spark sql compatible with hive sql that using python 
script transform like using 'xxx.py'
 Key: SPARK-16253
 URL: https://issues.apache.org/jira/browse/SPARK-16253
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 1.6.2
Reporter: zenglinxi


Some hive sql like:
{quote}
add file /tmp/spark_sql_test/test.py;
select transform(cityname) using 'test.py' as (new_cityname) from 
test.spark2_orc where dt='20160622' limit 5 ;
{quote}
can't be executed by Spark SQL directly, since it returns an error like:
{quote}
16/06/26 11:01:28 INFO codegen.GenerateUnsafeProjection: Code generated in 
19.054534 ms
16/06/26 11:01:28 ERROR execution.ScriptTransformationWriterThread: /bin/bash: 
test.py: command not found
{quote}
and the sql works fine in hive with MR.
Lots of ETL can't be moved from hive to spark sql because of this problem. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table

2016-04-28 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262031#comment-15262031
 ] 

zenglinxi commented on SPARK-14974:
---

Hi [~sowen],
Thanks for your reply. 
"200w" means 2,000,000. As for "Are you just saying this problem occurs when you 
have a typo in your code?": yes, it occurs when there is a problem in the code, but 
we can't keep every Spark user from submitting jobs with 'dangerous' code. The only 
thing we can do is find and kill this type of job as soon as possible.
And since Hive (on MapReduce) can limit the maximum number of files created by a 
MapReduce job with Hive configuration parameters (e.g. 
hive.exec.max.dynamic.partitions.pernode, hive.exec.max.created.files), I think 
Spark can (and should) do this too.
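To make the scale concrete, here is a tiny sketch of the file-count arithmetic from the description below, plus the shuffle-partition knob it mentions as a partial mitigation (illustrative only; the commented-out setConf call assumes a Spark 1.5-era SQLContext):
{code:java}
object FilesPerInsertSketch {
  def main(args: Array[String]): Unit = {
    // Example numbers from the description: a wrong partition column with ~2000
    // distinct values, and spark.sql.shuffle.partitions set to 1000.
    val hiveTablePartitionsNum = 2000
    val sparkSqlPartitionsNum  = 1000
    val filesNum = hiveTablePartitionsNum * sparkSqlPartitionsNum
    println(s"worst-case files created: $filesNum")  // 2,000,000

    // Lowering spark.sql.shuffle.partitions shrinks filesNum but also reduces the
    // insert's parallelism, which is the trade-off described in the issue:
    // sqlContext.setConf("spark.sql.shuffle.partitions", "200")
  }
}
{code}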

> spark sql job create too many files in HDFS when doing insert overwrite hive 
> table
> --
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>
> Recently, we often encounter problems using Spark SQL to insert data into 
> a partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
> select xxx from tmp_table).
> After the Spark job starts running on YARN, the app can create too many files 
> (e.g. 2,000,000, or even 10,000,000), which puts HDFS under enormous 
> pressure.
> We found that the number of files created by the Spark job depends on the 
> number of partitions of the Hive table being inserted into and the number of 
> Spark SQL partitions:
> files_num = hive_table_partitions_num * spark_sql_partitions_num.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 
> 1000, and hive_table_partitions_num is very small under normal 
> circumstances, but it can turn out to be more than 2000 when we unconsciously 
> use a wrong field as the partition field, which makes the files_num 
> >= 1000 * 2000 = 2,000,000.
> There is a configuration parameter in Hive, 
> hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
> dynamic partitions allowed to be created in each mapper/reducer, but this 
> parameter didn't work when we used hiveContext.
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
> files_num smaller, but it affects the concurrency.
> Can we add configuration parameters to limit the maximum number of files each 
> task is allowed to create, or limit spark_sql_partitions_num without affecting 
> the concurrency?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table

2016-04-28 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-14974:
--
Description: 
Recently, we often encounter problems when using Spark SQL to insert data into 
a partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
select xxx from tmp_table).
After the Spark job starts running on YARN, the application will create too 
many files (e.g. 2,000,000, or even 10,000,000), which puts HDFS under enormous 
pressure.
We found that the number of files created by the Spark job depends on the 
number of partitions of the Hive table being inserted into and the number of 
Spark SQL partitions:
files_num = hive_table_partions_num * spark_sql_partitions_num.
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
and hive_table_partions_num is very small under normal circumstances, but it 
can turn out to be more than 2000 when we accidentally use a wrong field as the 
partition field, which makes files_num >= 1000 * 2000 = 2,000,000.

There is a configuration parameter in Hive, 
hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
dynamic partitions allowed to be created in each mapper/reducer, but this 
parameter didn't work when we used hiveContext.

Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
files_num smaller, but it will affect the concurrency.

Can we add configuration parameters to limit the maximum number of files 
allowed to be created by each task, or limit spark_sql_partitions_num without 
affecting the concurrency?
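
For illustration only, a minimal sketch of the trade-off just mentioned, with hypothetical table names and assuming a spark-shell-style SparkContext named sc:
{code:java}
// Minimal sketch of the existing workaround, not a hard limit on created
// files: lowering spark.sql.shuffle.partitions before the insert reduces
// files_num (= hive_table_partions_num * spark_sql_partitions_num) at the
// cost of write concurrency.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.shuffle.partitions", "200")
hiveContext.sql(
  "insert overwrite table output_table partition(dt) select cityname, dt from tmp_table")
{code}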

  was:
Recently, we often encounter problems when using Spark SQL to insert data into 
a partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
select xxx from tmp_table).
After the Spark job starts running on YARN, the application will create too 
many files (e.g. 2,000,000, or even 10,000,000), which puts HDFS under enormous 
pressure.
We found that the number of files created by the Spark job depends on the 
number of partitions of the Hive table being inserted into and the number of 
Spark SQL partitions:
files_num = hive_table_partions_num * spark_sql_partitions_num.
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
and hive_table_partions_num is very small under normal circumstances, but it 
can turn out to be more than 2000 when we accidentally use a wrong field as the 
partition field, which makes files_num >= 1000 * 2000 = 2,000,000.

There is a configuration parameter in Hive, 
hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
dynamic partitions allowed to be created in each mapper/reducer, but this 
parameter didn't work when we used hiveContext.

Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
files_num smaller, but it will affect the concurrency.


> spark sql job create too many files in HDFS when doing insert overwrite hive 
> table
> --
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>
> Recently, we often encounter problems when using Spark SQL to insert data 
> into a partitioned table (e.g.: insert overwrite table $output_table 
> partition(dt) select xxx from tmp_table).
> After the Spark job starts running on YARN, the application will create too 
> many files (e.g. 2,000,000, or even 10,000,000), which puts HDFS under 
> enormous pressure.
> We found that the number of files created by the Spark job depends on the 
> number of partitions of the Hive table being inserted into and the number of 
> Spark SQL partitions:
> files_num = hive_table_partions_num * spark_sql_partitions_num.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
> and hive_table_partions_num is very small under normal circumstances, but it 
> can turn out to be more than 2000 when we accidentally use a wrong field as 
> the partition field, which makes files_num >= 1000 * 2000 = 2,000,000.
> There is a configuration parameter in Hive, 
> hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
> dynamic partitions allowed to be created in each mapper/reducer, but this 
> parameter didn't work when we used hiveContext.
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
> files_num smaller, but it will affect the concurrency.
> Can we add configuration parameters to limit the maximum number of files 
> allowed to be created by each task, or limit spark_sql_partitions_num without 
> affecting the concurrency?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table

2016-04-28 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-14974:
--
Description: 
Recently, we often encounter problems when using Spark SQL to insert data into 
a partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
select xxx from tmp_table).
After the Spark job starts running on YARN, the application will create too 
many files (e.g. 2,000,000, or even 10,000,000), which puts HDFS under enormous 
pressure.
We found that the number of files created by the Spark job depends on the 
number of partitions of the Hive table being inserted into and the number of 
Spark SQL partitions:
files_num = hive_table_partions_num * spark_sql_partitions_num.
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
and hive_table_partions_num is very small under normal circumstances, but it 
can turn out to be more than 2000 when we accidentally use a wrong field as the 
partition field, which makes files_num >= 1000 * 2000 = 2,000,000.

There is a configuration parameter in Hive, 
hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
dynamic partitions allowed to be created in each mapper/reducer, but this 
parameter didn't work when we used hiveContext.

Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
files_num smaller, but it will affect the concurrency.

  was:
Recently, we often encounter problems when using Spark SQL to insert data into 
a partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
select xxx from tmp_table).
After the Spark job starts running on YARN, the application will create too 
many files (e.g. 200w+, or even 1000w+), which puts HDFS under enormous 
pressure.
We found that the number of files created by the Spark job depends on the 
number of partitions of the Hive table being inserted into and the number of 
Spark SQL partitions:
files_num = hive_table_partions_num * spark_sql_partitions_num.
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
and hive_table_partions_num is very small under normal circumstances, but it 
can turn out to be more than 2000 when we accidentally use a wrong field as the 
partition field, which makes files_num >= 1000 * 2000 = 200w.

There is a configuration parameter in Hive, 
hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
dynamic partitions allowed to be created in each mapper/reducer, but this 
parameter didn't work when we used hiveContext.

Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
files_num smaller, but it will affect the concurrency.


> spark sql job create too many files in HDFS when doing insert overwrite hive 
> table
> --
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>
> Recently, we often encounter problems when using Spark SQL to insert data 
> into a partitioned table (e.g.: insert overwrite table $output_table 
> partition(dt) select xxx from tmp_table).
> After the Spark job starts running on YARN, the application will create too 
> many files (e.g. 2,000,000, or even 10,000,000), which puts HDFS under 
> enormous pressure.
> We found that the number of files created by the Spark job depends on the 
> number of partitions of the Hive table being inserted into and the number of 
> Spark SQL partitions:
> files_num = hive_table_partions_num * spark_sql_partitions_num.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
> and hive_table_partions_num is very small under normal circumstances, but it 
> can turn out to be more than 2000 when we accidentally use a wrong field as 
> the partition field, which makes files_num >= 1000 * 2000 = 2,000,000.
> There is a configuration parameter in Hive, 
> hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
> dynamic partitions allowed to be created in each mapper/reducer, but this 
> parameter didn't work when we used hiveContext.
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
> files_num smaller, but it will affect the concurrency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14974) spark sql job create too many files in HDFS when doing insert overwrite hive table

2016-04-28 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-14974:
--
Summary: spark sql job create too many files in HDFS when doing insert 
overwrite hive table  (was: spark sql insert hive table write too many files)

> spark sql job create too many files in HDFS when doing insert overwrite hive 
> table
> --
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>
> Recently, we often encounter problems when using Spark SQL to insert data 
> into a partitioned table (e.g.: insert overwrite table $output_table 
> partition(dt) select xxx from tmp_table).
> After the Spark job starts running on YARN, the application will create too 
> many files (e.g. 200w+, or even 1000w+), which puts HDFS under enormous 
> pressure.
> We found that the number of files created by the Spark job depends on the 
> number of partitions of the Hive table being inserted into and the number of 
> Spark SQL partitions:
> files_num = hive_table_partions_num * spark_sql_partitions_num.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
> and hive_table_partions_num is very small under normal circumstances, but it 
> can turn out to be more than 2000 when we accidentally use a wrong field as 
> the partition field, which makes files_num >= 1000 * 2000 = 200w.
> There is a configuration parameter in Hive, 
> hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
> dynamic partitions allowed to be created in each mapper/reducer, but this 
> parameter didn't work when we used hiveContext.
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
> files_num smaller, but it will affect the concurrency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14974) spark sql insert hive table write too many files

2016-04-28 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-14974:
--
Description: 
Recently, we often encounter problems when using Spark SQL to insert data into 
a partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
select xxx from tmp_table).
After the Spark job starts running on YARN, the application will create too 
many files (e.g. 200w+, or even 1000w+), which puts HDFS under enormous 
pressure.
We found that the number of files created by the Spark job depends on the 
number of partitions of the Hive table being inserted into and the number of 
Spark SQL partitions:
files_num = hive_table_partions_num * spark_sql_partitions_num.
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
and hive_table_partions_num is very small under normal circumstances, but it 
can turn out to be more than 2000 when we accidentally use a wrong field as the 
partition field, which makes files_num >= 1000 * 2000 = 200w.

There is a configuration parameter in Hive, 
hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
dynamic partitions allowed to be created in each mapper/reducer, but this 
parameter didn't work when we used hiveContext.

Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
files_num smaller, but it will affect the concurrency.

  was:
Recently, we often encounter problems when using Spark SQL to insert data into 
a partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
select xxx from tmp_table).
After the Spark job starts running on YARN, the application will create too 
many files (e.g. 200w+, or even 1000w+), which puts HDFS under enormous 
pressure.
We found that the number of files created by the Spark job depends on the 
number of partitions of the Hive table being inserted into and the number of 
Spark SQL partitions:
files_num = hive_table_partions_num * spark_sql_partitions_num.
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
and hive_table_partions_num is very small under normal circumstances, but it 
can turn out to be more than 2000 when we accidentally use a wrong field as the 
partition field, which makes files_num >= 1000 * 2000 = 200w.


> spark sql insert hive table write too many files
> 
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>
> Recently, we often encounter problems when using Spark SQL to insert data 
> into a partitioned table (e.g.: insert overwrite table $output_table 
> partition(dt) select xxx from tmp_table).
> After the Spark job starts running on YARN, the application will create too 
> many files (e.g. 200w+, or even 1000w+), which puts HDFS under enormous 
> pressure.
> We found that the number of files created by the Spark job depends on the 
> number of partitions of the Hive table being inserted into and the number of 
> Spark SQL partitions:
> files_num = hive_table_partions_num * spark_sql_partitions_num.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
> and hive_table_partions_num is very small under normal circumstances, but it 
> can turn out to be more than 2000 when we accidentally use a wrong field as 
> the partition field, which makes files_num >= 1000 * 2000 = 200w.
> There is a configuration parameter in Hive, 
> hive.exec.max.dynamic.partitions.pernode, that limits the maximum number of 
> dynamic partitions allowed to be created in each mapper/reducer, but this 
> parameter didn't work when we used hiveContext.
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make 
> files_num smaller, but it will affect the concurrency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14974) spark sql insert hive table write too many files

2016-04-28 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-14974:
--
Description: 
Recently, we often encounter problems when using Spark SQL to insert data into 
a partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
select xxx from tmp_table).
After the Spark job starts running on YARN, the application will create too 
many files (e.g. 200w+, or even 1000w+), which puts HDFS under enormous 
pressure.
We found that the number of files created by the Spark job depends on the 
number of partitions of the Hive table being inserted into and the number of 
Spark SQL partitions:
files_num = hive_table_partions_num * spark_sql_partitions_num.
We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
and hive_table_partions_num is very small under normal circumstances, but it 
can turn out to be more than 2000 when we accidentally use a wrong field as the 
partition field, which makes files_num >= 1000 * 2000 = 200w.

> spark sql insert hive table write too many files
> 
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>
> Recently, we often encounter problems when using Spark SQL to insert data 
> into a partitioned table (e.g.: insert overwrite table $output_table 
> partition(dt) select xxx from tmp_table).
> After the Spark job starts running on YARN, the application will create too 
> many files (e.g. 200w+, or even 1000w+), which puts HDFS under enormous 
> pressure.
> We found that the number of files created by the Spark job depends on the 
> number of partitions of the Hive table being inserted into and the number of 
> Spark SQL partitions:
> files_num = hive_table_partions_num * spark_sql_partitions_num.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, 
> and hive_table_partions_num is very small under normal circumstances, but it 
> can turn out to be more than 2000 when we accidentally use a wrong field as 
> the partition field, which makes files_num >= 1000 * 2000 = 200w.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14974) spark sql insert hive table write too many files

2016-04-28 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-14974:
--
Summary: spark sql insert hive table write too many files  (was: spark sql 
insert hive table write too much files)

> spark sql insert hive table write too many files
> 
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14974) spark sql insert hive table write too much files

2016-04-28 Thread zenglinxi (JIRA)
zenglinxi created SPARK-14974:
-

 Summary: spark sql insert hive table write too much files
 Key: SPARK-14974
 URL: https://issues.apache.org/jira/browse/SPARK-14974
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 1.5.2
Reporter: zenglinxi






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org