[jira] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-11-28 Thread jeanlyn (Jira)


[ https://issues.apache.org/jira/browse/SPARK-43106 ]


jeanlyn deleted comment on SPARK-43106:
-

was (Author: jeanlyn):
I think we encountered a similar problem; we work around it by setting 
*spark.sql.hive.convertInsertingPartitionedTable=false*

> Data lost from the table if the INSERT OVERWRITE query fails
> 
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vaibhav Beriwala
>Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
> Spark has the following behavior:
> 1) It will first clean up all the data from the actual table path.
> 2) It will then launch a job that performs the actual insert.
>  
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 above fails for any reason, the data 
> from the original table is lost. 
> 2) If the insert job in step 2 above takes a long time to complete, the 
> table data is unavailable to other readers for the entire duration of the 
> job.
> This behavior is the same even for the partitioned tables when using static 
> partitioning. For dynamic partitioning, we do not delete the table data 
> before the job launch.
>  
> Is there a reason why we perform this delete before the job launch and 
> not as part of the job commit operation? This issue does not exist with Hive, 
> where the data is presumably cleaned up as part of the job commit operation. As 
> part of SPARK-19183, we did add a new hook in the commit protocol for this 
> exact purpose, but it seems its default behavior is still to delete 
> the table data before the job launch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43106) Data lost from the table if the INSERT OVERWRITE query fails

2023-11-27 Thread jeanlyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790135#comment-17790135
 ] 

jeanlyn commented on SPARK-43106:
-

I think we encountered a similar problem; we work around it by setting 
*spark.sql.hive.convertInsertingPartitionedTable=false*
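
For reference, a minimal sketch (Scala; the tables named below are hypothetical) of how the workaround above could be wired in at session construction. As I understand the flag, disabling it makes Spark use Hive's own insertion path instead of the converted datasource write path:
{code}
// Hedged sketch: set the workaround flag before running the INSERT OVERWRITE.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("insert-overwrite-workaround")
  .config("spark.sql.hive.convertInsertingPartitionedTable", "false")
  .enableHiveSupport()
  .getOrCreate()

// Illustrative statement only; `tgt` and `src` are hypothetical tables.
spark.sql("INSERT OVERWRITE TABLE tgt SELECT * FROM src")
{code}
The same flag can also be set per session with a SQL {{SET}} command before the insert runs.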

> Data lost from the table if the INSERT OVERWRITE query fails
> 
>
> Key: SPARK-43106
> URL: https://issues.apache.org/jira/browse/SPARK-43106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vaibhav Beriwala
>Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
> Spark has the following behavior:
> 1) It will first clean up all the data from the actual table path.
> 2) It will then launch a job that performs the actual insert.
>  
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 above fails for any reason, the data 
> from the original table is lost. 
> 2) If the insert job in step 2 above takes a long time to complete, the 
> table data is unavailable to other readers for the entire duration of the 
> job.
> This behavior is the same even for the partitioned tables when using static 
> partitioning. For dynamic partitioning, we do not delete the table data 
> before the job launch.
>  
> Is there a reason why we perform this delete before the job launch and 
> not as part of the job commit operation? This issue does not exist with Hive, 
> where the data is presumably cleaned up as part of the job commit operation. As 
> part of SPARK-19183, we did add a new hook in the commit protocol for this 
> exact purpose, but it seems its default behavior is still to delete 
> the table data before the job launch.
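
To make the failure window above concrete, here is a hedged, self-contained sketch (Scala; the table name {{tgt}} and the failing UDF are hypothetical) of the sequence the description outlines: the target's files are removed before the insert job runs, so a mid-job failure can leave the table empty:
{code}
// Hedged sketch of the data-loss window described above (hypothetical names).
import org.apache.spark.sql.SparkSession

object OverwriteFailureSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("overwrite-failure-sketch")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS tgt (id INT) STORED AS PARQUET")
    spark.sql("INSERT INTO tgt VALUES (1), (2), (3)")

    // A UDF that fails partway through, standing in for any mid-job failure.
    spark.udf.register("boom", (i: Int) => {
      if (i > 5) throw new RuntimeException("simulated task failure"); i
    })

    try {
      // The old contents of `tgt` may already be deleted before this job runs.
      spark.sql("INSERT OVERWRITE TABLE tgt SELECT boom(CAST(id AS INT)) FROM range(10)")
    } catch {
      case e: Exception => println(s"insert failed: ${e.getMessage}")
    }

    // Depending on the write path, this can now report an empty table.
    spark.sql("SELECT COUNT(*) FROM tgt").show()
  }
}
{code}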






[jira] [Commented] (SPARK-35635) concurrent insert statements from multiple beeline fail with job aborted exception

2023-07-08 Thread jeanlyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741309#comment-17741309
 ] 

jeanlyn commented on SPARK-35635:
-

When tasks are running concurrently, the "_temporary" directory may be deleted 
multiple times, which can cause the job to fail. Would it be more appropriate to 
reopen this issue? [~gurwls223]
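
For what it's worth, a rough sketch (Scala; the table name is hypothetical, and the concurrent sessions are simplified to threads in one application rather than separate beeline sessions) of the concurrent INSERT pattern where the shared {{_temporary}} staging directory can be cleaned up by one job while another still needs it:
{code}
// Hedged sketch of concurrent INSERTs sharing one _temporary staging directory.
import org.apache.spark.sql.SparkSession

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

object ConcurrentInsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("concurrent-insert-sketch")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE TABLE IF NOT EXISTS j1_tbl (i INT, j INT, t STRING) USING parquet")

    // Each future plays the role of one beeline session; with the default
    // committer they all stage output under the table path's _temporary dir.
    val inserts = (1 to 4).map { n =>
      Future(spark.sql(s"INSERT INTO j1_tbl VALUES ($n, $n, 'row$n')"))
    }
    inserts.foreach(f => Await.ready(f, Duration.Inf))
  }
}
{code}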

> concurrent insert statements from multiple beeline fail with job aborted 
> exception
> --
>
> Key: SPARK-35635
> URL: https://issues.apache.org/jira/browse/SPARK-35635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1
>Reporter: Chetan Bhat
>Priority: Minor
>
> Create tables - 
> CREATE TABLE J1_TBL (
>  i integer,
>  j integer,
>  t string
> ) USING parquet;
> CREATE TABLE J2_TBL (
>  i integer,
>  k integer
> ) USING parquet;
> From 4 concurrent beeline sessions execute the insert into select queries - 
> INSERT INTO J1_TBL VALUES (1, 4, 'one');
> INSERT INTO J1_TBL VALUES (2, 3, 'two');
> INSERT INTO J1_TBL VALUES (3, 2, 'three');
> INSERT INTO J1_TBL VALUES (4, 1, 'four');
> INSERT INTO J1_TBL VALUES (5, 0, 'five');
> INSERT INTO J1_TBL VALUES (6, 6, 'six');
> INSERT INTO J1_TBL VALUES (7, 7, 'seven');
> INSERT INTO J1_TBL VALUES (8, 8, 'eight');
> INSERT INTO J1_TBL VALUES (0, NULL, 'zero');
> INSERT INTO J1_TBL VALUES (NULL, NULL, 'null');
> INSERT INTO J1_TBL VALUES (NULL, 0, 'zero');
> INSERT INTO J2_TBL VALUES (1, -1);
> INSERT INTO J2_TBL VALUES (2, 2);
> INSERT INTO J2_TBL VALUES (3, -3);
> INSERT INTO J2_TBL VALUES (2, 4);
> INSERT INTO J2_TBL VALUES (5, -5);
> INSERT INTO J2_TBL VALUES (5, -5);
> INSERT INTO J2_TBL VALUES (0, NULL);
> INSERT INTO J2_TBL VALUES (NULL, NULL);
> INSERT INTO J2_TBL VALUES (NULL, 0);
>  
> Issue : concurrent insert statements from multiple beeline fail with job 
> aborted exception.
> 0: jdbc:hive2://10.19.89.222:23040/> INSERT INTO J1_TBL VALUES (8, 8, 
> 'eight');
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted.
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:366)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:263)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3$$Lambda$1781/750578465.apply$mcV$sp(Unknown
>  Source)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:45)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:263)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:258)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:272)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Job aborted.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:109)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:107)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:121)
>  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
>  at 

[jira] [Commented] (SPARK-35635) concurrent insert statements from multiple beeline fail with job aborted exception

2023-07-08 Thread jeanlyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741296#comment-17741296
 ] 

jeanlyn commented on SPARK-35635:
-

We encountered the same issue when writing concurrently to different partitions 
of the same table.

> concurrent insert statements from multiple beeline fail with job aborted 
> exception
> --
>
> Key: SPARK-35635
> URL: https://issues.apache.org/jira/browse/SPARK-35635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1
>Reporter: Chetan Bhat
>Priority: Minor
>
> Create tables - 
> CREATE TABLE J1_TBL (
>  i integer,
>  j integer,
>  t string
> ) USING parquet;
> CREATE TABLE J2_TBL (
>  i integer,
>  k integer
> ) USING parquet;
> From 4 concurrent beeline sessions execute the insert into select queries - 
> INSERT INTO J1_TBL VALUES (1, 4, 'one');
> INSERT INTO J1_TBL VALUES (2, 3, 'two');
> INSERT INTO J1_TBL VALUES (3, 2, 'three');
> INSERT INTO J1_TBL VALUES (4, 1, 'four');
> INSERT INTO J1_TBL VALUES (5, 0, 'five');
> INSERT INTO J1_TBL VALUES (6, 6, 'six');
> INSERT INTO J1_TBL VALUES (7, 7, 'seven');
> INSERT INTO J1_TBL VALUES (8, 8, 'eight');
> INSERT INTO J1_TBL VALUES (0, NULL, 'zero');
> INSERT INTO J1_TBL VALUES (NULL, NULL, 'null');
> INSERT INTO J1_TBL VALUES (NULL, 0, 'zero');
> INSERT INTO J2_TBL VALUES (1, -1);
> INSERT INTO J2_TBL VALUES (2, 2);
> INSERT INTO J2_TBL VALUES (3, -3);
> INSERT INTO J2_TBL VALUES (2, 4);
> INSERT INTO J2_TBL VALUES (5, -5);
> INSERT INTO J2_TBL VALUES (5, -5);
> INSERT INTO J2_TBL VALUES (0, NULL);
> INSERT INTO J2_TBL VALUES (NULL, NULL);
> INSERT INTO J2_TBL VALUES (NULL, 0);
>  
> Issue : concurrent insert statements from multiple beeline fail with job 
> aborted exception.
> 0: jdbc:hive2://10.19.89.222:23040/> INSERT INTO J1_TBL VALUES (8, 8, 
> 'eight');
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted.
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:366)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:263)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3$$Lambda$1781/750578465.apply$mcV$sp(Unknown
>  Source)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:45)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:263)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:258)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:272)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Job aborted.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:109)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:107)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:121)
>  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
>  at org.apache.spark.sql.Dataset$$Lambda$1650/1168893915.apply(Unknown Source)
>  at 

[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2023-06-20 Thread jeanlyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735519#comment-17735519
 ] 

jeanlyn commented on SPARK-38230:
-

We found the Hive metastore crashing frequently after upgrading Spark from 2.4.7 
to 3.3.2. After investigation, I found that `InsertIntoHadoopFsRelationCommand` 
pulls all partitions of the table when dynamicPartitionOverwrite is used. In our 
environment we solved the problem by using the generated output paths to look up 
only the affected partitions, and then found this issue. I have submitted a new 
pull request that I hope helps.

> InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions 
> in most cases
> ---
>
> Key: SPARK-38230
> URL: https://issues.apache.org/jira/browse/SPARK-38230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.3.0, 3.4.0, 3.5.0
>Reporter: Coal Chan
>Priority: Major
>
> In 
> `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`,
>  `sparkSession.sessionState.catalog.listPartitions` will call method 
> `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth` of hive metastore 
> client. This method produces multiple queries per partition on the Hive 
> metastore DB, so when you insert into a table with many 
> partitions (e.g. 10k), it produces a very large number of queries on the 
> metastore DB (e.g. n * 10k), which puts a lot of strain on the database.
> In fact, it calls `listPartitions` only in order to get the locations of the 
> partitions and compute `customPartitionLocations`. But in most cases we do not 
> have custom partition locations, so getting the partition names via 
> `listPartitionNames` is enough.
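
A rough spark-shell sketch (Scala, using Spark's internal and unstable {{SessionCatalog}} API; the table name is hypothetical) of the difference being described: {{listPartitions}} fetches full partition metadata including locations, while {{listPartitionNames}} only returns name strings and is much lighter on the metastore:
{code}
// Hedged illustration only; `sessionState` and `SessionCatalog` are internal,
// unstable APIs, and the table name is hypothetical.
import org.apache.spark.sql.catalyst.TableIdentifier

val catalog = spark.sessionState.catalog
val table = TableIdentifier("some_partitioned_table", Some("default"))

// Heavy: full CatalogTablePartition objects, locations included.
val fullPartitions = catalog.listPartitions(table)

// Light: just the partition name strings, which is enough when there are no
// custom partition locations.
val partitionNames = catalog.listPartitionNames(table)

println(s"${fullPartitions.size} partitions, ${partitionNames.size} names")
{code}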






[jira] [Commented] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks

2016-03-29 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216229#comment-15216229
 ] 

jeanlyn commented on SPARK-14243:
-

[~andrewor14] Let me know if the description is not detailed enough. Also, I 
will try to fix it in the next few days. :-)

> updatedBlockStatuses does not update correctly when removing blocks
> ---
>
> Key: SPARK-14243
> URL: https://issues.apache.org/jira/browse/SPARK-14243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1
>Reporter: jeanlyn
>
> Currently, *updatedBlockStatuses* of *TaskMetrics* does not update correctly 
> when blocks are removed in *BlockManager.removeBlock* and in the methods that 
> invoke *removeBlock*. See:
> branch-1.6:https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108
> branch-1.5:https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101
> We should make sure *updatedBlockStatuses* update correctly when:
> * Block removed from BlockManager
> * Block dropped from memory to disk
> * Block added to BlockManager






[jira] [Updated] (SPARK-14243) updatedBlockStatuses does not update correctly when removing blocks

2016-03-29 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-14243:

Summary: updatedBlockStatuses does not update correctly when removing 
blocks  (was: updatedBlockStatuses does not update correctly )

> updatedBlockStatuses does not update correctly when removing blocks
> ---
>
> Key: SPARK-14243
> URL: https://issues.apache.org/jira/browse/SPARK-14243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1
>Reporter: jeanlyn
>
> Currently, *updatedBlockStatuses* of *TaskMetrics* does not update correctly 
> when blocks are removed in *BlockManager.removeBlock* and in the methods that 
> invoke *removeBlock*. See:
> branch-1.6:https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108
> branch-1.5:https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101
> We should make sure *updatedBlockStatuses* update correctly when:
> * Block removed from BlockManager
> * Block dropped from memory to disk
> * Block added to BlockManager






[jira] [Created] (SPARK-14243) updatedBlockStatuses does not update correctly

2016-03-29 Thread jeanlyn (JIRA)
jeanlyn created SPARK-14243:
---

 Summary: updatedBlockStatuses does not update correctly 
 Key: SPARK-14243
 URL: https://issues.apache.org/jira/browse/SPARK-14243
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1, 1.5.2
Reporter: jeanlyn


Currently, *updatedBlockStatuses* of *TaskMetrics* does not update correctly 
when blocks are removed in *BlockManager.removeBlock* and in the methods that 
invoke *removeBlock*. See:
branch-1.6:https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1108
branch-1.5:https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1101
We should make sure *updatedBlockStatuses* update correctly when:
* Block removed from BlockManager
* Block dropped from memory to disk
* Block added to BlockManager






[jira] [Updated] (SPARK-13845) BlockStatus and StreamBlockId keep on growing result driver OOM

2016-03-13 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-13845:

Summary: BlockStatus and StreamBlockId keep on growing result driver OOM  
(was: Driver OOM after few days when running streaming)

> BlockStatus and StreamBlockId keep on growing result driver OOM
> ---
>
> Key: SPARK-13845
> URL: https://issues.apache.org/jira/browse/SPARK-13845
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.1
>Reporter: jeanlyn
>
> We have a streaming job using *FlumePollInputStream* whose driver always OOMs 
> after a few days. Here is part of a driver heap dump taken before the OOM:
> {noformat}
>  num #instances #bytes  class name
> --
>1:  13845916  553836640  org.apache.spark.storage.BlockStatus
>2:  14020324  336487776  org.apache.spark.storage.StreamBlockId
>3:  13883881  333213144  scala.collection.mutable.DefaultEntry
>4:  8907   89043952  [Lscala.collection.mutable.HashEntry;
>5: 62360   65107352  [B
>6:163368   24453904  [Ljava.lang.Object;
>7:293651   20342664  [C
> ...
> {noformat}
> *BlockStatus* and *StreamBlockId* keep growing, and the driver eventually 
> OOMs.






[jira] [Created] (SPARK-13845) Driver OOM after few days when running streaming

2016-03-13 Thread jeanlyn (JIRA)
jeanlyn created SPARK-13845:
---

 Summary: Driver OOM after few days when running streaming
 Key: SPARK-13845
 URL: https://issues.apache.org/jira/browse/SPARK-13845
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1, 1.5.2
Reporter: jeanlyn


We have a streaming job using *FlumePollInputStream* whose driver always OOMs 
after a few days. Here is part of a driver heap dump taken before the OOM:
{noformat}
 num #instances #bytes  class name
--
   1:  13845916  553836640  org.apache.spark.storage.BlockStatus
   2:  14020324  336487776  org.apache.spark.storage.StreamBlockId
   3:  13883881  333213144  scala.collection.mutable.DefaultEntry
   4:  8907   89043952  [Lscala.collection.mutable.HashEntry;
   5: 62360   65107352  [B
   6:163368   24453904  [Ljava.lang.Object;
   7:293651   20342664  [C
...
{noformat}
*BlockStatus* and *StreamBlockId* keep growing, and the driver eventually 
OOMs.






[jira] [Closed] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext

2016-03-01 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn closed SPARK-13586.
---
Resolution: Invalid

> add config to skip generate down time batch when restart StreamingContext
> -
>
> Key: SPARK-13586
> URL: https://issues.apache.org/jira/browse/SPARK-13586
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: jeanlyn
>Priority: Minor
>
> If we restart a streaming job that uses checkpointing and has been stopped for 
> hours, it will generate a lot of batches into the queue, and it takes a while 
> to handle these batches. So I propose adding a config to control whether the 
> down-time batches are generated.






[jira] [Updated] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext

2016-02-29 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-13586:

Priority: Minor  (was: Major)

> add config to skip generate down time batch when restart StreamingContext
> -
>
> Key: SPARK-13586
> URL: https://issues.apache.org/jira/browse/SPARK-13586
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: jeanlyn
>Priority: Minor
>
> If we restart a streaming job that uses checkpointing and has been stopped for 
> hours, it will generate a lot of batches into the queue, and it takes a while 
> to handle these batches. So I propose adding a config to control whether the 
> down-time batches are generated.






[jira] [Created] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext

2016-02-29 Thread jeanlyn (JIRA)
jeanlyn created SPARK-13586:
---

 Summary: add config to skip generate down time batch when restart 
StreamingContext
 Key: SPARK-13586
 URL: https://issues.apache.org/jira/browse/SPARK-13586
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.6.0
Reporter: jeanlyn


If we restart a streaming job that uses checkpointing and has been stopped for 
hours, it will generate a lot of batches into the queue, and it takes a while 
to handle these batches. So I propose adding a config to control whether the 
down-time batches are generated.
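
To make the proposal concrete, a minimal sketch (Scala; the checkpoint path and socket source are hypothetical) of the restart path this targets: when the context is rebuilt from the checkpoint, a batch is generated for every interval that elapsed while the job was down:
{code}
// Hedged sketch of a checkpoint-based restart; paths and intervals are illustrative.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRestartSketch {
  def createContext(checkpointDir: String): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-restart-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // Hypothetical input; any receiver or direct stream behaves the same way here.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // hypothetical path
    // After hours of downtime, every missed 10s interval becomes a queued
    // batch on restart; the proposed config would allow skipping them.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext(checkpointDir))
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}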






[jira] [Created] (SPARK-13356) WebUI missing input informations when recovering from dirver failure

2016-02-16 Thread jeanlyn (JIRA)
jeanlyn created SPARK-13356:
---

 Summary: WebUI missing input informations when recovering from 
dirver failure
 Key: SPARK-13356
 URL: https://issues.apache.org/jira/browse/SPARK-13356
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0, 1.5.2, 1.5.1, 1.5.0
Reporter: jeanlyn


The WebUI is missing some input information when streaming recovers from a 
checkpoint; it may make people think data was lost during recovery from the failure.
For example:
!DirectKafkaScreenshot.jpg!






[jira] [Updated] (SPARK-13356) WebUI missing input informations when recovering from dirver failure

2016-02-16 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-13356:

Attachment: DirectKafkaScreenshot.jpg

> WebUI missing input informations when recovering from dirver failure
> 
>
> Key: SPARK-13356
> URL: https://issues.apache.org/jira/browse/SPARK-13356
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0
>Reporter: jeanlyn
> Attachments: DirectKafkaScreenshot.jpg
>
>
> The WebUI is missing some input information when streaming recovers from a 
> checkpoint; it may make people think data was lost during recovery from the failure.
> For example:
> !DirectKafkaScreenshot.jpg!






[jira] [Commented] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-08-17 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699117#comment-14699117
 ] 

jeanlyn commented on SPARK-8513:


I think I encountered this problem recently. Our job failed due to a leftover 
{{_temporary}} directory, and the Hive API that updates partitions throws an 
exception if the directory has a nested directory.
{noformat}
2015-08-12 07:07:07 INFO org.apache.hadoop.hive.ql.metadata.HiveException: 
checkPaths: 
hdfs://ns1/tmp/hive-dd_edw/hive_2015-08-12_07-02-20_902_7762418154833191311-1/-ext-1
 has nested 
directoryhdfs://ns1/tmp/hive-dd_edw/hive_2015-08-12_07-02-20_902_7762418154833191311-1/-ext-1/_temporary
at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2080)
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2270)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1222)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:233)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:266)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1140)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1140)
at org.apache.spark.sql.DataFrame.init(DataFrame.scala:147)
at org.apache.spark.sql.DataFrame.init(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:97)
at 
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:273)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:507)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:442)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:148)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:619)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}

In our case, all of our tasks had finished except the speculative task.
{code}
15/08/12 07:07:06 INFO TaskSetManager: Marking task 6 in stage 40.0 (on 
BJHC-HERA-17163.hadoop.local) as speculatable because it ran more than 33639 ms

(speculative task) 15/08/12 07:07:06 INFO TaskSetManager: Starting task 
6.1 in stage 40.0 (TID 165, BJHC-HERA-16580.hadoop.local, PROCESS_LOCAL, 1687 
bytes)

15/08/12 07:07:06 INFO BlockManagerInfo: Added broadcast_60_piece0 in memory on 
BJHC-HERA-16580.hadoop.local:48182 (size: 740.2 KB, free: 2.1 GB)
15/08/12 07:07:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map output 
locations for shuffle 1 to BJHC-HERA-16580.hadoop.local:9208
15/08/12 07:07:07 INFO TaskSetManager: Finished task 6.0 in stage 40.0 (TID 
161) in 34449 ms on BJHC-HERA-17163.hadoop.local (10/10)
15/08/12 07:07:07 INFO DAGScheduler: ResultStage 40 (runJob at 
InsertIntoHiveTable.scala:83) finished in 34.457 s
{code}
However, I cannot find any code that cancels the speculative task. So, if we 
want to fix this issue, do we also need to add cancel logic (killing the 
speculative tasks) when the job finishes, before making task cancellation 
synchronous?
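
A hedged aside, not a fix for the race itself: one mitigation while this is unresolved is to keep speculation disabled for jobs that write through Hive's loadPartition path, so no straggler attempt can recreate {{_temporary}} after the committer has cleaned it up. A minimal sketch (Scala):
{code}
// Hedged mitigation sketch: disable speculation for write-heavy Hive jobs.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("hive-write-job")
  // spark.speculation defaults to false; setting it explicitly documents the
  // intent for jobs whose output path cannot tolerate straggler re-attempts.
  .set("spark.speculation", "false")
// The same can be passed at submit time: spark-submit --conf spark.speculation=false ...
{code}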


 _temporary may be left undeleted when a write job committed with 
 FileOutputCommitter fails due to a race condition
 --

 Key: SPARK-8513
 URL: https://issues.apache.org/jira/browse/SPARK-8513
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.2.2, 1.3.1, 1.4.0
Reporter: Cheng Lian

 To reproduce this issue, we need a node with relatively many cores, say 32 
 (e.g., the Spark Jenkins builder is a good candidate). With such a node, the 
 following code should reproduce the issue fairly easily:
 {code}
 sqlContext.range(0, 10).repartition(32).select('id / 
 0).write.mode("overwrite").parquet("file:///tmp/foo")
 {code}
 You may observe similar log lines as below:
 

[jira] [Comment Edited] (SPARK-8513) _temporary may be left undeleted when a write job committed with FileOutputCommitter fails due to a race condition

2015-08-17 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699117#comment-14699117
 ] 

jeanlyn edited comment on SPARK-8513 at 8/17/15 7:30 AM:
-

I think I encountered this problem recently. Our job failed due to a leftover 
{{_temporary}} directory, and the Hive API that updates partitions throws an 
exception if the directory has a nested directory.
{noformat}
2015-08-12 07:07:07 INFO org.apache.hadoop.hive.ql.metadata.HiveException: 
checkPaths: 
hdfs://ns1/tmp/hive-dd_edw/hive_2015-08-12_07-02-20_902_7762418154833191311-1/-ext-1
 has nested 
directoryhdfs://ns1/tmp/hive-dd_edw/hive_2015-08-12_07-02-20_902_7762418154833191311-1/-ext-1/_temporary
at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2080)
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2270)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1222)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:233)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:266)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1140)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1140)
at org.apache.spark.sql.DataFrame.init(DataFrame.scala:147)
at org.apache.spark.sql.DataFrame.init(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:97)
at 
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:273)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:507)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:442)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:148)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:619)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}

In our case, all of our tasks had finished except the speculative task.
{noformat}
15/08/12 07:07:06 INFO TaskSetManager: Marking task 6 in stage 40.0 (on 
BJHC-HERA-17163.hadoop.local) as speculatable because it ran more than 33639 ms

(speculative task) 15/08/12 07:07:06 INFO TaskSetManager: Starting task 
6.1 in stage 40.0 (TID 165, BJHC-HERA-16580.hadoop.local, PROCESS_LOCAL, 1687 
bytes)

15/08/12 07:07:06 INFO BlockManagerInfo: Added broadcast_60_piece0 in memory on 
BJHC-HERA-16580.hadoop.local:48182 (size: 740.2 KB, free: 2.1 GB)
15/08/12 07:07:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map output 
locations for shuffle 1 to BJHC-HERA-16580.hadoop.local:9208
15/08/12 07:07:07 INFO TaskSetManager: Finished task 6.0 in stage 40.0 (TID 
161) in 34449 ms on BJHC-HERA-17163.hadoop.local (10/10)
15/08/12 07:07:07 INFO DAGScheduler: ResultStage 40 (runJob at 
InsertIntoHiveTable.scala:83) finished in 34.457 s
{noformat}
However, I cannot find any code that cancels the speculative task. So, if we 
want to fix this issue, do we also need to add cancel logic (killing the 
speculative tasks) when the job finishes, before making task cancellation 
synchronous?



was (Author: jeanlyn):
I think I encountered this problem recently. Our job failed due to a leftover 
{{_temporary}} directory, and the Hive API that updates partitions throws an 
exception if the directory has a nested directory.
{noformat}
2015-08-12 07:07:07 INFO org.apache.hadoop.hive.ql.metadata.HiveException: 
checkPaths: 
hdfs://ns1/tmp/hive-dd_edw/hive_2015-08-12_07-02-20_902_7762418154833191311-1/-ext-1
 has nested 
directoryhdfs://ns1/tmp/hive-dd_edw/hive_2015-08-12_07-02-20_902_7762418154833191311-1/-ext-1/_temporary
at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2080)
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2270)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1222)
at 

[jira] [Commented] (SPARK-6392) [SQL]class not found exception thows when `add jar` use spark cli

2015-08-08 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14663264#comment-14663264
 ] 

jeanlyn commented on SPARK-6392:


I thought it was fixed in my case. Do you have a more detailed description of 
the issue, or a way we can reproduce it?

 [SQL]class not found exception thows when `add jar` use spark cli 
 --

 Key: SPARK-6392
 URL: https://issues.apache.org/jira/browse/SPARK-6392
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn
Priority: Minor

 When we use the Spark CLI to add a jar dynamically, we get a 
 *java.lang.ClassNotFoundException* when we use a class from the jar to create a 
 UDF. For example:
 {noformat}
 spark-sql> add jar /home/jeanlyn/hello.jar;
 spark-sql> create temporary function hello as 'hello';
 spark-sql> select hello(name) from person;
 Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most 
 recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): 
 java.lang.ClassNotFoundException: hello
 {noformat}






[jira] [Created] (SPARK-9591) Job failed for exception during getting Broadcast variable

2015-08-04 Thread jeanlyn (JIRA)
jeanlyn created SPARK-9591:
--

 Summary: Job failed for exception during getting Broadcast variable
 Key: SPARK-9591
 URL: https://issues.apache.org/jira/browse/SPARK-9591
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1, 1.4.0, 1.3.1
Reporter: jeanlyn


The job might fail due to an exception thrown while getting the broadcast 
variable, especially when using dynamic resource allocation.
driver log
{noformat}
2015-07-21 05:36:31 INFO 15/07/21 05:36:31 WARN TaskSetManager: Lost task 496.1 
in stage 19.0 (TID 1715, XX): java.io.IOException: Failed to connect to 
X:27072
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection refused: xx:27072
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more
15/07/21 05:36:32 WARN TaskSetManager: Lost task 496.2 in stage 19.0 (TID 1744, 
x): java.io.IOException: Failed to connect to /:34070
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection refused: xxx:34070
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more

org.apache.spark.SparkException: Job aborted due to stage failure: Task 496 in 
stage 19.0 failed 4 times
{noformat}

executor log
{noformat}
15/07/21 05:36:17 ERROR shuffle.RetryingBlockFetcher: Exception while beginning 
fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to xxx
at 

[jira] [Closed] (SPARK-6392) [SQL]class not found exception thows when `add jar` use spark cli

2015-07-19 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn closed SPARK-6392.
--
Resolution: Fixed

 [SQL]class not found exception thows when `add jar` use spark cli 
 --

 Key: SPARK-6392
 URL: https://issues.apache.org/jira/browse/SPARK-6392
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn
Priority: Minor

 When we use the Spark CLI to add a jar dynamically, we get a 
 *java.lang.ClassNotFoundException* when we use a class from the jar to create a 
 UDF. For example:
 {noformat}
 spark-sql> add jar /home/jeanlyn/hello.jar;
 spark-sql> create temporary function hello as 'hello';
 spark-sql> select hello(name) from person;
 Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most 
 recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): 
 java.lang.ClassNotFoundException: hello
 {noformat}






[jira] [Commented] (SPARK-6392) [SQL]class not found exception thows when `add jar` use spark cli

2015-07-19 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633033#comment-14633033
 ] 

jeanlyn commented on SPARK-6392:


I think this issue is fixed by https://github.com/apache/spark/pull/4586.

 [SQL]class not found exception thows when `add jar` use spark cli 
 --

 Key: SPARK-6392
 URL: https://issues.apache.org/jira/browse/SPARK-6392
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn
Priority: Minor

 When we use the Spark CLI to add a jar dynamically, we get a 
 *java.lang.ClassNotFoundException* when we use a class from the jar to create a 
 UDF. For example:
 {noformat}
 spark-sql> add jar /home/jeanlyn/hello.jar;
 spark-sql> create temporary function hello as 'hello';
 spark-sql> select hello(name) from person;
 Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most 
 recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): 
 java.lang.ClassNotFoundException: hello
 {noformat}






[jira] [Created] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution

2015-06-15 Thread jeanlyn (JIRA)
jeanlyn created SPARK-8379:
--

 Summary: LeaseExpiredException when using dynamic partition with 
speculative execution
 Key: SPARK-8379
 URL: https://issues.apache.org/jira/browse/SPARK-8379
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.3.1, 1.3.0
Reporter: jeanlyn


When inserting into a table using dynamic partitions with 
*spark.speculation=true*, and skewed data in some partitions triggers the 
speculative tasks, it throws an exception like:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 Lease mismatch on 
/tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo
 owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is 
accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57
{code}
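
For context, an illustrative sketch (Scala; the table names are hypothetical) of the configuration combination the description refers to: speculation enabled plus a dynamic-partition insert whose skewed partitions make speculative write attempts likely:
{code}
// Hedged sketch of the triggering setup; `tgt` and `src` are hypothetical tables.
import org.apache.spark.sql.SparkSession

object DynamicPartitionSpeculationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dynamic-partition-speculation-sketch")
      .config("spark.speculation", "true")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    // Skew in (ds, type) makes some write tasks slow enough to be speculated,
    // and two attempts may then race on the same output file.
    spark.sql("INSERT OVERWRITE TABLE tgt PARTITION (ds, type) SELECT col, ds, type FROM src")
  }
}
{code}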






[jira] [Commented] (SPARK-8020) Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early

2015-06-02 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570303#comment-14570303
 ] 

jeanlyn commented on SPARK-8020:


I tried putting the settings back into *spark-defaults.conf* just now, and I 
built Spark with RC4. I still got the same exception as I mentioned above.
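
For reference, a hedged sketch (Scala, written against the current SparkSession API rather than the 1.4-era HiveContext; values illustrative) of the two settings the description quoted below says must be supplied together when pointing Spark SQL at a Hive 0.12 metastore:
{code}
// Hedged sketch only; the metastore version/jars values are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-0.12-metastore")
  .config("spark.sql.hive.metastore.version", "0.12.0")
  // "maven" or a classpath containing Hive 0.12 and Hadoop jars
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
{code}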

 Spark SQL conf in spark-defaults.conf make metadataHive get constructed too 
 early
 -

 Key: SPARK-8020
 URL: https://issues.apache.org/jira/browse/SPARK-8020
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.4.0


 To correctly construct a {{metadataHive}} object, we need two settings, 
 {{spark.sql.hive.metastore.version}} and {{spark.sql.hive.metastore.jars}}. 
 If users want to use Hive 0.12's metastore, they need to set 
 {{spark.sql.hive.metastore.version}} to {{0.12.0}} and set 
 {{spark.sql.hive.metastore.jars}} to {{maven}} or a classpath containing Hive 
 and Hadoop's jars. However, any spark sql setting in the 
 {{spark-defaults.conf}} will trigger the construction of {{metadataHive}} and 
 cause Spark SQL connect to the wrong metastore (e.g. connect to the local 
 derby metastore instead of a remove mysql Hive 0.12 metastore). Also, if 
 {{spark.sql.hive.metastore.version 0.12.0}} is the first conf set to SQL 
 conf, we will get
 {code}
 Exception in thread main java.lang.IllegalArgumentException: Builtin jars 
 can only be used when hive execution version == hive metastore version. 
 Execution: 0.13.1 != Metastore: 0.12.0. Specify a vaild path to the correct 
 hive jars using $HIVE_METASTORE_JARS or change 
 spark.sql.hive.metastore.version to 0.13.1.
   at 
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:186)
   at 
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:358)
   at 
 org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:186)
   at 
 org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:185)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at org.apache.spark.sql.SQLContext.init(SQLContext.scala:185)
   at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:71)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.init(SparkSQLCLIDriver.scala:248)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:136)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}






[jira] [Comment Edited] (SPARK-8020) Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early

2015-06-02 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570303#comment-14570303
 ] 

jeanlyn edited comment on SPARK-8020 at 6/3/15 5:38 AM:


I tried putting the settings back into *spark-defaults.conf* just now, and I 
built Spark with RC4. I still got the same *ClassNotFoundException* as I 
mentioned above.


was (Author: jeanlyn):
I tried putting the settings back into *spark-defaults.conf* just now, and I 
built Spark with RC4. I still got the same exception as I mentioned above.

 Spark SQL conf in spark-defaults.conf make metadataHive get constructed too 
 early
 -

 Key: SPARK-8020
 URL: https://issues.apache.org/jira/browse/SPARK-8020
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.4.0


 To correctly construct a {{metadataHive}} object, we need two settings, 
 {{spark.sql.hive.metastore.version}} and {{spark.sql.hive.metastore.jars}}. 
 If users want to use Hive 0.12's metastore, they need to set 
 {{spark.sql.hive.metastore.version}} to {{0.12.0}} and set 
 {{spark.sql.hive.metastore.jars}} to {{maven}} or a classpath containing Hive 
 and Hadoop's jars. However, any spark sql setting in the 
 {{spark-defaults.conf}} will trigger the construction of {{metadataHive}} and 
 cause Spark SQL connect to the wrong metastore (e.g. connect to the local 
 derby metastore instead of a remove mysql Hive 0.12 metastore). Also, if 
 {{spark.sql.hive.metastore.version 0.12.0}} is the first conf set to SQL 
 conf, we will get
 {code}
 Exception in thread main java.lang.IllegalArgumentException: Builtin jars 
 can only be used when hive execution version == hive metastore version. 
 Execution: 0.13.1 != Metastore: 0.12.0. Specify a vaild path to the correct 
 hive jars using $HIVE_METASTORE_JARS or change 
 spark.sql.hive.metastore.version to 0.13.1.
   at 
 org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:186)
   at 
 org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:175)
   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:358)
   at 
 org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:186)
   at 
 org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:185)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at org.apache.spark.sql.SQLContext.init(SQLContext.scala:185)
   at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:71)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:53)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.init(SparkSQLCLIDriver.scala:248)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:136)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}






[jira] [Commented] (SPARK-8020) Spark SQL in spark-defaults.conf make metadataHive get constructed too early

2015-06-01 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568427#comment-14568427
 ] 

jeanlyn commented on SPARK-8020:


[~yhuai], when I set *spark.sql.hive.metastore.jars* in spark-defaults.conf I 
got errors like yours. But when I set *spark.sql.hive.metastore.jars* in 
*hive-site.xml* I got:
{code}
15/06/02 10:42:04 INFO storage.BlockManagerMaster: Trying to register 
BlockManager
15/06/02 10:42:04 INFO storage.BlockManagerMasterEndpoint: Registering block 
manager localhost:41416 with 706.6 MB RAM, BlockManagerId(driver, localhost, 
41416)
15/06/02 10:42:04 INFO storage.BlockManagerMaster: Registered BlockManager
SET spark.sql.hive.metastore.version=0.12.0
15/06/02 10:42:04 WARN conf.HiveConf: DEPRECATED: Configuration property 
hive.metastore.local no longer has any effect. Make sure to provide a valid 
value for hive.metastore.u
ris if you are connecting to a remote metastore.
15/06/02 10:42:04 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
longer has any effect.  Use hive.hmshandler.retry.* instead
15/06/02 10:42:04 INFO hive.HiveContext: Initializing HiveMetastoreConnection 
version 0.12.0 using maven.
Ivy Default Cache set to: /home/dd_edw/.ivy2/cache
The jars for the packages stored in: /home/dd_edw/.ivy2/jars
http://www.datanucleus.org/downloads/maven2 added as a remote repository with 
the name: repo-1
:: loading settings :: url = 
jar:file:/data0/spark-1.3.0-bin-2.2.0/lib/spark-assembly-1.4.0-SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hive#hive-metastore added as a dependency
org.apache.hive#hive-exec added as a dependency
org.apache.hive#hive-common added as a dependency
org.apache.hive#hive-serde added as a dependency
com.google.guava#guava added as a dependency
org.apache.hadoop#hadoop-client added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
   confs: [default]
   found org.apache.hive#hive-metastore;0.12.0 in central
   found org.antlr#antlr;3.4 in central
   found org.antlr#antlr-runtime;3.4 in central

Exception in thread main java.lang.ClassNotFoundException: 
java.lang.NoClassDefFoundError: com/google/common/base/Preconditions when 
creating Hive client using classpath: fi
le:/tmp/hive3795822184995995241vv12/aopalliance_aopalliance-1.0.jar, 
file:/tmp/hive3795822184995995241vv12/org.apache.hive_hive-exec-0.12.0.jar, 
file:/tmp/hive3795822184995995
241vv12/org.apache.thrift_libfb303-0.9.0.jar, 
file:/tmp/hive3795822184995995241vv12/commons-digester_commons-digester-1.8.jar,
 file:/tmp/hive3795822184995995241vv12/com.sun.je
rsey_jersey-client-1.9.jar, 
file:/tmp/hive3795822184995995241vv12/org.apache.httpcomponents_httpclient-4.2.5.jar,
 file:/tmp/hive3795822184995995241vv12/org.antlr_stringtemplat
e-3.2.1.jar, 
file:/tmp/hive3795822184995995241vv12/commons-logging_commons-logging-1.1.3.jar,
 file:/tmp/hive3795822184995995241vv12/org.antlr_antlr-runtime-3.4.jar, 
file:/tmp/
hive3795822184995995241vv12/org.mockito_mockito-all-1.8.2.jar, 
file:/tmp/hive3795822184995995241vv12/org.apache.derby_derby-10.4.2.0.jar, 
file:/tmp/hive3795822184995995241vv12
/antlr_antlr-2.7.7.jar, 
file:/tmp/hive3795822184995995241vv12/commons-net_commons-net-3.1.jar, 
file:/tmp/hive3795822184995995241vv12/org.slf4j_slf4j-log4j12-1.7.5.jar, file:/t
mp/hive3795822184995995241vv12/junit_junit-3.8.1.jar, 
file:/tmp/hive3795822184995995241vv12/org.codehaus.jackson_jackson-jaxrs-1.8.8.jar,
 file:/tmp/hive3795822184995995241vv12
/commons-cli_commons-cli-1.2.jar, 
file:/tmp/hive3795822184995995241vv12/org.apache.hive_hive-serde-0.12.0.jar, 
file:/tmp/hive3795822184995995241vv12/org.codehaus.jettison_jett
ison-1.1.jar, 
file:/tmp/hive3795822184995995241vv12/javax.xml.stream_stax-api-1.0-2.jar, 
file:/tmp/hive3795822184995995241vv12/org.apache.avro_avro-1.7.4.jar, 
file:/tmp/hive37
95822184995995241vv12/org.apache.hadoop_hadoop-mapreduce-client-app-2.4.0.jar, 
file:/tmp/hive3795822184995995241vv12/org.apache.hadoop_hadoop-mapreduce-client-common-2.4.0.jar
, 
file:/tmp/hive3795822184995995241vv12/org.codehaus.jackson_jackson-xc-1.8.8.jar,
 
file:/tmp/hive3795822184995995241vv12/org.apache.hadoop_hadoop-annotations-2.4.0.jar,
 file:/
tmp/hive3795822184995995241vv12/org.mortbay.jetty_jetty-util-6.1.26.jar, 
file:/tmp/hive3795822184995995241vv12/org.apache.commons_commons-math3-3.1.1.jar,
 file:/tmp/hive379582
2184995995241vv12/javax.transaction_jta-1.1.jar, 
file:/tmp/hive3795822184995995241vv12/commons-httpclient_commons-httpclient-3.1.jar,
 file:/tmp/hive3795822184995995241vv12/xml
enc_xmlenc-0.52.jar, 
file:/tmp/hive3795822184995995241vv12/org.sonatype.sisu.inject_cglib-2.2.1-v20090111.jar,
 file:/tmp/hive3795822184995995241vv12/com.google.code.findbugs_j
sr305-1.3.9.jar, 
file:/tmp/hive3795822184995995241vv12/commons-codec_commons-codec-1.4.jar, 

[jira] [Created] (SPARK-7885) add config to control map aggregation in spark sql

2015-05-26 Thread jeanlyn (JIRA)
jeanlyn created SPARK-7885:
--

 Summary: add config to control map aggregation in spark sql
 Key: SPARK-7885
 URL: https://issues.apache.org/jira/browse/SPARK-7885
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.3.1, 1.2.2, 1.2.0
Reporter: jeanlyn


For now, *execution.HashAggregation* adds map-side aggregation in order to 
decrease the shuffle data. However, we found GC problems when we use this 
optimization, and eventually the executors crash. For example,
{noformat} 
select sale_ord_id as order_id,
  coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount,
  coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount,
  coalesce(sum(flash_gp_offer_amount),0.0) + 
coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount,
  coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount,
  coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount,
  0.0 as telecom_point_offer_amount,
  coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount,
  coalesce(sum(jq_pay_amount),0.0) + 
coalesce(sum(pop_shop_jq_pay_amount),0.0) + 
coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount,
  coalesce(sum(dq_pay_amount),0.0) + 
coalesce(sum(pop_shop_dq_pay_amount),0.0) + 
coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount,
  coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount ,
  coalesce(sum(mobile_red_packet_pay_amount),0.0) as 
mobile_red_packet_pay_amount,
  coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount,
  coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount,
  coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount,
  coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount,
  coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount
from    ord_at_det_di
where   ds = '2015-05-20'
group  by   sale_ord_id
{noformat}
The SQL scans two text files of 360MB each; we use 6 executors, each with 8GB 
memory and 2 CPUs.
We can add a config to control map-side aggregation to avoid it. 
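
Below is a minimal sketch, against the Scala API, of how such a switch could be 
used once it exists. The config name *spark.sql.execution.mapAggregation* is 
purely hypothetical (this ticket proposes adding a switch like it); only the 
query shape comes from the example above.
{code}
// Hedged sketch (Spark 1.2/1.3-era API). "spark.sql.execution.mapAggregation"
// is a hypothetical name for the proposed switch, not an existing Spark conf.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object MapAggregationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-7885 sketch"))
    val hiveContext = new HiveContext(sc)

    // Proposed behaviour: when disabled, skip the map-side hash aggregation so
    // each task does not build a large in-memory hash map before the shuffle.
    hiveContext.setConf("spark.sql.execution.mapAggregation", "false")

    hiveContext.sql(
      """select sale_ord_id,
        |       coalesce(sum(sku_offer_amount), 0.0) as sku_offer_amount
        |from ord_at_det_di
        |where ds = '2015-05-20'
        |group by sale_ord_id""".stripMargin).collect()
  }
}
{code}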



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6392) [SQL]class not found exception thows when `add jar` use spark cli

2015-03-17 Thread jeanlyn (JIRA)
jeanlyn created SPARK-6392:
--

 Summary: [SQL]class not found exception thows when `add jar` use 
spark cli 
 Key: SPARK-6392
 URL: https://issues.apache.org/jira/browse/SPARK-6392
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn
Priority: Minor


When we use the Spark CLI to add a jar dynamically, we get a 
*java.lang.ClassNotFoundException* when we use a class from the jar to create a 
UDF. For example:
{noformat}
spark-sql> add jar /home/jeanlyn/hello.jar;
spark-sql> create temporary function hello as 'hello';
spark-sql> select hello(name) from person;
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most 
recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): 
java.lang.ClassNotFoundException: hello
{noformat}
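
For reference, a hedged sketch of what a minimal hello.jar could contain: a 
Hive UDF class named 'hello', matching the CREATE TEMPORARY FUNCTION statement 
above. Only the class name comes from the repro; the body is illustrative.
{code}
// Illustrative Hive UDF that could be packaged into hello.jar.
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

class hello extends UDF {
  // Hive resolves evaluate() by reflection and calls it once per input row.
  def evaluate(name: Text): Text =
    if (name == null) null else new Text("hello " + name.toString)
}
{code}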



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5498) [SPARK-SQL]when the partition schema does not match table schema,it throws java.lang.ClassCastException and so on

2015-01-30 Thread jeanlyn (JIRA)
jeanlyn created SPARK-5498:
--

 Summary: [SPARK-SQL]when the partition schema does not match table 
schema,it throws java.lang.ClassCastException and so on
 Key: SPARK-5498
 URL: https://issues.apache.org/jira/browse/SPARK-5498
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn


When the partition schema does not match the table schema, an exception is 
thrown while the task is running. For example, we modify the type of a column 
from int to bigint with the SQL *ALTER TABLE table_with_partition CHANGE COLUMN 
key key BIGINT*, then query partition data that was written before the change; 
we get the exception shown after the sketch below.
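
A minimal repro sketch against the Scala API, assuming the stock Hive test 
table *src* as a data source; only the ALTER TABLE statement is taken verbatim 
from the report, the rest is illustrative.
{code}
// Hedged repro sketch (Spark 1.2-era HiveContext API); assumes a Hive-backed
// partitioned table and the stock Hive test table `src` as a data source.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object PartitionSchemaMismatch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-5498 repro"))
    val hive = new HiveContext(sc)

    hive.sql("CREATE TABLE table_with_partition (key INT, value STRING) " +
      "PARTITIONED BY (ds STRING)")
    hive.sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds='1') " +
      "SELECT key, value FROM src")
    // Widen the column type after the partition above was written with key as INT.
    hive.sql("ALTER TABLE table_with_partition CHANGE COLUMN key key BIGINT")
    // Reading the old partition now fails with the ClassCastException below.
    hive.sql("SELECT * FROM table_with_partition").collect()
  }
}
{code}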
{noformat}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 
(TID 30, BJHC-HADOOP-HERA-16950.jeanlyn.local): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableInt
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:322)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at 

[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-01-05 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-5068:
---
Fix Version/s: (was: 1.2.1)

 When the path not found in the hdfs,we can't get the result
 ---

 Key: SPARK-5068
 URL: https://issues.apache.org/jira/browse/SPARK-5068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn

 when the partition path is found in the metastore but not found in HDFS, it 
 will cause some problems, as follows:
 {noformat}
 hive> show partitions partition_test;
 OK
 dt=1
 dt=2
 dt=3
 dt=4
 Time taken: 0.168 seconds, Fetched: 4 row(s)
 {noformat}
 {noformat}
 hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
 Found 3 items
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
 /user/jeanlyn/warehouse/partition_test/dt=1
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
 /user/jeanlyn/warehouse/partition_test/dt=3
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
 /user/jeanlyn/warehouse/partition_test/dt=4
 {noformat}
 when I run the SQL 
 {noformat}
 select * from partition_test limit 10
 {noformat} in *hive*, I have no problem, but when I run it in *spark-sql* I get 
 the following error:
 {noformat}
 Exception in thread main org.apache.hadoop.mapred.InvalidInputException: 
 Input path does not exist: 
 hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
 at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
 at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
 at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
 at org.apache.spark.sql.hive.testpartition.main(test.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-01-04 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-5068:
---
Fix Version/s: 1.2.1

 When the path not found in the hdfs,we can't get the result
 ---

 Key: SPARK-5068
 URL: https://issues.apache.org/jira/browse/SPARK-5068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn
 Fix For: 1.2.1


 when the partition path is found in the metastore but not found in HDFS, it 
 will cause some problems, as follows:
 {noformat}
 hive> show partitions partition_test;
 OK
 dt=1
 dt=2
 dt=3
 dt=4
 Time taken: 0.168 seconds, Fetched: 4 row(s)
 {noformat}
 {noformat}
 hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
 Found 3 items
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
 /user/jeanlyn/warehouse/partition_test/dt=1
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
 /user/jeanlyn/warehouse/partition_test/dt=3
 drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
 /user/jeanlyn/warehouse/partition_test/dt=4
 {noformat}
 when I run the SQL 
 {noformat}
 select * from partition_test limit 10
 {noformat} in *hive*, I have no problem, but when I run it in *spark-sql* I get 
 the following error:
 {noformat}
 Exception in thread main org.apache.hadoop.mapred.InvalidInputException: 
 Input path does not exist: 
 hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
 at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
 at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
 at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
 at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
 at org.apache.spark.sql.hive.testpartition.main(test.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-5084) when mysql is used as the metadata storage for spark-sql, Exception occurs when HiveQuerySuite is excute

2015-01-04 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264183#comment-14264183
 ] 

jeanlyn commented on SPARK-5084:


Could you add more description?

 when mysql is used as the metadata storage for spark-sql, Exception occurs 
 when HiveQuerySuite is excute 
 -

 Key: SPARK-5084
 URL: https://issues.apache.org/jira/browse/SPARK-5084
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: baishuo





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-01-03 Thread jeanlyn (JIRA)
jeanlyn created SPARK-5068:
--

 Summary: When the path not found in the hdfs,we can't get the 
result
 Key: SPARK-5068
 URL: https://issues.apache.org/jira/browse/SPARK-5068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn


when the partition path is found in the metastore but not found in HDFS, it 
will cause some problems, as follows:
```
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
```

```
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
```
when I run the SQL `select * from partition_test limit 10` in **hive**, I have 
no problem, but when I run it in spark-sql I get the following error:

```
Exception in thread main org.apache.hadoop.mapred.InvalidInputException: 
Input path does not exist: 
hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
at org.apache.spark.sql.hive.testpartition.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
```
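
A hedged sketch of how this surfaces from the Scala API, assuming partition 
dt=2 was removed directly in HDFS while still being registered in the 
metastore; the table name and paths follow the report, the rest is 
illustrative.
{code}
// Hedged repro sketch (Spark 1.2-era HiveContext API). Partition dt=2 exists
// in the metastore, but /user/jeanlyn/warehouse/partition_test/dt=2 was deleted
// out-of-band in HDFS, so computing the input splits fails.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object MissingPartitionPath {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-5068 repro"))
    val hive = new HiveContext(sc)

    // HadoopRDD.getPartitions lists every registered partition directory and
    // throws InvalidInputException for the missing dt=2 path.
    hive.sql("select * from partition_test limit 10").collect()
  }
}
{code}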




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-01-03 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-5068:
---
Description: 
when the partition path is found in the metastore but not found in HDFS, it 
will cause some problems, as follows:
```
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
```

```
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
```
when I run the SQL `select * from partition_test limit 10` in **hive**, I have 
no problem, but when I run it in spark-sql I get the following error:

```
Exception in thread main org.apache.hadoop.mapred.InvalidInputException: 
Input path does not exist: 
hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
at org.apache.spark.sql.hive.testpartition.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
```


  was:
when the partion path was found in the metastore but not found in the hdfs,it 
will casue some problems as follow:
```
hive show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
```

```
hive dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
```
when i run the sq `select * from partition_test limit 10`l in  **hive**,i 

[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-01-03 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-5068:
---
Description: 
when the partition path is found in the metastore but not found in HDFS, it 
will cause some problems, as follows:
{noformat}
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
{noformat}

{noformat}
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
{noformat}
when I run the SQL 
{noformat}
select * from partition_test limit 10
{noformat} in *hive*, I have no problem, but when I run it in *spark-sql* I get 
the following error:

{noformat}
Exception in thread main org.apache.hadoop.mapred.InvalidInputException: 
Input path does not exist: 
hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
at org.apache.spark.sql.hive.testpartition.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{noformat}


  was:
when the partion path was found in the metastore but not found in the hdfs,it 
will casue some problems as follow:
```
hive show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
```

```
hive dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
```
when i 

[jira] [Issue Comment Deleted] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2014-10-23 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-3967:
---
Comment: was deleted

(was: dsa
dsa)

 Spark applications fail in yarn-cluster mode when the directories configured 
 in yarn.nodemanager.local-dirs are located on different disks/partitions
 -

 Key: SPARK-3967
 URL: https://issues.apache.org/jira/browse/SPARK-3967
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Christophe PRÉAUD
 Attachments: spark-1.1.0-utils-fetch.patch, 
 spark-1.1.0-yarn_cluster_tmpdir.patch


 Spark applications fail from time to time in yarn-cluster mode (but not in 
 yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
 set to a comma-separated list of directories which are located on different 
 disks/partitions.
 Steps to reproduce:
 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
 directories located on different partitions (the more you set, the more 
 likely it will be to reproduce the bug):
 (...)
 <property>
   <name>yarn.nodemanager.local-dirs</name>
   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
 </property>
 (...)
 2. Launch (several times) an application in yarn-cluster mode, it will fail 
 (apparently randomly) from time to time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2014-10-23 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181057#comment-14181057
 ] 

jeanlyn commented on SPARK-3967:


dsa
dsa

 Spark applications fail in yarn-cluster mode when the directories configured 
 in yarn.nodemanager.local-dirs are located on different disks/partitions
 -

 Key: SPARK-3967
 URL: https://issues.apache.org/jira/browse/SPARK-3967
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Christophe PRÉAUD
 Attachments: spark-1.1.0-utils-fetch.patch, 
 spark-1.1.0-yarn_cluster_tmpdir.patch


 Spark applications fail from time to time in yarn-cluster mode (but not in 
 yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
 set to a comma-separated list of directories which are located on different 
 disks/partitions.
 Steps to reproduce:
 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
 directories located on different partitions (the more you set, the more 
 likely it will be to reproduce the bug):
 (...)
 <property>
   <name>yarn.nodemanager.local-dirs</name>
   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
 </property>
 (...)
 2. Launch (several times) an application in yarn-cluster mode, it will fail 
 (apparently randomly) from time to time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2014-10-21 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178221#comment-14178221
 ] 

jeanlyn commented on SPARK-3967:


This issue also happens to me, but I want to know why it doesn't happen in 
*yarn-client* mode.

 Spark applications fail in yarn-cluster mode when the directories configured 
 in yarn.nodemanager.local-dirs are located on different disks/partitions
 -

 Key: SPARK-3967
 URL: https://issues.apache.org/jira/browse/SPARK-3967
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Christophe PRÉAUD
 Attachments: spark-1.1.0-utils-fetch.patch, 
 spark-1.1.0-yarn_cluster_tmpdir.patch


 Spark applications fail from time to time in yarn-cluster mode (but not in 
 yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
 set to a comma-separated list of directories which are located on different 
 disks/partitions.
 Steps to reproduce:
 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
 directories located on different partitions (the more you set, the more 
 likely it will be to reproduce the bug):
 (...)
 <property>
   <name>yarn.nodemanager.local-dirs</name>
   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
 </property>
 (...)
 2. Launch (several times) an application in yarn-cluster mode, it will fail 
 (apparently randomly) from time to time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2014-10-21 Thread jeanlyn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178291#comment-14178291
 ] 

jeanlyn commented on SPARK-3967:


I think this pull request can also be referenced:
https://github.com/apache/spark/pull/1616

 Spark applications fail in yarn-cluster mode when the directories configured 
 in yarn.nodemanager.local-dirs are located on different disks/partitions
 -

 Key: SPARK-3967
 URL: https://issues.apache.org/jira/browse/SPARK-3967
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Christophe PRÉAUD
 Attachments: spark-1.1.0-utils-fetch.patch, 
 spark-1.1.0-yarn_cluster_tmpdir.patch


 Spark applications fail from time to time in yarn-cluster mode (but not in 
 yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
 set to a comma-separated list of directories which are located on different 
 disks/partitions.
 Steps to reproduce:
 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
 directories located on different partitions (the more you set, the more 
 likely it will be to reproduce the bug):
 (...)
 <property>
   <name>yarn.nodemanager.local-dirs</name>
   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
 </property>
 (...)
 2. Launch (several times) an application in yarn-cluster mode, it will fail 
 (apparently randomly) from time to time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org