Jonas,
I don’t know if this is your problem or not, so I thought I’d provide some info.
This behavior started after running the latest crail code, and it does appear
to have something to do with opening and closing files.
A few more things to note:
- Running terasort with hdfs for tmp data & crail shuffle is okay.
- Running terasort with crail for tmp data & no crail shuffle is okay.
- Running terasort with crail for tmp data & crail shuffle causes this problem.
So when crail is used for everything, it falls apart.
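(For context, a minimal sketch of how the failing case is wired up, assuming the crail-spark-io plugin. The spark.shuffle.manager property and class name are from that plugin's README; everything else below is illustrative, not our actual job config.)

    import org.apache.spark.SparkConf;

    public class RunMatrix {
        // Case 3, the failing combination: crail for tmp data AND crail shuffle.
        static SparkConf crailEverywhere() {
            return new SparkConf()
                    .setAppName("TeraSort")
                    // Crail shuffle plugin from crail-spark-io:
                    .set("spark.shuffle.manager",
                            "org.apache.spark.shuffle.crail.CrailShuffleManager");
            // The TeraGen/TeraSort paths then use the crail:// scheme via the
            // Crail HDFS adapter (e.g. crail:///data_in, crail:///data_sort).
            // Case 1 swaps the paths back to hdfs://; case 2 drops the
            // spark.shuffle.manager property instead.
        }
    }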
What SPDK is showing:
Aug 16 14:00:30 minnie a3f92fbd8b97[2136]: rdma.c:1422:spdk_nvmf_rdma_request_parse_sgl: *ERROR*: SGL length 0x77400 exceeds max io size 0x20000
Aug 16 14:00:30 minnie a3f92fbd8b97[2136]: rdma.c:2501:spdk_nvmf_process_ib_event: *NOTICE*: Async event: last WQE reached
Aug 16 14:00:30 minnie a3f92fbd8b97[2136]: rdma.c:2501:spdk_nvmf_process_ib_event: *NOTICE*: Async event: last WQE reached
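In decimal, the first line says a single request described 0x77400 = 488448 bytes (roughly 477 KiB) of data while the target caps I/O at 0x20000 = 131072 bytes (128 KiB). A client is normally expected to split transfers at the target's max I/O size; below is a purely illustrative Java sketch of such splitting. I don't know whether the crail NVMf tier or jnvmf is supposed to be doing this here, or whether the SPDK target's transport max I/O size simply needs raising, so treat this as an assumption about where the mismatch lies.

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    final class IoSplitter {
        // Split the remaining bytes of src into chunks no larger than maxIoSize.
        static List<ByteBuffer> split(ByteBuffer src, int maxIoSize) {
            List<ByteBuffer> chunks = new ArrayList<>();
            while (src.hasRemaining()) {
                ByteBuffer chunk = src.slice();                    // view from current position
                chunk.limit(Math.min(src.remaining(), maxIoSize)); // cap at the target's max I/O
                chunks.add(chunk);
                src.position(src.position() + chunk.limit());      // advance past this chunk
            }
            return chunks;
        }
    }

For a 488448-byte transfer and a 131072-byte limit that would yield three 131072-byte chunks plus one of 95232 bytes.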
From the client:
************************************
TeraGen
************************************
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/crail/jars/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/crail/jars/jnvmf-1.7-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/crail/jars/disni-2.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/spark-2.4.2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/08/16 13:50:11 WARN NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
===========================================================================
===========================================================================
Input size: 1000MB
Total number of records: 10000000
Number of output partitions: 2
Number of records/output partition: 5000000
===========================================================================
===========================================================================
Number of records written: 10000000
real 0m13.046s
user 0m13.633s
sys 0m1.578s
************************************
TeraSort
************************************
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/crail/jars/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/crail/jars/jnvmf-1.7-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/crail/jars/disni-2.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/spark-2.4.2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/08/16 13:50:24 WARN NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
19/08/16 13:51:34 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.2.9:
Remote RPC client disassociated. Likely due to containers exceeding thresholds,
or network issues. Check driver logs for WARN messages.
19/08/16 13:51:35 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.2.8:
Remote RPC client disassociated. Likely due to containers exceeding thresholds,
or network issues. Check driver logs for WARN messages.
19/08/16 13:51:35 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.2.5:
Remote RPC client disassociated. Likely due to containers exceeding thresholds,
or network issues. Check driver logs for WARN messages.
19/08/16 13:51:35 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.2.7:
Remote RPC client disassociated. Likely due to containers exceeding thresholds,
or network issues. Check driver logs for WARN messages.
19/08/16 13:51:35 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.2.6:
Remote RPC client disassociated. Likely due to containers exceeding thresholds,
or network issues. Check driver logs for WARN messages.
19/08/16 13:51:43 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.2.9:
Remote RPC client disassociated. Likely due to containers exceeding thresholds,
or network issues. Check driver logs for WARN messages.
19/08/16 13:51:44 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.2.8:
Remote RPC client disassociated. Likely due to containers exceeding thresholds,
or network issues. Check driver logs for WARN messages.
19/08/16 13:51:44 ERROR TaskSetManager: Task 6 in stage 0.0 failed 4 times;
aborting job
19/08/16 13:51:44 ERROR SparkHadoopWriter: Aborting job job_20190816135127_0001.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in
stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID
26, 192.168.2.8, executor 6): ExecutorLostFailure (executor 6 exited caused by
one of the running tasks) Reason: Remote RPC client disassociated. Likely due
to containers exceeding thresholds, or network issues. Check driver logs for
WARN messages.
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1889)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1877)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1876)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:926)
        at scala.Option.foreach(Option.scala:274)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
        at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
        at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopDataset$1(PairRDDFunctions.scala:1083)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
        at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$2(PairRDDFunctions.scala:1000)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:991)
        at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$1(PairRDDFunctions.scala:979)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:979)
        at com.github.ehiggs.spark.terasort.TeraSort$.main(TeraSort.scala:63)
        at com.github.ehiggs.spark.terasort.TeraSort.main(TeraSort.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
        at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
        at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopDataset$1(PairRDDFunctions.scala:1083)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
        at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$2(PairRDDFunctions.scala:1000)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:991)
        at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$1(PairRDDFunctions.scala:979)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:979)
        at com.github.ehiggs.spark.terasort.TeraSort$.main(TeraSort.scala:63)
        at com.github.ehiggs.spark.terasort.TeraSort.main(TeraSort.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage
0.0 (TID 26, 192.168.2.8, executor 6): ExecutorLostFailure (executor 6 exited
caused by one of the running tasks) Reason: Remote RPC client disassociated.
Likely due to containers exceeding thresholds, or network issues. Check driver
logs for WARN messages.
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1889)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1877)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1876)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:926)
        at scala.Option.foreach(Option.scala:274)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
        at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
... 32 more
hduser@master:/conf$ crail fs -ls -R /
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/crail/jars/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/crail/jars/jnvmf-1.7-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/crail/jars/disni-2.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/08/16 14:00:29 WARN NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
19/08/16 14:00:29 INFO crail: CrailHadoopFileSystem construction
19/08/16 14:00:30 INFO crail: creating singleton crail file system
19/08/16 14:00:30 INFO crail: crail.version 3101
19/08/16 14:00:30 INFO crail: crail.directorydepth 16
19/08/16 14:00:30 INFO crail: crail.tokenexpiration 10
19/08/16 14:00:30 INFO crail: crail.blocksize 1048576
19/08/16 14:00:30 INFO crail: crail.cachelimit 0
19/08/16 14:00:30 INFO crail: crail.cachepath /dev/hugepages/cache
19/08/16 14:00:30 INFO crail: crail.user crail
19/08/16 14:00:30 INFO crail: crail.shadowreplication 1
19/08/16 14:00:30 INFO crail: crail.debug true
19/08/16 14:00:30 INFO crail: crail.statistics true
19/08/16 14:00:30 INFO crail: crail.rpctimeout 1000
19/08/16 14:00:30 INFO crail: crail.datatimeout 1000
19/08/16 14:00:30 INFO crail: crail.buffersize 1048576
19/08/16 14:00:30 INFO crail: crail.slicesize 65536
19/08/16 14:00:30 INFO crail: crail.singleton true
19/08/16 14:00:30 INFO crail: crail.regionsize 1073741824
19/08/16 14:00:30 INFO crail: crail.directoryrecord 512
19/08/16 14:00:30 INFO crail: crail.directoryrandomize true
19/08/16 14:00:30 INFO crail: crail.cacheimpl
org.apache.crail.memory.MappedBufferCache
19/08/16 14:00:30 INFO crail: crail.locationmap
19/08/16 14:00:30 INFO crail: crail.namenode.address crail://192.168.1.164:9060
19/08/16 14:00:30 INFO crail: crail.namenode.blockselection roundrobin
19/08/16 14:00:30 INFO crail: crail.namenode.fileblocks 16
19/08/16 14:00:30 INFO crail: crail.namenode.rpctype
org.apache.crail.namenode.rpc.tcp.TcpNameNode
19/08/16 14:00:30 INFO crail: crail.namenode.log
19/08/16 14:00:30 INFO crail: crail.storage.types
org.apache.crail.storage.nvmf.NvmfStorageTier
19/08/16 14:00:30 INFO crail: crail.storage.classes 1
19/08/16 14:00:30 INFO crail: crail.storage.rootclass 0
19/08/16 14:00:30 INFO crail: crail.storage.keepalive 2
19/08/16 14:00:30 INFO crail: buffer cache, allocationCount 0, bufferCount 1024
19/08/16 14:00:30 INFO crail: Initialize Nvmf storage client
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.ip 192.168.2.100
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.port 4420
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.nqn
nqn.2018-12.com.StorEdgeSystems:cntlr13
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.hostnqn
nqn.2014-08.org.nvmexpress:uuid:1b4e28ba-2fa1-11d2-883f-0016d3cca420
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.allocationsize 1073741824
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.queueSize 64
19/08/16 14:00:30 INFO narpc: new NaRPC server group v1.0, queueDepth 32,
messageSize 512, nodealy true
19/08/16 14:00:30 INFO crail: crail.namenode.tcp.queueDepth 32
19/08/16 14:00:30 INFO crail: crail.namenode.tcp.messageSize 512
19/08/16 14:00:30 INFO crail: crail.namenode.tcp.cores 2
19/08/16 14:00:30 INFO crail: connected to namenode(s) /192.168.1.164:9060
19/08/16 14:00:30 INFO crail: CrailHadoopFileSystem fs initialization done..
19/08/16 14:00:30 INFO crail: lookupDirectory: path /
19/08/16 14:00:30 INFO crail: lookup: name /, success, fd 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /
19/08/16 14:00:30 INFO crail: lookup: name /, success, fd 0
19/08/16 14:00:30 INFO crail: getDirectoryList: /
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /, fd 0, streamId 1,
isDir true, readHint 0
19/08/16 14:00:30 INFO crail: Connecting to NVMf target at Transport address =
/192.168.2.100:4420, subsystem NQN = nqn.2018-12.com.StorEdgeSystems:cntlr13
19/08/16 14:00:30 INFO crail: EndpointCache miss /192.168.2.100:4420, fsId 0,
cache size 1
19/08/16 14:00:30 INFO crail: lookupDirectory: path /David
19/08/16 14:00:30 INFO crail: lookup: name /David, success, fd 1
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_sort
19/08/16 14:00:30 INFO crail: lookup: name /data_sort, success, fd 24128
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark
19/08/16 14:00:30 INFO crail: lookup: name /spark, success, fd 24131
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in
19/08/16 14:00:30 INFO crail: lookup: name /data_in, success, fd 24118
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /, fd 0, streamId 1
drwxrwxrwx - crail crail 0 2019-08-16 13:14 /David
19/08/16 14:00:30 INFO crail: lookupDirectory: path /David
19/08/16 14:00:30 INFO crail: lookup: name /David, success, fd 1
19/08/16 14:00:30 INFO crail: getDirectoryList: /David
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /David, fd 1,
streamId 2, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /David, fd 1,
streamId 2
drwxrwxrwx - crail crail 2048 2019-08-16 13:50 /data_in
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in
19/08/16 14:00:30 INFO crail: lookup: name /data_in, success, fd 24118
19/08/16 14:00:30 INFO crail: getDirectoryList: /data_in
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /data_in, fd 24118,
streamId 3, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in/_SUCCESS
19/08/16 14:00:30 INFO crail: lookup: name /data_in/_SUCCESS, success, fd 24127
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in/part-r-00000
19/08/16 14:00:30 INFO crail: lookup: name /data_in/part-r-00000, success, fd
24126
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in/part-r-00001
19/08/16 14:00:30 INFO crail: lookup: name /data_in/part-r-00001, success, fd
24125
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /data_in, fd 24118,
streamId 3
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:50 /data_in/_SUCCESS
-rw-rw-rw- 1 crail crail 500000000 2019-08-16 13:50 /data_in/part-r-00000
-rw-rw-rw- 1 crail crail 500000000 2019-08-16 13:50 /data_in/part-r-00001
drwxrwxrwx - crail crail 512 2019-08-16 13:51 /data_sort
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_sort
19/08/16 14:00:30 INFO crail: lookup: name /data_sort, success, fd 24128
19/08/16 14:00:30 INFO crail: getDirectoryList: /data_sort
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /data_sort, fd
24128, streamId 4, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /data_sort, fd
24128, streamId 4
drwxrwxrwx - crail crail 2560 2019-08-16 13:51 /spark
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark
19/08/16 14:00:30 INFO crail: lookup: name /spark, success, fd 24131
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark, fd 24131,
streamId 5, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/rdd
19/08/16 14:00:30 INFO crail: lookup: name /spark/rdd, success, fd 24134
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/tmp
19/08/16 14:00:30 INFO crail: lookup: name /spark/tmp, success, fd 24135
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/broadcast
19/08/16 14:00:30 INFO crail: lookup: name /spark/broadcast, success, fd 24132
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta, success, fd 24136
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/shuffle
19/08/16 14:00:30 INFO crail: lookup: name /spark/shuffle, success, fd 24133
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark, fd 24131,
streamId 5
drwxrwxrwx - crail crail 0 2019-08-16 13:51 /spark/broadcast
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/broadcast
19/08/16 14:00:30 INFO crail: lookup: name /spark/broadcast, success, fd 24132
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/broadcast
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/broadcast, fd
24132, streamId 6, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/broadcast, fd
24132, streamId 6
drwxrwxrwx - crail crail 512 2019-08-16 13:51 /spark/meta
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta, success, fd 24136
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/meta
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/meta, fd
24136, streamId 7, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts, success, fd 24137
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/meta, fd
24136, streamId 7
drwxrwxrwx - crail crail 3072 2019-08-16 13:51 /spark/meta/hosts
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts, success, fd 24137
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/meta/hosts
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/meta/hosts,
fd 24137, streamId 8, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352998
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352998, success,
fd 25096
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352997
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352997, success,
fd 25097
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352995
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352995, success,
fd 25098
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352994
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352994, success,
fd 25095
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352996
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352996, success,
fd 27836
19/08/16 14:00:30 INFO crail: lookupDirectory: path
/spark/meta/hosts/-1081267614
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/-1081267614,
success, fd 24138
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/meta/hosts,
fd 24137, streamId 8
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51
/spark/meta/hosts/-1081267614
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51
/spark/meta/hosts/35352994
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51
/spark/meta/hosts/35352995
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51
/spark/meta/hosts/35352996
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51
/spark/meta/hosts/35352997
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51
/spark/meta/hosts/35352998
drwxrwxrwx - crail crail 0 2019-08-16 13:51 /spark/rdd
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/rdd
19/08/16 14:00:30 INFO crail: lookup: name /spark/rdd, success, fd 24134
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/rdd
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/rdd, fd
24134, streamId 9, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/rdd, fd
24134, streamId 9
drwxrwxrwx - crail crail 512 2019-08-16 13:51 /spark/shuffle
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/shuffle
19/08/16 14:00:30 INFO crail: lookup: name /spark/shuffle, success, fd 24133
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/shuffle
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/shuffle, fd
24133, streamId 10, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: lookup: name /spark/shuffle/shuffle_0, success,
fd 24140
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/shuffle, fd
24133, streamId 10
drwxrwxrwx - crail crail 488448 2019-08-16 13:51 /spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: lookup: name /spark/shuffle/shuffle_0, success,
fd 24140
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path
/spark/shuffle/shuffle_0, fd 24140, streamId 11, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: Cannot close, pending operations, opcount 1, path
/spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: error when closing directory stream
java.io.IOException: Cannot close, pending operations, opcount 1
drwxrwxrwx - crail crail 5632 2019-08-16 13:51 /spark/tmp
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/tmp
19/08/16 14:00:30 INFO crail: lookup: name /spark/tmp, success, fd 24135
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/tmp
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/tmp, fd
24135, streamId 12, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/tmp, fd
24135, streamId 12
19/08/16 14:00:30 INFO crail: Closing CrailHadoopFileSystem
19/08/16 14:00:30 INFO crail: Closing CrailFS singleton
19/08/16 14:00:30 INFO crail: Cannot close, pending operations, opcount 1, path
/spark/shuffle/shuffle_0
java.io.IOException: java.io.IOException: Cannot close, pending operations, opcount 1
        at org.apache.crail.CrailStore.close(CrailStore.java:55)
        at org.apache.crail.hdfs.CrailHadoopFileSystem.close(CrailHadoopFileSystem.java:290)
        at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2760)
        at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2777)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Caused by: java.io.IOException: Cannot close, pending operations, opcount 1
        at org.apache.crail.core.CoreInputStream.close(CoreInputStream.java:108)
        at org.apache.crail.core.CoreDataStore.closeFileSystem(CoreDataStore.java:515)
        at org.apache.crail.CrailStore.close(CrailStore.java:52)
        ... 4 more
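The "opcount 1" messages line up with the open/close suspicion at the top: a directory stream tracks its in-flight operations and refuses to close while one is still pending. A hedged sketch of that guard pattern (illustrative only, not Crail's actual CoreInputStream code):

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicInteger;

    abstract class CountedStream {
        private final AtomicInteger opcount = new AtomicInteger();

        void begin() { opcount.incrementAndGet(); }  // an async op was issued
        void end()   { opcount.decrementAndGet(); }  // its completion arrived

        public void close() throws IOException {
            int pending = opcount.get();
            if (pending > 0) {
                throw new IOException("Cannot close, pending operations, opcount " + pending);
            }
        }
    }

If the operation that bumped the count on /spark/shuffle/shuffle_0 is the same request the target rejected with the SGL error, its completion never arrives, the count never drains back to zero, and every close() afterwards throws; that would fit the all-crail configuration being the only one that falls apart.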
Regards,
David
C: 714-476-2692
________________________________
From: Jonas Pfefferle <[email protected]>
Sent: Friday, August 16, 2019 7:30:03 AM
To: [email protected] <[email protected]>; David Crespi <[email protected]>
Subject: Re: [GitHub] [incubator-crail] PepperJo opened a new pull request #82: [NVMf] Make keepalive thread a daemon thread
Thanks! You too.
Let me know how it goes.
Regards,
Jonas
On Fri, 16 Aug 2019 14:22:43 +0000
David Crespi <[email protected]> wrote:
> No, I haven’t tried the latest as there wasn’t any update to the bug,
> so I wasn’t sure if you were successful or not. I will build new images
> with the latest and give it a shot. Thanks and enjoy your weekend…
> you’re a lot closer to it than me!
>
> Regards,
>
> David
>
> ________________________________
> From: Jonas Pfefferle <[email protected]>
> Sent: Friday, August 16, 2019 12:24:30 AM
> To: [email protected] <[email protected]>; David Crespi <[email protected]>
> Subject: Re: [GitHub] [incubator-crail] PepperJo opened a new pull request #82: [NVMf] Make keepalive thread a daemon thread
>
> Hi David,
>
>
> at least for me, this pull request fixes the closing problem with Spark.
> Did you experience the hang at the start before, or just with the latest
> Crail version?
>
> Regards,
> Jonas
>
> On Thu, 15 Aug 2019 19:25:05 +0000
> David Crespi <[email protected]> wrote:
>> Hi Jonas,
>>
>> Did you ever get this to work? I see the bug is still open, and no
>> update there.
>>
>> Is there a way to work around this? It appears that when the file size
>> from teragen is large, crail just hangs when starting the sort. I’ve
>> also tried to do it in parts, with the same results.
>>
>> Regards,
>>
>> David
>>
>> ________________________________
>> From: GitBox <[email protected]>
>> Sent: Monday, August 5, 2019 1:41:10 AM
>> To: [email protected] <[email protected]>
>> Subject: [GitHub] [incubator-crail] PepperJo opened a new pull request #82: [NVMf] Make keepalive thread a daemon thread
>>
>> PepperJo opened a new pull request #82: [NVMf] Make keepalive thread a daemon thread
>> URL: https://github.com/apache/incubator-crail/pull/82
>>
>>
>> Daemonize the keepalive thread to allow applications to
>> exit when the main method returns without closing the
>> storage client explicitly. For example, Spark has this
>> requirement.
>>
>> https://issues.apache.org/jira/browse/CRAIL-98
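>>
>> (As a sketch of what that means in practice: a daemon thread no longer
>> keeps the JVM alive once main() returns. The keepalive loop below is
>> hypothetical; only the setDaemon(true) call is the point of the PR.)
>>
>>     // Hypothetical keepalive loop; names are illustrative.
>>     Runnable sendKeepAlive = () -> { /* issue NVMe keep-alive to the target */ };
>>     Thread keepalive = new Thread(() -> {
>>         while (!Thread.currentThread().isInterrupted()) {
>>             try {
>>                 Thread.sleep(2000);  // crail.storage.keepalive above is 2 (seconds)
>>                 sendKeepAlive.run();
>>             } catch (InterruptedException e) {
>>                 return;              // interrupted: stop cleanly
>>             }
>>         }
>>     });
>>     keepalive.setDaemon(true);  // JVM may now exit when main() returns
>>     keepalive.start();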
>>
>> Signed-off-by: Jonas Pfefferle <[email protected]>
>>
>> ----------------------------------------------------------------
>> This is an automated message from the Apache Git Service.
>> To respond to the message, please log on to GitHub and use the
>> URL above to go to the specific comment.
>>
>> For queries about this service, please contact Infrastructure at:
>> [email protected]
>>
>>
>> With regards,
>> Apache Git Services