Jonas,
I don’t know if this is your problem or not, so I thought I’d provide
some info. This behavior started after running the latest Crail code,
and it does appear to have something to do with opening and closing
files. A few more things to note:
- Running terasort with HDFS for tmp data & the Crail shuffle is okay.
- Running terasort with Crail for tmp data & without the Crail shuffle is okay.
- Running terasort with Crail for tmp data & the Crail shuffle causes this problem.
So when Crail is used for everything, it falls apart.
Here is what SPDK is showing:
Aug 16 14:00:30 minnie a3f92fbd8b97[2136]: rdma.c:1422:spdk_nvmf_rdma_request_parse_sgl: *ERROR*: SGL length 0x77400 exceeds max io size 0x20000
Aug 16 14:00:30 minnie a3f92fbd8b97[2136]: rdma.c:2501:spdk_nvmf_process_ib_event: *NOTICE*: Async event: last WQE reached
Aug 16 14:00:30 minnie a3f92fbd8b97[2136]: rdma.c:2501:spdk_nvmf_process_ib_event: *NOTICE*: Async event: last WQE reached
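For reference, 0x77400 is 488448 bytes, which matches the size shown below for the /spark/shuffle/shuffle_0 directory, while the max io size 0x20000 is 131072 bytes (128 KiB). So a transfer that large would have to be split before being posted; here is a minimal Java sketch of the arithmetic (hypothetical helper, not the jNVMf or SPDK API):

public final class IoSplit {
    static final long MAX_IO_SIZE = 0x20000; // 131072 bytes (128 KiB), from the SPDK log

    // Chunk lengths for one transfer of the given size.
    static long[] split(long length) {
        int n = (int) ((length + MAX_IO_SIZE - 1) / MAX_IO_SIZE);
        long[] chunks = new long[n];
        for (int i = 0; i < n; i++) {
            chunks[i] = Math.min(MAX_IO_SIZE, length - (long) i * MAX_IO_SIZE);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 0x77400 = 488448 bytes -> three 131072-byte chunks plus a 95232-byte tail
        System.out.println(java.util.Arrays.toString(split(0x77400)));
    }
}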
From the client:
************************************
TeraGen
************************************
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/crail/jars/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/crail/jars/jnvmf-1.7-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/crail/jars/disni-2.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/spark-2.4.2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/08/16 13:50:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
===========================================================================
===========================================================================
Input size: 1000MB
Total number of records: 10000000
Number of output partitions: 2
Number of records/output partition: 5000000
===========================================================================
===========================================================================
Number of records written: 10000000
real 0m13.046s
user 0m13.633s
sys 0m1.578s
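As a sanity check, the TeraGen numbers above are self-consistent, assuming TeraSort's fixed 100-byte records (10-byte key plus 90-byte value) and decimal megabytes:

public class TeraGenCheck {
    public static void main(String[] args) {
        long inputBytes = 1000L * 1000 * 1000; // "Input size: 1000MB" (decimal MB)
        long records = inputBytes / 100;       // 100-byte records -> 10000000, as reported
        long perPartition = records / 2;       // 2 output partitions -> 5000000, as reported
        System.out.println(records + " records, " + perPartition + " per partition");
    }
}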
************************************
TeraSort
************************************
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/crail/jars/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/crail/jars/jnvmf-1.7-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/crail/jars/disni-2.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/spark-2.4.2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/08/16 13:50:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/08/16 13:51:34 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.2.9: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/08/16 13:51:35 ERROR TaskSchedulerImpl: Lost executor 2 on 192.168.2.8: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/08/16 13:51:35 ERROR TaskSchedulerImpl: Lost executor 4 on 192.168.2.5: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/08/16 13:51:35 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.2.7: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/08/16 13:51:35 ERROR TaskSchedulerImpl: Lost executor 3 on 192.168.2.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/08/16 13:51:43 ERROR TaskSchedulerImpl: Lost executor 5 on 192.168.2.9: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/08/16 13:51:44 ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.2.8: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
19/08/16 13:51:44 ERROR TaskSetManager: Task 6 in stage 0.0 failed 4 times; aborting job
19/08/16 13:51:44 ERROR SparkHadoopWriter: Aborting job job_20190816135127_0001.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 26, 192.168.2.8, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:274)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopDataset$1(PairRDDFunctions.scala:1083)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$2(PairRDDFunctions.scala:1000)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:991)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$1(PairRDDFunctions.scala:979)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:979)
    at com.github.ehiggs.spark.terasort.TeraSort$.main(TeraSort.scala:63)
    at com.github.ehiggs.spark.terasort.TeraSort.main(TeraSort.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopDataset$1(PairRDDFunctions.scala:1083)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$2(PairRDDFunctions.scala:1000)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:991)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsNewAPIHadoopFile$1(PairRDDFunctions.scala:979)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:979)
    at com.github.ehiggs.spark.terasort.TeraSort$.main(TeraSort.scala:63)
    at com.github.ehiggs.spark.terasort.TeraSort.main(TeraSort.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 26, 192.168.2.8, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:274)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
    ... 32 more
hduser@master:/conf$ crail fs -ls -R /
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/crail/jars/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/crail/jars/jnvmf-1.7-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/crail/jars/disni-2.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/08/16 14:00:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/08/16 14:00:29 INFO crail: CrailHadoopFileSystem construction
19/08/16 14:00:30 INFO crail: creating singleton crail file system
19/08/16 14:00:30 INFO crail: crail.version 3101
19/08/16 14:00:30 INFO crail: crail.directorydepth 16
19/08/16 14:00:30 INFO crail: crail.tokenexpiration 10
19/08/16 14:00:30 INFO crail: crail.blocksize 1048576
19/08/16 14:00:30 INFO crail: crail.cachelimit 0
19/08/16 14:00:30 INFO crail: crail.cachepath /dev/hugepages/cache
19/08/16 14:00:30 INFO crail: crail.user crail
19/08/16 14:00:30 INFO crail: crail.shadowreplication 1
19/08/16 14:00:30 INFO crail: crail.debug true
19/08/16 14:00:30 INFO crail: crail.statistics true
19/08/16 14:00:30 INFO crail: crail.rpctimeout 1000
19/08/16 14:00:30 INFO crail: crail.datatimeout 1000
19/08/16 14:00:30 INFO crail: crail.buffersize 1048576
19/08/16 14:00:30 INFO crail: crail.slicesize 65536
19/08/16 14:00:30 INFO crail: crail.singleton true
19/08/16 14:00:30 INFO crail: crail.regionsize 1073741824
19/08/16 14:00:30 INFO crail: crail.directoryrecord 512
19/08/16 14:00:30 INFO crail: crail.directoryrandomize true
19/08/16 14:00:30 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
19/08/16 14:00:30 INFO crail: crail.locationmap
19/08/16 14:00:30 INFO crail: crail.namenode.address crail://192.168.1.164:9060
19/08/16 14:00:30 INFO crail: crail.namenode.blockselection roundrobin
19/08/16 14:00:30 INFO crail: crail.namenode.fileblocks 16
19/08/16 14:00:30 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.tcp.TcpNameNode
19/08/16 14:00:30 INFO crail: crail.namenode.log
19/08/16 14:00:30 INFO crail: crail.storage.types org.apache.crail.storage.nvmf.NvmfStorageTier
19/08/16 14:00:30 INFO crail: crail.storage.classes 1
19/08/16 14:00:30 INFO crail: crail.storage.rootclass 0
19/08/16 14:00:30 INFO crail: crail.storage.keepalive 2
19/08/16 14:00:30 INFO crail: buffer cache, allocationCount 0, bufferCount 1024
19/08/16 14:00:30 INFO crail: Initialize Nvmf storage client
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.ip 192.168.2.100
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.port 4420
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.nqn nqn.2018-12.com.StorEdgeSystems:cntlr13
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.hostnqn nqn.2014-08.org.nvmexpress:uuid:1b4e28ba-2fa1-11d2-883f-0016d3cca420
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.allocationsize 1073741824
19/08/16 14:00:30 INFO crail: crail.storage.nvmf.queueSize 64
19/08/16 14:00:30 INFO narpc: new NaRPC server group v1.0, queueDepth 32, messageSize 512, nodealy true
19/08/16 14:00:30 INFO crail: crail.namenode.tcp.queueDepth 32
19/08/16 14:00:30 INFO crail: crail.namenode.tcp.messageSize 512
19/08/16 14:00:30 INFO crail: crail.namenode.tcp.cores 2
19/08/16 14:00:30 INFO crail: connected to namenode(s) /192.168.1.164:9060
19/08/16 14:00:30 INFO crail: CrailHadoopFileSystem fs initialization done..
19/08/16 14:00:30 INFO crail: lookupDirectory: path /
19/08/16 14:00:30 INFO crail: lookup: name /, success, fd 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /
19/08/16 14:00:30 INFO crail: lookup: name /, success, fd 0
19/08/16 14:00:30 INFO crail: getDirectoryList: /
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /, fd 0, streamId 1, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: Connecting to NVMf target at Transport address = /192.168.2.100:4420, subsystem NQN = nqn.2018-12.com.StorEdgeSystems:cntlr13
19/08/16 14:00:30 INFO crail: EndpointCache miss /192.168.2.100:4420, fsId 0, cache size 1
19/08/16 14:00:30 INFO crail: lookupDirectory: path /David
19/08/16 14:00:30 INFO crail: lookup: name /David, success, fd 1
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_sort
19/08/16 14:00:30 INFO crail: lookup: name /data_sort, success, fd 24128
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark
19/08/16 14:00:30 INFO crail: lookup: name /spark, success, fd 24131
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in
19/08/16 14:00:30 INFO crail: lookup: name /data_in, success, fd 24118
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /, fd 0, streamId 1
drwxrwxrwx - crail crail 0 2019-08-16 13:14 /David
19/08/16 14:00:30 INFO crail: lookupDirectory: path /David
19/08/16 14:00:30 INFO crail: lookup: name /David, success, fd 1
19/08/16 14:00:30 INFO crail: getDirectoryList: /David
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /David, fd 1, streamId 2, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /David, fd 1, streamId 2
drwxrwxrwx - crail crail 2048 2019-08-16 13:50 /data_in
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in
19/08/16 14:00:30 INFO crail: lookup: name /data_in, success, fd 24118
19/08/16 14:00:30 INFO crail: getDirectoryList: /data_in
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /data_in, fd 24118, streamId 3, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in/_SUCCESS
19/08/16 14:00:30 INFO crail: lookup: name /data_in/_SUCCESS, success, fd 24127
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in/part-r-00000
19/08/16 14:00:30 INFO crail: lookup: name /data_in/part-r-00000, success, fd 24126
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_in/part-r-00001
19/08/16 14:00:30 INFO crail: lookup: name /data_in/part-r-00001, success, fd 24125
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /data_in, fd 24118, streamId 3
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:50 /data_in/_SUCCESS
-rw-rw-rw- 1 crail crail 500000000 2019-08-16 13:50 /data_in/part-r-00000
-rw-rw-rw- 1 crail crail 500000000 2019-08-16 13:50 /data_in/part-r-00001
drwxrwxrwx - crail crail 512 2019-08-16 13:51 /data_sort
19/08/16 14:00:30 INFO crail: lookupDirectory: path /data_sort
19/08/16 14:00:30 INFO crail: lookup: name /data_sort, success, fd 24128
19/08/16 14:00:30 INFO crail: getDirectoryList: /data_sort
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /data_sort, fd 24128, streamId 4, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /data_sort, fd 24128, streamId 4
drwxrwxrwx - crail crail 2560 2019-08-16 13:51 /spark
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark
19/08/16 14:00:30 INFO crail: lookup: name /spark, success, fd 24131
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark, fd 24131, streamId 5, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/rdd
19/08/16 14:00:30 INFO crail: lookup: name /spark/rdd, success, fd 24134
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/tmp
19/08/16 14:00:30 INFO crail: lookup: name /spark/tmp, success, fd 24135
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/broadcast
19/08/16 14:00:30 INFO crail: lookup: name /spark/broadcast, success, fd 24132
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta, success, fd 24136
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/shuffle
19/08/16 14:00:30 INFO crail: lookup: name /spark/shuffle, success, fd 24133
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark, fd 24131, streamId 5
drwxrwxrwx - crail crail 0 2019-08-16 13:51 /spark/broadcast
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/broadcast
19/08/16 14:00:30 INFO crail: lookup: name /spark/broadcast, success, fd 24132
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/broadcast
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/broadcast, fd 24132, streamId 6, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/broadcast, fd 24132, streamId 6
drwxrwxrwx - crail crail 512 2019-08-16 13:51 /spark/meta
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta, success, fd 24136
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/meta
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/meta, fd 24136, streamId 7, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts, success, fd 24137
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/meta, fd 24136, streamId 7
drwxrwxrwx - crail crail 3072 2019-08-16 13:51 /spark/meta/hosts
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts, success, fd 24137
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/meta/hosts
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/meta/hosts, fd 24137, streamId 8, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352998
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352998, success, fd 25096
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352997
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352997, success, fd 25097
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352995
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352995, success, fd 25098
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352994
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352994, success, fd 25095
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/35352996
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/35352996, success, fd 27836
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/meta/hosts/-1081267614
19/08/16 14:00:30 INFO crail: lookup: name /spark/meta/hosts/-1081267614, success, fd 24138
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/meta/hosts, fd 24137, streamId 8
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51 /spark/meta/hosts/-1081267614
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51 /spark/meta/hosts/35352994
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51 /spark/meta/hosts/35352995
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51 /spark/meta/hosts/35352996
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51 /spark/meta/hosts/35352997
-rw-rw-rw- 1 crail crail 0 2019-08-16 13:51 /spark/meta/hosts/35352998
drwxrwxrwx - crail crail 0 2019-08-16 13:51 /spark/rdd
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/rdd
19/08/16 14:00:30 INFO crail: lookup: name /spark/rdd, success, fd 24134
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/rdd
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/rdd, fd 24134, streamId 9, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/rdd, fd 24134, streamId 9
drwxrwxrwx - crail crail 512 2019-08-16 13:51 /spark/shuffle
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/shuffle
19/08/16 14:00:30 INFO crail: lookup: name /spark/shuffle, success, fd 24133
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/shuffle
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/shuffle, fd 24133, streamId 10, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: lookup: name /spark/shuffle/shuffle_0, success, fd 24140
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/shuffle, fd 24133, streamId 10
drwxrwxrwx - crail crail 488448 2019-08-16 13:51 /spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: lookup: name /spark/shuffle/shuffle_0, success, fd 24140
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/shuffle/shuffle_0, fd 24140, streamId 11, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: Cannot close, pending operations, opcount 1, path /spark/shuffle/shuffle_0
19/08/16 14:00:30 INFO crail: error when closing directory stream
java.io.IOException: Cannot close, pending operations, opcount 1
drwxrwxrwx - crail crail 5632 2019-08-16 13:51 /spark/tmp
19/08/16 14:00:30 INFO crail: lookupDirectory: path /spark/tmp
19/08/16 14:00:30 INFO crail: lookup: name /spark/tmp, success, fd 24135
19/08/16 14:00:30 INFO crail: getDirectoryList: /spark/tmp
19/08/16 14:00:30 INFO crail: CoreInputStream: open, path /spark/tmp, fd 24135, streamId 12, isDir true, readHint 0
19/08/16 14:00:30 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
19/08/16 14:00:30 INFO crail: CoreInputStream, close, path /spark/tmp, fd 24135, streamId 12
19/08/16 14:00:30 INFO crail: Closing CrailHadoopFileSystem
19/08/16 14:00:30 INFO crail: Closing CrailFS singleton
19/08/16 14:00:30 INFO crail: Cannot close, pending operations, opcount 1, path /spark/shuffle/shuffle_0
java.io.IOException: java.io.IOException: Cannot close, pending operations, opcount 1
    at org.apache.crail.CrailStore.close(CrailStore.java:55)
    at org.apache.crail.hdfs.CrailHadoopFileSystem.close(CrailHadoopFileSystem.java:290)
    at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2760)
    at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2777)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Caused by: java.io.IOException: Cannot close, pending operations, opcount 1
    at org.apache.crail.core.CoreInputStream.close(CoreInputStream.java:108)
    at org.apache.crail.core.CoreDataStore.closeFileSystem(CoreDataStore.java:515)
    at org.apache.crail.CrailStore.close(CrailStore.java:52)
    ... 4 more
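The error itself looks like the usual guard around in-flight I/O: the directory read on /spark/shuffle/shuffle_0 apparently never completed (which would be consistent with the rejected oversized SGL above), so the stream's operation counter is still at 1 when the shutdown hook closes the file system. A minimal sketch of that pattern (hypothetical class, not Crail's actual CoreInputStream):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

final class CountingStream implements AutoCloseable {
    private final AtomicInteger opcount = new AtomicInteger();

    void beginOp() { opcount.incrementAndGet(); } // called when an async read is posted
    void endOp()   { opcount.decrementAndGet(); } // called when its completion arrives

    @Override
    public void close() throws IOException {
        int pending = opcount.get();
        if (pending > 0) {
            // same shape as the message in the log above
            throw new IOException("Cannot close, pending operations, opcount " + pending);
        }
        // only now is it safe to release buffers and endpoints
    }
}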
Regards,
David
C: 714-476-2692
________________________________
From: Jonas Pfefferle <[email protected]>
Sent: Friday, August 16, 2019 7:30:03 AM
To: [email protected] <[email protected]>; David Crespi <[email protected]>
Subject: Re: [GitHub] [incubator-crail] PepperJo opened a new pull request #82: [NVMf] Make keepalive thread a daemon thread
Thanks! You too.
Let me know how it goes.
Regards,
Jonas
On Fri, 16 Aug 2019 14:22:43 +0000
David Crespi <[email protected]> wrote:
No, I haven’t tried the latest, as there wasn’t any update to the bug,
so I wasn’t sure whether you were successful or not. I will build new
images with the latest and give it a shot. Thanks, and enjoy your
weekend… you’re a lot closer to it than I am!
Regards,
David
________________________________
From: Jonas Pfefferle <[email protected]>
Sent: Friday, August 16, 2019 12:24:30 AM
To: [email protected] <[email protected]>; David Crespi <[email protected]>
Subject: Re: [GitHub] [incubator-crail] PepperJo opened a new pull request #82: [NVMf] Make keepalive thread a daemon thread
Hi David,
at least for me, this pull request fixes the closing problem with
Spark. Did you experience the hang at the start before or just with
the latest Crail version?
Regards,
Jonas
On Thu, 15 Aug 2019 19:25:05 +0000
David Crespi <[email protected]> wrote:
Hi Jonas,
Did you ever get this to work? I see the bug is still open, with no
update there. Is there a way to work around it? It appears that when
the teragen file size is large, Crail just hangs when starting the
sort. I’ve also tried to do it in parts, with the same results.
Regards,
David
________________________________
From: GitBox <[email protected]>
Sent: Monday, August 5, 2019 1:41:10 AM
To: [email protected] <[email protected]>
Subject: [GitHub] [incubator-crail] PepperJo opened a new pull request #82: [NVMf] Make keepalive thread a daemon thread
PepperJo opened a new pull request #82: [NVMf] Make keepalive thread a daemon thread
URL: https://github.com/apache/incubator-crail/pull/82
Daemonize the keepalive thread to allow applications to exit when the
main method returns without closing the storage client explicitly. For
example, Spark has this requirement.
https://issues.apache.org/jira/browse/CRAIL-98
Signed-off-by: Jonas Pfefferle <[email protected]>
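For context, the technique the PR title refers to is Java's daemon-thread flag; a minimal sketch (the loop body is hypothetical, not Crail's actual keepalive logic):

Thread keepalive = new Thread(() -> {
    while (!Thread.currentThread().isInterrupted()) {
        try {
            Thread.sleep(2000); // e.g. send an NVMf keepalive periodically
        } catch (InterruptedException e) {
            return;
        }
        // post the keepalive command here
    }
}, "nvmf-keepalive");
keepalive.setDaemon(true); // daemon threads do not keep the JVM alive at exit
keepalive.start();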
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services