Hello Mahesh, Pat. Thanks for your answers. I will try with a bigger EC2 instance.
Carlos.

2017-08-02 18:42 GMT+02:00 Pat Ferrel <p...@occamsmachete.com>:

> Actually, memory may be your problem. Mahesh Hegde may be right about
> trying smaller sets. Since it sounds like you have all services running on
> one machine, they may be in contention for resources.
>
>
> On Aug 2, 2017, at 9:35 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> Something is not configured correctly. `pio import` should work with any
> size of file, but this may be an undersized instance for that much data.
>
> Spark needs memory; it keeps all the data it needs for a particular
> calculation spread across all cluster machines in memory. That includes
> derived data, so a total of 32 GB may not be enough. But that is not your
> current problem.
>
> I would start by verifying that all components are working properly,
> starting with HDFS, then HBase, then Spark, then Elasticsearch. I see
> several storage backend errors below.
>
>
> On Aug 2, 2017, at 4:52 AM, Carlos Vidal <carlos.vi...@beeva.com> wrote:
>
> Hello,
>
> I have installed the pio + ur AMI in AWS, on an m4.2xlarge instance with
> 32 GB of RAM and 8 vCPUs.
>
> When I try to import a 20 GB events file for my application, the system
> crashes. The command I have used is:
>
> pio import --appid 4 --input my_events.json
>
> This command launches a Spark job that needs to perform 800 tasks. When
> the process reaches task 211 it crashes.
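Following the suggestion of trying smaller sets, one way to avoid a single 800-task import job is to split the events file and import it chunk by chunk. A minimal sketch, demonstrated on a small generated sample (the sample file, the chunk size, and the echo-instead-of-execute guard are illustrative; in practice you would point `split` at the real my_events.json and run the printed commands):

```shell
# Generate a tiny newline-delimited sample standing in for my_events.json.
seq 1 100 | sed 's/.*/{"event":"view","entityId":"&"}/' > sample_events.json

# Split into fixed-size chunks; each chunk becomes its own, smaller Spark job.
split -l 25 sample_events.json events_chunk_

# Print the per-chunk import commands (drop the echo to actually import).
for f in events_chunk_*; do
  echo "pio import --appid 4 --input $f"
done
```

With a real 20 GB file, a chunk size on the order of a few million lines keeps each import job modest. If `pio import` forwards arguments after `--` to spark-submit the way `pio train` does, memory can also be raised per job (e.g. `-- --driver-memory 8g`), though that is worth verifying against the PredictionIO docs for your version.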
> This is what I can see in my pio.log file:
>
> 2017-08-02 11:16:17,101 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation [htable-pool230-t1] - Encountered problems when prefetch hbase:meta table:
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
> Wed Aug 02 11:07:06 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:44866
> Wed Aug 02 11:07:07 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:44866
> Wed Aug 02 11:07:07 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:44866
> Wed Aug 02 11:07:08 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:44866
> Wed Aug 02 11:07:10 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
> [... 30 further retries, all org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952 with java.net.ConnectException: Connection refused, at increasing intervals through Wed Aug 02 11:16:17 UTC 2017 ...]
>
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:129)
> at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:714)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1153)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1217)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1105)
> at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1062)
> at org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:365)
> at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:507)
> at org.apache.hadoop.hbase.client.AsyncProcess.logAndResubmit(AsyncProcess.java:717)
> at org.apache.hadoop.hbase.client.AsyncProcess.receiveGlobalFailure(AsyncProcess.java:664)
> at org.apache.hadoop.hbase.client.AsyncProcess.access$100(AsyncProcess.java:93)
> at org.apache.hadoop.hbase.client.AsyncProcess$1.run(AsyncProcess.java:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
> at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupConnection(RpcClient.java:578)
> at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:868)
> at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1543)
> at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)
> at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
> at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
> at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:29966)
> at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1508)
> at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:710)
> at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:708)
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
> ... 17 more
> 2017-08-02 11:21:04,430 ERROR org.apache.spark.scheduler.LiveListenerBus [Thread-3] - SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@66c4a5d2)
> 2017-08-02 11:21:04,431 ERROR org.apache.spark.scheduler.LiveListenerBus [Thread-3] - SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1501672864431,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
> 2017-08-02 11:28:47,129 INFO org.apache.predictionio.tools.commands.Management$ [main] - Inspecting PredictionIO...
> 2017-08-02 11:28:47,132 INFO org.apache.predictionio.tools.commands.Management$ [main] - PredictionIO 0.11.0-incubating is installed at /opt/data/PredictionIO-0.11.0-incubating
> 2017-08-02 11:28:47,132 INFO org.apache.predictionio.tools.commands.Management$ [main] - Inspecting Apache Spark...
> 2017-08-02 11:28:47,142 INFO org.apache.predictionio.tools.commands.Management$ [main] - Apache Spark is installed at /usr/local/spark
> 2017-08-02 11:28:47,175 INFO org.apache.predictionio.tools.commands.Management$ [main] - Apache Spark 1.6.3 detected (meets minimum requirement of 1.3.0)
> 2017-08-02 11:28:47,175 INFO org.apache.predictionio.tools.commands.Management$ [main] - Inspecting storage backend connections...
> 2017-08-02 11:28:47,195 INFO org.apache.predictionio.data.storage.Storage$ [main] - Verifying Meta Data Backend (Source: ELASTICSEARCH)...
> 2017-08-02 11:28:48,225 INFO org.apache.predictionio.data.storage.Storage$ [main] - Verifying Model Data Backend (Source: HDFS)...
> 2017-08-02 11:28:48,447 INFO org.apache.predictionio.data.storage.Storage$ [main] - Verifying Event Data Backend (Source: HBASE)...
> 2017-08-02 11:28:48,979 INFO org.apache.predictionio.data.storage.Storage$ [main] - Test writing to Event Store (App Id 0)...
> 2017-08-02 11:29:49,026 ERROR org.apache.predictionio.tools.commands.Management$ [main] - Unable to connect to all storage backends successfully.
>
>
> On the other hand, once this happens, if I run `pio status` this is what I obtain:
>
> aml@ip-10-41-11-227:~$ pio status
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/data/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/data/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> [INFO] [Management$] Inspecting PredictionIO...
> [INFO] [Management$] PredictionIO 0.11.0-incubating is installed at /opt/data/PredictionIO-0.11.0-incubating
> [INFO] [Management$] Inspecting Apache Spark...
> [INFO] [Management$] Apache Spark is installed at /usr/local/spark
> [INFO] [Management$] Apache Spark 1.6.3 detected (meets minimum requirement of 1.3.0)
> [INFO] [Management$] Inspecting storage backend connections...
> [INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
> [INFO] [Storage$] Verifying Model Data Backend (Source: HDFS)...
> [INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
> [INFO] [Storage$] Test writing to Event Store (App Id 0)...
> [ERROR] [Management$] Unable to connect to all storage backends successfully.
> The following shows the error message from the storage backend.
>
> Failed after attempts=1, exceptions:
> Wed Aug 02 11:45:04 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@43045f9f, java.net.SocketTimeoutException: Call to localhost/127.0.0.1:39562 failed because java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:51462 remote=localhost/127.0.0.1:39562] (org.apache.hadoop.hbase.client.RetriesExhaustedException)
>
> Dumping configuration of initialized storage backend sources.
> Please make sure they are correct.
>
> Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOSTS -> 127.0.0.1, TYPE -> elasticsearch, CLUSTERNAME -> elasticsearch
> Source Name: HBASE; Type: hbase; Configuration: TYPE -> hbase
> Source Name: HDFS; Type: hdfs; Configuration: TYPE -> hdfs, PATH -> /models
>
> Do you know what the problem is? How can I restart the services once the system fails?
>
> Thanks.
>
> Carlos Vidal.
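On the restart question at the end of the thread: the HBase errors above suggest bringing the stack back up in dependency order, storage first. A rough sketch, assuming the stock Hadoop/HBase control scripts (script names and their locations on this AMI are assumptions); it only prints the sequence so the order can be reviewed before running each step by hand:

```shell
# Print the assumed recovery order after a crash: stop HBase, restart
# HDFS underneath it, bring HBase back, then re-check with `pio status`.
# (Elasticsearch would be restarted via its own service manager if it
# also went down.) Script names are the stock Hadoop/HBase ones and may
# differ on the AMI.
restart_sequence() {
  printf '%s\n' \
    "stop-hbase.sh" \
    "stop-dfs.sh" \
    "start-dfs.sh" \
    "start-hbase.sh" \
    "pio status"
}

restart_sequence
```

Running `restart_sequence | sh -x` would execute the steps in order, but reviewing and running them one at a time is safer while the cause of the crash is still unknown.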