BTW a single-machine installation is not likely to be good for production because of resource contention. Think of it as a way to experiment and get an idea of how things work, what the input and tuning look like, etc. Then move to a multi-machine cluster for production, if only because it limits resource contention. The cluster can use smaller machines than a single all-in-one machine.
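To act on the suggestion further down the thread to try a smaller batch of data first, a sketch like the following could carve a sample off the front of the export before running `pio import` on it. It assumes the events file is in the one-JSON-object-per-line format that `pio import` reads; the file names, the fake event fields in the demo, and the sample size are only examples.

```python
import itertools
import os
import tempfile

def sample_events(src, dst, n):
    """Copy the first n lines (events) of src to dst; return lines written."""
    written = 0
    with open(src, "r", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in itertools.islice(fin, n):
            fout.write(line)
            written += 1
    return written

# Self-contained demo on a throwaway file holding 10 fake events.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "my_events.json")
    with open(src, "w", encoding="utf-8") as f:
        for i in range(10):
            f.write('{"event": "view", "entityType": "user", "entityId": "%d"}\n' % i)
    n = sample_events(src, os.path.join(tmp, "sample.json"), 3)
    print(n)  # -> 3
```

On the real file that would be `sample_events("my_events.json", "my_events_sample.json", 100000)` followed by `pio import --appid 4 --input my_events_sample.json`. The shell utility `split -l` would do the same job without Python.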
If you want actual results with enough data to make good recommendations, the quickest way may be to get a bigger instance (vertical scaling), but consider splitting it apart for production.

On Aug 3, 2017, at 8:32 AM, Pat Ferrel <[email protected]> wrote:

It should be easy to try a smaller batch of data first, since we are just guessing.

On Aug 2, 2017, at 11:22 PM, Carlos Vidal <[email protected]> wrote:

Hello Mahesh, Pat,

Thanks for your answers. I will try with a bigger EC2 instance.

Carlos.

2017-08-02 18:42 GMT+02:00 Pat Ferrel <[email protected]>:

Actually, memory may be your problem. Mahesh Hegde may be right about trying smaller sets. Since it sounds like you have all services running on one machine, they may be in contention for resources.

On Aug 2, 2017, at 9:35 AM, Pat Ferrel <[email protected]> wrote:

Something is not configured correctly. `pio import` should work with any size of file, but this may be an undersized instance for that much data. Spark needs memory: it keeps all the data it needs for a particular calculation spread across all cluster machines in memory, including derived data, so a total of 32g may not be enough. But that is not your current problem. I would start by verifying that all components are working properly, starting with HDFS, then HBase, then Spark, then Elasticsearch. I see several storage backend errors below.

On Aug 2, 2017, at 4:52 AM, Carlos Vidal <[email protected]> wrote:

Hello,

I have installed the pio + UR AMI in AWS, on an m4.2xlarge instance with 32GB of RAM and 8 vCPUs. When I try to import a 20GB events file for my application, the system crashes. The command I have used is:

pio import --appid 4 --input my_events.json

This command launches a Spark job that needs to perform 800 tasks. When the process reaches task 211, it crashes.
This is what I can see in my pio.log file:

2017-08-02 11:16:17,101 WARN org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation [htable-pool230-t1] - Encountered problems when prefetch hbase:meta table:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Wed Aug 02 11:07:06 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:44866
[the same FailedServerException repeats at 11:07:07, 11:07:07, and 11:07:08]
Wed Aug 02 11:07:10 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
[the same ConnectException is retried every 10-20 seconds until the final attempt:]
Wed Aug 02 11:16:17 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, java.net.ConnectException: Connection refused
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:129)
at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:714)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1153)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1217)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1105)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1062)
at org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:365)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:507)
at org.apache.hadoop.hbase.client.AsyncProcess.logAndResubmit(AsyncProcess.java:717)
at org.apache.hadoop.hbase.client.AsyncProcess.receiveGlobalFailure(AsyncProcess.java:664)
at org.apache.hadoop.hbase.client.AsyncProcess.access$100(AsyncProcess.java:93)
at org.apache.hadoop.hbase.client.AsyncProcess$1.run(AsyncProcess.java:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupConnection(RpcClient.java:578)
at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:868)
at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1543)
at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)
at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:29966)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1508)
at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:710)
at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:708)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
... 17 more
2017-08-02 11:21:04,430 ERROR org.apache.spark.scheduler.LiveListenerBus [Thread-3] - SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@66c4a5d2)
2017-08-02 11:21:04,431 ERROR org.apache.spark.scheduler.LiveListenerBus [Thread-3] - SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1501672864431,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
2017-08-02 11:28:47,129 INFO org.apache.predictionio.tools.commands.Management$ [main] - Inspecting PredictionIO...
2017-08-02 11:28:47,132 INFO org.apache.predictionio.tools.commands.Management$ [main] - PredictionIO 0.11.0-incubating is installed at /opt/data/PredictionIO-0.11.0-incubating
2017-08-02 11:28:47,132 INFO org.apache.predictionio.tools.commands.Management$ [main] - Inspecting Apache Spark...
2017-08-02 11:28:47,142 INFO org.apache.predictionio.tools.commands.Management$ [main] - Apache Spark is installed at /usr/local/spark
2017-08-02 11:28:47,175 INFO org.apache.predictionio.tools.commands.Management$ [main] - Apache Spark 1.6.3 detected (meets minimum requirement of 1.3.0)
2017-08-02 11:28:47,175 INFO org.apache.predictionio.tools.commands.Management$ [main] - Inspecting storage backend connections...
2017-08-02 11:28:47,195 INFO org.apache.predictionio.data.storage.Storage$ [main] - Verifying Meta Data Backend (Source: ELASTICSEARCH)...
2017-08-02 11:28:48,225 INFO org.apache.predictionio.data.storage.Storage$ [main] - Verifying Model Data Backend (Source: HDFS)...
2017-08-02 11:28:48,447 INFO org.apache.predictionio.data.storage.Storage$ [main] - Verifying Event Data Backend (Source: HBASE)...
2017-08-02 11:28:48,979 INFO org.apache.predictionio.data.storage.Storage$ [main] - Test writing to Event Store (App Id 0)...
2017-08-02 11:29:49,026 ERROR org.apache.predictionio.tools.commands.Management$ [main] - Unable to connect to all storage backends successfully.

On the other hand, once this happens, if I run `pio status` this is what I obtain:

aml@ip-10-41-11-227:~$ pio status
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/data/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/data/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.11.0-incubating is installed at /opt/data/PredictionIO-0.11.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at /usr/local/spark
[INFO] [Management$] Apache Spark 1.6.3 detected (meets minimum requirement of 1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: HDFS)...
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[INFO] [Storage$] Test writing to Event Store (App Id 0)...
[ERROR] [Management$] Unable to connect to all storage backends successfully. The following shows the error message from the storage backend.

Failed after attempts=1, exceptions:
Wed Aug 02 11:45:04 UTC 2017, org.apache.hadoop.hbase.client.RpcRetryingCaller@43045f9f, java.net.SocketTimeoutException: Call to localhost/127.0.0.1:39562 failed because java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:51462 remote=localhost/127.0.0.1:39562] (org.apache.hadoop.hbase.client.RetriesExhaustedException)

Dumping configuration of initialized storage backend sources. Please make sure they are correct.
Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOSTS -> 127.0.0.1, TYPE -> elasticsearch, CLUSTERNAME -> elasticsearch
Source Name: HBASE; Type: hbase; Configuration: TYPE -> hbase
Source Name: HDFS; Type: hdfs; Configuration: TYPE -> hdfs, PATH -> /models

Do you know what the problem is? How can I restart the services once the system fails?

Thanks.

Carlos Vidal.
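Since `pio status` only reports that some backend failed, one quick way to see which local service has actually died (following the advice earlier in the thread to verify HDFS, HBase, Spark, and Elasticsearch one by one) is to probe each port with a plain TCP connect. This is only a sketch: the port numbers below are common defaults, not something the AMI guarantees, so check core-site.xml, hbase-site.xml, elasticsearch.yml, etc. for the values actually in use.

```python
import socket

# Assumed default ports -- adjust to your own site configuration.
SERVICES = {
    "HDFS NameNode": 8020,
    "ZooKeeper": 2181,
    "HBase Master": 16000,
    "Spark Master": 7077,
    "Elasticsearch": 9200,
}

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if port_open("127.0.0.1", port) else "DOWN"
        print(f"{name:15s} 127.0.0.1:{port:<6d} {state}")
```

Any service showing DOWN here is a candidate for the "Connection refused" errors in the log above; a real health check would go further (e.g. curl the Elasticsearch cluster health endpoint), but a refused connect already narrows down which process to restart.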
