BTW a single-machine installation is not likely to be good for production 
because of resource contention between the services. Think of it as a way to 
experiment and get an idea of how things work, what the input and tuning look 
like, etc. Then move to a multi-machine cluster for production, if only 
because it limits resource contention. The cluster can use smaller machines 
than an all-in-one single machine.
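When you do split things apart, the pointers live in conf/pio-env.sh. A minimal sketch, with hypothetical hostnames standing in for your own cluster (HBase finds its remote cluster through the hbase-site.xml under the directory it points at):

```shell
# conf/pio-env.sh fragment: point PredictionIO at services on other machines.
# Hostnames below are placeholders, not real defaults.
SPARK_HOME=/usr/local/spark

# Elasticsearch metadata store running on its own node
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=es-node-1
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300

# HBase event store: PIO reads the cluster location (ZooKeeper quorum) from
# the hbase-site.xml found under this directory.
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=/usr/local/hbase
```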

If you want actual results with enough data to make good recommendations, the 
quickest way may be to get a bigger instance (vertical scaling), but consider 
splitting the services apart for production.


On Aug 3, 2017, at 8:32 AM, Pat Ferrel <[email protected]> wrote:

It should be easy to try a smaller batch of data first, since we are just 
guessing.
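One low-effort way to do that, assuming the events file is in the one-JSON-object-per-line format `pio import` expects (file name and app id are just the ones from Carlos's message below):

```shell
# Carve off the first million events as a trial batch.
head -n 1000000 my_events.json > my_events_sample.json

# If the small import completes, grow the batch; if it fails the same way,
# the problem is configuration, not data volume.
pio import --appid 4 --input my_events_sample.json
```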


On Aug 2, 2017, at 11:22 PM, Carlos Vidal <[email protected]> wrote:

Hello Mahesh, Pat

Thanks for your answers. I will try with a bigger EC2 instance.

Carlos.

2017-08-02 18:42 GMT+02:00 Pat Ferrel <[email protected]>:
Actually, memory may be your problem. Mahesh Hegde may be right about trying 
smaller sets. Since it sounds like you have all services running on one 
machine, they may be contending for resources.


On Aug 2, 2017, at 9:35 AM, Pat Ferrel <[email protected]> wrote:

Something is not configured correctly. `pio import` should work with any size 
of file, but this may be an undersized instance for that much data.

Spark needs memory: it keeps all the data required for a particular 
calculation spread across the cluster machines in memory. That includes 
derived data, so a total of 32GB may not be enough. But that is not your 
current problem.
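If memory does turn out to be the limit, anything after `--` on a pio command line is handed through to spark-submit, so you can raise the allocations without editing config files. The sizes below are only a guess for a 32GB box shared with HBase and Elasticsearch:

```shell
# Pass spark-submit flags through pio; leave headroom for the other services.
pio import --appid 4 --input my_events.json -- \
  --driver-memory 4g --executor-memory 8g
```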

I would start by verifying that all components are working properly, starting 
with HDFS, then HBase, then Spark, then Elasticsearch. I see several storage 
backend errors below.
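A rough check pass in that order might look like this (ports and paths assume a stock single-machine install; adjust for your layout):

```shell
# Are the daemons even running? Should list NameNode, DataNode, HMaster,
# HRegionServer, Elasticsearch, etc.
jps

# HDFS: live datanodes and remaining capacity
hdfs dfsadmin -report

# HBase: server counts, and no dead region servers
echo "status" | hbase shell

# Elasticsearch: cluster status should be green or yellow
curl -s "localhost:9200/_cluster/health?pretty"

# Finally, PredictionIO's own end-to-end storage check
pio status
```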



On Aug 2, 2017, at 4:52 AM, Carlos Vidal <[email protected]> wrote:

Hello,

I have installed the pio + UR AMI in AWS, on an m4.2xlarge instance with 32GB 
of RAM and 8 vCPUs.

When I try to import a 20GB events file for my application, the system 
crashes. The command I have used is:


pio import --appid 4 --input my_events.json

This command launches a Spark job that needs to perform 800 tasks. When the 
process reaches task 211, it crashes. This is what I can see in my pio.log 
file:

2017-08-02 11:16:17,101 WARN  
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation 
[htable-pool230-t1] - Encountered problems when prefetch hbase:meta table: 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=35, exceptions:
Wed Aug 02 11:07:06 UTC 2017, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, 
org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in 
the failed servers list: localhost/127.0.0.1:44866
Wed Aug 02 11:07:07 UTC 2017, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, 
org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in 
the failed servers list: localhost/127.0.0.1:44866
Wed Aug 02 11:07:07 UTC 2017, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, 
org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in 
the failed servers list: localhost/127.0.0.1:44866
Wed Aug 02 11:07:08 UTC 2017, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, 
org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in 
the failed servers list: localhost/127.0.0.1:44866
Wed Aug 02 11:07:10 UTC 2017, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@475db952, 
java.net.ConnectException: Connection refused
[... the same ConnectException entry repeats with increasing back-off, roughly 
every 10-20 seconds, until Wed Aug 02 11:16:17 UTC 2017 ...]

        at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:129)
        at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:714)
        at 
org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1153)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1217)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1105)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1062)
        at 
org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:365)
        at 
org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:507)
        at 
org.apache.hadoop.hbase.client.AsyncProcess.logAndResubmit(AsyncProcess.java:717)
        at 
org.apache.hadoop.hbase.client.AsyncProcess.receiveGlobalFailure(AsyncProcess.java:664)
        at 
org.apache.hadoop.hbase.client.AsyncProcess.access$100(AsyncProcess.java:93)
        at 
org.apache.hadoop.hbase.client.AsyncProcess$1.run(AsyncProcess.java:547)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupConnection(RpcClient.java:578)
        at 
org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:868)
        at 
org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1543)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)
        at 
org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
        at 
org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
        at 
org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:29966)
        at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1508)
        at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:710)
        at org.apache.hadoop.hbase.client.HTable$2.call(HTable.java:708)
        at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
        ... 17 more
2017-08-02 11:21:04,430 ERROR org.apache.spark.scheduler.LiveListenerBus 
[Thread-3] - SparkListenerBus has already stopped! Dropping event 
SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@66c4a5d2)
2017-08-02 11:21:04,431 ERROR org.apache.spark.scheduler.LiveListenerBus 
[Thread-3] - SparkListenerBus has already stopped! Dropping event 
SparkListenerJobEnd(0,1501672864431,JobFailed(org.apache.spark.SparkException: 
Job 0 cancelled because SparkContext was shut down))
2017-08-02 11:28:47,129 INFO  
org.apache.predictionio.tools.commands.Management$ [main] - Inspecting 
PredictionIO...
2017-08-02 11:28:47,132 INFO  
org.apache.predictionio.tools.commands.Management$ [main] - PredictionIO 
0.11.0-incubating is installed at /opt/data/PredictionIO-0.11.0-incubating
2017-08-02 11:28:47,132 INFO  
org.apache.predictionio.tools.commands.Management$ [main] - Inspecting Apache 
Spark...
2017-08-02 11:28:47,142 INFO  
org.apache.predictionio.tools.commands.Management$ [main] - Apache Spark is 
installed at /usr/local/spark
2017-08-02 11:28:47,175 INFO  
org.apache.predictionio.tools.commands.Management$ [main] - Apache Spark 1.6.3 
detected (meets minimum requirement of 1.3.0)
2017-08-02 11:28:47,175 INFO  
org.apache.predictionio.tools.commands.Management$ [main] - Inspecting storage 
backend connections...
2017-08-02 11:28:47,195 INFO  org.apache.predictionio.data.storage.Storage$ 
[main] - Verifying Meta Data Backend (Source: ELASTICSEARCH)...
2017-08-02 11:28:48,225 INFO  org.apache.predictionio.data.storage.Storage$ 
[main] - Verifying Model Data Backend (Source: HDFS)...
2017-08-02 11:28:48,447 INFO  org.apache.predictionio.data.storage.Storage$ 
[main] - Verifying Event Data Backend (Source: HBASE)...
2017-08-02 11:28:48,979 INFO  org.apache.predictionio.data.storage.Storage$ 
[main] - Test writing to Event Store (App Id 0)...
2017-08-02 11:29:49,026 ERROR 
org.apache.predictionio.tools.commands.Management$ [main] - Unable to connect 
to all storage backends successfully.






On the other hand, once this happens, if I run `pio status` this is what I 
obtain:

aml@ip-10-41-11-227:~$ pio status
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/data/PredictionIO-0.11.0-incubating/lib/spark/pio-data-hdfs-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/data/PredictionIO-0.11.0-incubating/lib/pio-assembly-0.11.0-incubating.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.11.0-incubating is installed at 
/opt/data/PredictionIO-0.11.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at /usr/local/spark
[INFO] [Management$] Apache Spark 1.6.3 detected (meets minimum requirement of 
1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: HDFS)...
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[INFO] [Storage$] Test writing to Event Store (App Id 0)...
[ERROR] [Management$] Unable to connect to all storage backends successfully.
The following shows the error message from the storage backend.

Failed after attempts=1, exceptions:
Wed Aug 02 11:45:04 UTC 2017, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@43045f9f, 
java.net.SocketTimeoutException: Call to localhost/127.0.0.1:39562 failed 
because java.net.SocketTimeoutException: 60000 millis timeout while waiting 
for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/127.0.0.1:51462 
remote=localhost/127.0.0.1:39562]
 (org.apache.hadoop.hbase.client.RetriesExhaustedException)

Dumping configuration of initialized storage backend sources.
Please make sure they are correct.

Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOSTS -> 
127.0.0.1, TYPE -> elasticsearch, CLUSTERNAME -> elasticsearch
Source Name: HBASE; Type: hbase; Configuration: TYPE -> hbase
Source Name: HDFS; Type: hdfs; Configuration: TYPE -> hdfs, PATH -> /models

Do you know what the problem is? How can I restart the services once the 
system fails?

Thanks.

Carlos Vidal.




