Pat, this worked! Thank you very much! The only odd thing now is that all the scores I get are 0. For example:
Using the dataset:

"u1","i1"
"u2","i1"
"u2","i2"
"u3","i2"
"u3","i3"
"u4","i4"

echo "Recommendations for user: u1"
echo ""
curl -H "Content-Type: application/json" -d '
{
  "user": "u1"
}' http://localhost:8000/queries.json
echo ""

What I get is:

{"itemScores":[{"item":"\"i2\"","score":0.0},{"item":"\"i1\"","score":0.0},{"item":"\"i3\"","score":0.0},{"item":"\"i4\"","score":0.0}]}

If user u1 has viewed i1, and user u2 has viewed i1 and i2, then I think the algorithm should return a non-zero score for i2 (and possibly i1, too). Even using the bigger dataset with 100 items I still get scores of 0 for everything.
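To show the kind of relationship I expect the algorithm to pick up, here is a rough sketch of plain co-occurrence counting on the tiny dataset (naive counting only, not the actual CCO/LLR scoring the UR uses):

# Rough sketch: for each pair of items, count how many users viewed both.
# This is only naive co-occurrence counting, NOT the UR's CCO/LLR scoring,
# so it just illustrates why I expect i2 to get a non-zero score for u1.
from collections import defaultdict
from itertools import combinations

views = [("u1", "i1"), ("u2", "i1"), ("u2", "i2"),
         ("u3", "i2"), ("u3", "i3"), ("u4", "i4")]

items_by_user = defaultdict(set)
for user, item in views:
    items_by_user[user].add(item)

cooccurrence = defaultdict(int)
for items in items_by_user.values():
    for a, b in combinations(sorted(items), 2):
        cooccurrence[(a, b)] += 1

print dict(cooccurrence)
# {('i1', 'i2'): 1, ('i2', 'i3'): 1}  -> i1 and i2 co-occur for u2,
# which is why I expect a non-zero score for i2 when querying for u1.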
So now I'm going to spend some time reading the following documentation, unless there is some other documentation you recommend I read first!

- [The Universal Recommender](http://actionml.com/docs/ur)
- [The Correlated Cross-Occurrence Algorithm](http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html)
- [The Universal Recommender Slide Deck](http://www.slideshare.net/pferrel/unified-recommender-39986309)
- [Multi-domain predictive AI or how to make one thing predict another](https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/)

Thank you very much for all your patience and help getting me to this point!

Best regards,
Noelia

On 18 October 2017 at 18:33, Pat Ferrel <p...@occamsmachete.com> wrote:

It is the UR, so events are taken from the EventStore and converted into a Mahout DistributedRowMatrix of RandomAccessSparseVectors, which are both serializable. This path works fine and has for several years.

This must be a config problem, like not using the MahoutKryoRegistrator, which registers the serializers for these.

@Noelia, you have left out the sparkConf section of the engine.json. The one used in the integration test should work:

{
  "comment":" This config file uses default settings for all but the required values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "com.actionml.RecommendationEngine",
  "datasource": {
    "params" : {
      "name": "tiny_app_data.csv",
      "appName": "TinyApp",
      "eventNames": ["view"]
    }
  },
  "sparkConf": {   <================= THIS WAS LEFT OUT IN YOUR ENGINE.JSON BELOW IN THIS THREAD
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "es.index.auto.create": "true"
  },
  "algorithms": [
    {
      "comment": "simplest setup where all values are default, popularity based backfill, must add eventsNames",
      "name": "ur",
      "params": {
        "appName": "TinyApp",
        "indexName": "urindex",
        "typeName": "items",
        "comment": "must have data for the first event or the model will not build, other events are optional",
        "eventNames": ["view"]
      }
    }
  ]
}

On Oct 18, 2017, at 8:49 AM, Donald Szeto <don...@apache.org> wrote:

Chiming in a bit. Looking at the serialization error, it looks like we are just one little step away from getting this to work.

Noelia, what does your synthesized data look like? All data that is processed by Spark needs to be serializable. At some point, a non-serializable vector object showing in the stack is created out of your synthesized data. It would be great to know what your input event looks like and to see where in the code path this was caused.

Regards,
Donald

On Tue, Oct 17, 2017 at 12:14 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:

Pat, you mentioned the problem could be that the data I was using was too small. So now I'm using the attached data file as the data (4 users and 100 items). But I'm still getting the same error. I'm sorry I forgot to mention I had increased the dataset.

The reason I want to make it work with a very small dataset is that I want to be able to follow the calculations. I want to understand what the UR is doing and understand the impact of changing this or that, here or there... I find that easier to achieve with a small example in which I know exactly what's happening. I want to build trust in my understanding of the UR before I move on to applying it to a real problem. If I'm not confident that I know how to use it, how can I tell my client that the results I'm getting are good with any degree of confidence?

On 16 October 2017 at 20:44, Pat Ferrel <p...@occamsmachete.com> wrote:

So all setup is the same for the integration-test and your modified test *except the data*?

The error looks like a setup problem because the serialization should happen with either test. But if the only difference really is the data, then toss it and use either real data or the integration test data; why synthesize fake data if it causes the error?

BTW the data you include below in this thread would never create internal IDs as high as 94 in the vector. You must have switched to a new dataset???

I would get a dump of your data using `pio export` and make sure it's what you thought it was. You claim to have only 4 user ids and 4 item ids, but the serialized vector thinks you have at least 94 user or item ids. Something doesn't add up.

On Oct 16, 2017, at 4:43 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Pat, you are absolutely right! I increased the sleep time and now the integration test for handmade works perfectly.

However, the integration test adapted to run with my tiny app runs into the same problem I've been having with this app:

[ERROR] [TaskSetManager] Task 1.0 in stage 10.0 (TID 23) had a not serializable result: org.apache.mahout.math.RandomAccessSparseVector
Serialization stack:
    - object not serializable (class: org.apache.mahout.math.RandomAccessSparseVector, value: {66:1.0,29:1.0,70:1.0,91:1.0,58:1.0,37:1.0,13:1.0,8:1.0,94:1.0,30:1.0,57:1.0,22:1.0,20:1.0,35:1.0,97:1.0,60:1.0,27:1.0,72:1.0,3:1.0,34:1.0,77:1.0,46:1.0,81:1.0,86:1.0,43:1.0})
    - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
    - object (class scala.Tuple2, (1,{66:1.0,29:1.0,70:1.0,91:1.0,58:1.0,37:1.0,13:1.0,8:1.0,94:1.0,30:1.0,57:1.0,22:1.0,20:1.0,35:1.0,97:1.0,60:1.0,27:1.0,72:1.0,3:1.0,34:1.0,77:1.0,46:1.0,81:1.0,86:1.0,43:1.0})); not retrying
[ERROR] [TaskSetManager] Task 2.0 in stage 10.0 (TID 24) had a not serializable result: org.apache.mahout.math.RandomAccessSparseVector
Serialization stack:

...

Any ideas?

On 15 October 2017 at 19:09, Pat Ferrel <p...@occamsmachete.com> wrote:

This is probably a timing issue in the integration test, which has to wait for `pio deploy` to finish before the queries can be made. If it doesn't finish, the queries will fail.
By the time the rest of the test quits, the model has been deployed, so you can run queries. In the integration-test script, increase the delay after `pio deploy…` and see if it passes then.

This is probably an integration-test script problem, not a problem in the system.

On Oct 6, 2017, at 4:21 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Pat,

I have run the integration test for the handmade example out of curiosity. Strangely enough, things go more or less as expected, apart from the fact that I get a message saying:

...
[INFO] [CoreWorkflow$] Updating engine instance
[INFO] [CoreWorkflow$] Training completed successfully.
Model will remain deployed after this test
Waiting 30 seconds for the server to start
nohup: redirecting stderr to stdout
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to localhost port 8000: Connection refused

So the integration test does not manage to get the recommendations, even though the model trained and deployed successfully. However, as soon as the integration test finishes, on the same terminal, I can get the recommendations by doing the following:

$ curl -H "Content-Type: application/json" -d '
> {
>   "user": "u1"
> }' http://localhost:8000/queries.json
{"itemScores":[{"item":"Nexus","score":0.057719700038433075},{"item":"Surface","score":0.0}]}

Isn't this odd? Can you guess what's going on?

Thank you very much for all your support!
noelia

On 5 October 2017 at 19:22, Pat Ferrel <p...@occamsmachete.com> wrote:

Ok, that config should work. Does the integration test pass?

The data you are using is extremely small, and though it does look like it has cooccurrences, they may not meet the minimum "big-data" thresholds used by default. Try adding more data, or use the handmade example data: rename purchase to view and discard the existing view data if you wish.

The error is very odd and I've never seen it. If the integration test works I can only surmise it's your data.

On Oct 5, 2017, at 12:02 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

SPARK: spark-1.6.3-bin-hadoop2.6

PIO: 0.11.0-incubating

Scala: whatever gets installed when installing PIO 0.11.0-incubating; I haven't installed Scala separately.

UR: ActionML's UR v0.6.0, I suppose, as that's the last version mentioned in the readme file. I have attached the UR zip file I downloaded from the actionml github account.

Thank you for your help!!

On 4 October 2017 at 17:20, Pat Ferrel <p...@occamsmachete.com> wrote:

What versions of Scala, Spark, PIO, and UR are you using?

On Oct 4, 2017, at 6:10 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Hi all,

I'm still trying to create a very simple app to learn to use PredictionIO and I'm still having trouble. I have done `pio build` with no problem, but when I do `pio train` I get a very long error message related to serialisation (error message copied below).

`pio status` reports the system is all ready to go.
The app I'm trying to build is very simple; it only has 'view' events. Here's the engine.json:

===========================================================
{
  "comment":" This config file uses default settings for all but the required values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "com.actionml.RecommendationEngine",
  "datasource": {
    "params" : {
      "name": "tiny_app_data.csv",
      "appName": "TinyApp",
      "eventNames": ["view"]
    }
  },
  "algorithms": [
    {
      "comment": "simplest setup where all values are default, popularity based backfill, must add eventsNames",
      "name": "ur",
      "params": {
        "appName": "TinyApp",
        "indexName": "urindex",
        "typeName": "items",
        "comment": "must have data for the first event or the model will not build, other events are optional",
        "eventNames": ["view"]
      }
    }
  ]
}
===========================================================

The data I'm using is:

"u1","i1"
"u2","i1"
"u2","i2"
"u3","i2"
"u3","i3"
"u4","i4"

meaning user u viewed item i.

The data has been added to the database with the following python code:

===========================================================
"""
Import sample data for recommendation engine
"""

import predictionio
import argparse
import random

RATE_ACTIONS_DELIMITER = ","
SEED = 1


def import_events(client, file):
    f = open(file, 'r')
    random.seed(SEED)
    count = 0
    print "Importing data..."

    items = []
    users = []
    f = open(file, 'r')
    for line in f:
        data = line.rstrip('\r\n').split(RATE_ACTIONS_DELIMITER)
        users.append(data[0])
        items.append(data[1])
        client.create_event(
            event="view",
            entity_type="user",
            entity_id=data[0],
            target_entity_type="item",
            target_entity_id=data[1]
        )
        print "Event: " + "view" + " entity_id: " + data[0] + " target_entity_id: " + data[1]
        count += 1
    f.close()

    users = set(users)
    items = set(items)
    print "All users: " + str(users)
    print "All items: " + str(items)
    for item in items:
        client.create_event(
            event="$set",
            entity_type="item",
            entity_id=item
        )
        count += 1

    print "%s events are imported." % count
if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Import sample data for recommendation engine")
    parser.add_argument('--access_key', default='invald_access_key')
    parser.add_argument('--url', default="http://localhost:7070")
    parser.add_argument('--file', default="./data/tiny_app_data.csv")

    args = parser.parse_args()
    print args

    client = predictionio.EventClient(
        access_key=args.access_key,
        url=args.url,
        threads=5,
        qsize=500)
    import_events(client, args.file)
===========================================================
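As a quick sanity check that the events really reached the EventServer, I also list a few of them back over its REST API. This is only a rough sketch: it assumes the EventServer is on the default port 7070, that the `requests` library is installed, and that ACCESS_KEY is the same key used by the import script above.

# Sanity-check sketch (not part of the import script): list a few stored
# events back from the EventServer over its REST API. Assumes the default
# EventServer port 7070 and the access key used for the import.
import requests

ACCESS_KEY = "YOUR_ACCESS_KEY"  # placeholder, use your app's access key

resp = requests.get(
    "http://localhost:7070/events.json",
    params={"accessKey": ACCESS_KEY, "limit": 10}
)
print resp.status_code
for event in resp.json():
    print event["event"], event["entityId"], event.get("targetEntityId")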
My pio_env.sh is the following:

===========================================================
#!/usr/bin/env bash
#
# Copy this file as pio-env.sh and edit it for your site's configuration.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# PredictionIO Main Configuration
#
# This section controls core behavior of PredictionIO. It is very likely that
# you need to change these to fit your site.

# SPARK_HOME: Apache Spark is a hard dependency and must be configured.
# SPARK_HOME=$PIO_HOME/vendors/spark-2.0.2-bin-hadoop2.7
SPARK_HOME=$PIO_HOME/vendors/spark-1.6.3-bin-hadoop2.6

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.1.4.jar
MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

# ES_CONF_DIR: You must configure this if you have advanced configuration for
#              your Elasticsearch setup.
# ES_CONF_DIR=/opt/elasticsearch
#ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-1.7.6

# HADOOP_CONF_DIR: You must configure this if you intend to run PredictionIO
#                  with Hadoop 2.
# HADOOP_CONF_DIR=/opt/hadoop

# HBASE_CONF_DIR: You must configure this if you intend to run PredictionIO
#                 with HBase on a remote cluster.
# HBASE_CONF_DIR=$PIO_HOME/vendors/hbase-1.0.0/conf

# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
#
# This section controls programs that make use of PredictionIO's built-in
# storage facilities. Default values are shown below.
#
# For more information on storage configuration please refer to
# http://predictionio.incubator.apache.org/system/anotherdatastore/

# Storage Repositories

# Default is to use PostgreSQL
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

# Storage Data Sources

# PostgreSQL Default Settings
# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL
# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

# MySQL Example
# PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio
# PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_MYSQL_PASSWORD=pio

# Elasticsearch Example
# PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
# PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-5.2.1
# Elasticsearch 1.x Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=myprojectES
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-1.7.6

# Local File System Example
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models

# HBase Example
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=$PIO_HOME/vendors/hbase-1.2.6
===========================================================

Error message:

===========================================================
[ERROR] [TaskSetManager] Task 2.0 in stage 10.0 (TID 24) had a not serializable result: org.apache.mahout.math.RandomAccessSparseVector
Serialization stack:
    - object not serializable (class: org.apache.mahout.math.RandomAccessSparseVector, value: {3:1.0,2:1.0})
    - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
    - object (class scala.Tuple2, (2,{3:1.0,2:1.0})); not retrying
[ERROR] [TaskSetManager] Task 3.0 in stage 10.0 (TID 25) had a not serializable result: org.apache.mahout.math.RandomAccessSparseVector
Serialization stack:
    - object not serializable (class: org.apache.mahout.math.RandomAccessSparseVector, value: {0:1.0,3:1.0})
    - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
    - object (class scala.Tuple2, (3,{0:1.0,3:1.0})); not retrying
[ERROR] [TaskSetManager] Task 1.0 in stage 10.0 (TID 23) had a not serializable result: org.apache.mahout.math.RandomAccessSparseVector
Serialization stack:
    - object not serializable (class: org.apache.mahout.math.RandomAccessSparseVector, value: {1:1.0})
    - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
    - object (class scala.Tuple2, (1,{1:1.0})); not retrying
[ERROR] [TaskSetManager] Task 0.0 in stage 10.0 (TID 22) had a not serializable result: org.apache.mahout.math.RandomAccessSparseVector
Serialization stack:
    - object not serializable (class: org.apache.mahout.math.RandomAccessSparseVector, value: {0:1.0})
    - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
    - object (class scala.Tuple2, (0,{0:1.0})); not retrying
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0 in stage 10.0 (TID 24) had a not serializable result: org.apache.mahout.math.RandomAccessSparseVector
Serialization stack:
    - object not serializable (class: org.apache.mahout.math.RandomAccessSparseVector, value: {3:1.0,2:1.0})
    - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
    - object (class scala.Tuple2, (2,{3:1.0,2:1.0}))
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
    at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1088)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.fold(RDD.scala:1082)
    at org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark.computeNRow(CheckpointedDrmSpark.scala:188)
    at org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark.nrow$lzycompute(CheckpointedDrmSpark.scala:55)
    at org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark.nrow(CheckpointedDrmSpark.scala:55)
    at org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark.newRowCardinality(CheckpointedDrmSpark.scala:219)
    at com.actionml.IndexedDatasetSpark$.apply(Preparator.scala:213)
    at com.actionml.Preparator$$anonfun$3.apply(Preparator.scala:71)
    at com.actionml.Preparator$$anonfun$3.apply(Preparator.scala:49)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at com.actionml.Preparator.prepare(Preparator.scala:49)
    at com.actionml.Preparator.prepare(Preparator.scala:32)
    at org.apache.predictionio.controller.PPreparator.prepareBase(PPreparator.scala:37)
    at org.apache.predictionio.controller.Engine$.train(Engine.scala:671)
    at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
    at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
    at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
    at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
===========================================================

Thank you all for your help.

Best regards,
noelia