Ted, is it also possible to use ItemSimilarityJob in a non-distributed environment?

On 17.04.2014 at 16:22, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Najum,
>
> You should also be able to use the ItemSimilarityJob to compute a limited
> indicator set.
>
> This is stepping off of the path you have been on, but it would allow you
> to deploy the recommender via a search engine.
>
> That makes a lot of code simply vanish. This is also a well-trod
> production path.
>
> On Thu, Apr 17, 2014 at 3:57 AM, Najum Ali <naju...@googlemail.com> wrote:
>
>> @Sebastian
>>
>> Wow … you are right. The original CSV file is about 21 MB and the
>> corresponding precomputed item-item similarity file is about 260 MB!
>> And yes, there are far more than 50 "most similar items" per item.
>>
>> Restricting this to the 50 (or so) most similar items per item could
>> do the trick, as you said. OK, I will give it a try and reply later.
>>
>> By the way, what about SamplingCandidateItemsStrategy or something
>> like it, used through this constructor:
>>
>> GenericItemBasedRecommender
>> <https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.html#GenericItemBasedRecommender(org.apache.mahout.cf.taste.model.DataModel,%20org.apache.mahout.cf.taste.similarity.ItemSimilarity,%20org.apache.mahout.cf.taste.recommender.CandidateItemsStrategy,%20org.apache.mahout.cf.taste.recommender.MostSimilarItemsCandidateItemsStrategy)>
>> (DataModel<https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/model/DataModel.html>
>> dataModel,
>> ItemSimilarity<https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/similarity/ItemSimilarity.html>
>> similarity,
>> CandidateItemsStrategy<https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/recommender/CandidateItemsStrategy.html>
>> candidateItemsStrategy,
>> MostSimilarItemsCandidateItemsStrategy<https://builds.apache.org/job/mahout-quality/javadoc/org/apache/mahout/cf/taste/recommender/MostSimilarItemsCandidateItemsStrategy.html>
>> mostSimilarItemsCandidateItemsStrategy)
>>
>>
>> On 17.04.2014 at 12:41, Sebastian Schelter <s...@apache.org> wrote:
>>
>> Hi Najum,
>>
>> I think I found the problem. Remember: two items are similar whenever at
>> least one user interacted with both of them ("the items co-occur").
>>
>> In the MovieLens dataset this is true for almost all pairs of items,
>> unfortunately. From 3706 items, more than 11 million similarities are
>> created. A common approach for that (which is not yet implemented in our
>> precomputation, unfortunately) is to retain only the top-k similar items
>> per item.
>>
>> A solution would be to take the CSV file that is created by the
>> MultithreadedBatchItemSimilarities and postprocess it so that only the 50
>> most similar items per item are retained. That should help with your
>> problem.
>>
>> Unfortunately, we don't have code for that yet. Maybe you want to try to
>> write that yourself?
>>
>> Best,
>> Sebastian
>>
>> PS: The user-based recommender restricts the number of similar users. I
>> guess that's why it is so fast here.
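
The postprocessing step Sebastian describes could be sketched roughly as follows. This is only a minimal sketch, not Mahout code: the class TopKSimilarities and the method retainTopK are hypothetical names, and it assumes each line of the similarity file has the form "itemA,itemB,similarity", which is the format the parser further down in this thread expects. Grouping by the first column is also an assumption about how the file is laid out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper (not part of Mahout): prunes "itemA,itemB,similarity"
// lines so that only the k highest-scoring lines per itemA survive.
final class TopKSimilarities {

  // One CSV line together with its parsed similarity value.
  private static final class Entry {
    final String line;
    final double similarity;

    Entry(String line, double similarity) {
      this.line = line;
      this.similarity = similarity;
    }
  }

  // Orders entries by ascending similarity, so the weakest entry is the heap head.
  private static final Comparator<Entry> BY_SIMILARITY =
      Comparator.comparingDouble((Entry e) -> e.similarity);

  static List<String> retainTopK(Iterable<String> lines, int k) {
    // One bounded min-heap per item (first column), in order of first appearance.
    Map<Long, PriorityQueue<Entry>> best = new LinkedHashMap<>();
    for (String line : lines) {
      String[] row = line.split(",");
      long itemA = Long.parseLong(row[0].trim());
      double similarity = Double.parseDouble(row[2].trim());
      PriorityQueue<Entry> heap =
          best.computeIfAbsent(itemA, id -> new PriorityQueue<Entry>(BY_SIMILARITY));
      heap.add(new Entry(line, similarity));
      if (heap.size() > k) {
        heap.poll(); // drop the currently weakest similarity for this item
      }
    }
    // Emit the surviving lines, strongest similarity first within each item.
    List<String> result = new ArrayList<>();
    for (PriorityQueue<Entry> heap : best.values()) {
      List<Entry> kept = new ArrayList<>(heap);
      kept.sort(BY_SIMILARITY.reversed());
      for (Entry e : kept) {
        result.add(e.line);
      }
    }
    return result;
  }
}
```

With k = 50 this would correspond to the "50 most similar items per item" suggested above; the lines could be read with Files.readAllLines and written back with Files.write. One caveat, as far as I know: GenericItemSimilarity treats pairs as symmetric, so an item can still end up with more than k neighbours when it appears in the second column of other items' lists.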
>>
>>
>> On 04/17/2014 12:18 PM, Najum Ali wrote:
>>
>> OK, here you go:
>>
>> I have created a simple class with a main method (no server or other stuff):
>>
>> public class RecommenderTest {
>>   public static void main(String[] args) throws IOException, TasteException {
>>     DataModel dataModel = new FileDataModel(new
>>         File("/Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv"));
>>     ItemSimilarity similarity = new LogLikelihoodSimilarity(dataModel);
>>     ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
>>
>>     // Precompute the item-item similarities and write them to a CSV file.
>>     String pathToPreComputedFile = preComputeSimilarities(recommender, dataModel.getNumItems());
>>
>>     // Load the precomputed similarities back into an in-memory ItemSimilarity.
>>     InputStream inputStream = new FileInputStream(new File(pathToPreComputedFile));
>>     BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
>>     Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
>>         bufferedReader.lines().map(mapToItemItemSimilarity).collect(Collectors.toList());
>>     ItemSimilarity precomputedSimilarity = new GenericItemSimilarity(correlations);
>>     ItemBasedRecommender recommenderWithPrecomputation =
>>         new GenericItemBasedRecommender(dataModel, precomputedSimilarity);
>>
>>     // Time the same request with and without precomputed similarities.
>>     recommend(recommender);
>>     recommend(recommenderWithPrecomputation);
>>   }
>>
>>   private static String preComputeSimilarities(ItemBasedRecommender recommender,
>>       int simItemsPerItem) throws TasteException {
>>     String pathToAbsolutePath = "";
>>     try {
>>       File resultFile = new File(System.getProperty("java.io.tmpdir"), "similarities.csv");
>>       if (resultFile.exists()) {
>>         resultFile.delete();
>>       }
>>       BatchItemSimilarities batchJob =
>>           new MultithreadedBatchItemSimilarities(recommender, simItemsPerItem);
>>       int numSimilarities =
>>           batchJob.computeItemSimilarities(Runtime.getRuntime().availableProcessors(), 1,
>>               new FileSimilarItemsWriter(resultFile));
>>       pathToAbsolutePath = resultFile.getAbsolutePath();
>>       System.out.println("Computed " + numSimilarities + " similarities and saved them to "
>>           + pathToAbsolutePath);
>>     } catch (IOException e) {
>>       System.out.println("Error while writing pre computed similarities to file");
>>     }
>>     return pathToAbsolutePath;
>>   }
>>
>>   private static void recommend(ItemBasedRecommender recommender) throws TasteException {
>>     long start = System.nanoTime();
>>     List<RecommendedItem> recommendations = recommender.recommend(1, 10);
>>     long end = System.nanoTime();
>>     System.out.println("Created recommendations in "
>>         + getCalculationTimeInMilliseconds(start, end) + " ms. Recommendations:"
>>         + recommendations);
>>   }
>>
>>   private static double getCalculationTimeInMilliseconds(long start, long end) {
>>     double calculationTime = (end - start);
>>     return (calculationTime / 1_000_000);
>>   }
>>
>>   // Parses one "itemA,itemB,similarity" line of the precomputed CSV file.
>>   private static Function<String, GenericItemSimilarity.ItemItemSimilarity>
>>       mapToItemItemSimilarity = (line) -> {
>>     String[] row = line.split(",");
>>     return new GenericItemSimilarity.ItemItemSimilarity(
>>         Long.parseLong(row[0]), Long.parseLong(row[1]), Double.parseDouble(row[2]));
>>   };
>> }
>>
>> And that's the output log:
>>
>> 3 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Creating FileDataModel for file /Users/najum/Documents/recommender-console/src/main/webapp/resources/preference_csv/1mil.csv
>> 63 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Reading file info...
>> 1207 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Processed 1000000 lines
>> 1208 [main] INFO org.apache.mahout.cf.taste.impl.model.file.FileDataModel - Read lines: 1000209
>> 1475 [main] INFO org.apache.mahout.cf.taste.impl.model.GenericDataModel - Processed 6040 users
>> 1599 [main] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - Queued 3706 items in 38 batches
>> 10928 [pool-1-thread-8] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 7 processed 5 batches
>> 10928 [pool-1-thread-8] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 7 processed 5 batches. done.
>> 10978 [pool-1-thread-5] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 4 processed 4 batches. done.
>> 11589 [pool-1-thread-4] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 3 processed 5 batches
>> 11589 [pool-1-thread-4] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 3 processed 5 batches. done.
>> 11592 [pool-1-thread-6] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 5 processed 5 batches
>> 11592 [pool-1-thread-6] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 5 processed 5 batches. done.
>> 11707 [pool-1-thread-7] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 6 processed 5 batches
>> 11707 [pool-1-thread-7] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 6 processed 5 batches. done.
>> 11730 [pool-1-thread-3] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 2 processed 4 batches. done.
>> 11849 [pool-1-thread-1] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 0 processed 5 batches
>> 11849 [pool-1-thread-1] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 0 processed 5 batches. done.
>> 11854 [pool-1-thread-2] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 1 processed 5 batches
>> 11854 [pool-1-thread-2] INFO org.apache.mahout.cf.taste.impl.similarity.precompute.MultithreadedBatchItemSimilarities - worker 1 processed 5 batches. done.
>> Computed 9174333 similarities and saved them to
>> /var/folders/9g/4h38v1tj3ps9j21skc72b56r0000gn/T/similarities.csv
>> Created recommendations in *1683.613 ms*.
>> Recommendations:[RecommendedItem[item:3890, value:4.6771617],
>> RecommendedItem[item:3530, value:4.662509], RecommendedItem[item:127,
>> value:4.660716], RecommendedItem[item:3323, value:4.660716],
>> RecommendedItem[item:3382, value:4.660716], RecommendedItem[item:3123,
>> value:4.603366], RecommendedItem[item:3233, value:4.5707765],
>> RecommendedItem[item:1434, value:4.553473], RecommendedItem[item:989,
>> value:4.5263577], RecommendedItem[item:2343, value:4.524066]]
>> Created recommendations in *985.679 ms*.
>> Recommendations:[RecommendedItem[item:3530, value:5.0],
>> RecommendedItem[item:3382, value:5.0], RecommendedItem[item:3890,
>> value:4.6771617], RecommendedItem[item:127, value:4.660716],
>> RecommendedItem[item:3323, value:4.660716], RecommendedItem[item:3123,
>> value:4.603366], RecommendedItem[item:3233, value:4.5707765],
>> RecommendedItem[item:1434, value:4.553473], RecommendedItem[item:989,
>> value:4.5263577], RecommendedItem[item:2343, value:4.524066]]
>>
>> Again, almost the same timings. What I also don't understand is why I am
>> getting different RecommendedItems.
>> That really frustrates me…
>>
>> You can find the Java file in the attachment.
>>
>> Greetings from Germany,
>> Najum
>>
>> On 17.04.2014 at 11:44, Sebastian Schelter <s...@apache.org> wrote:
>>
>> Yes, just to make sure the problem is in the Mahout code and not in the
>> surrounding environment.
>>
>> On 04/17/2014 11:43 AM, Najum Ali wrote:
>>
>> @Sebastian
>> What do you mean by a standalone recommender? A simple offline Java main
>> program?
>>
>> On 17.04.2014 at 11:41, Sebastian Schelter <s...@apache.org> wrote:
>>
>> Could you take the output of the precomputation, feed it into a standalone
>> recommender and test it there?
>>
>>
>> On 04/17/2014 11:37 AM, Najum Ali wrote:
>>
>> @Sebastian
>>
>> Are you sure that the precomputation is done only once and not in every
>> request?
>>
>> Yes, a @Bean-annotated object in Spring is a singleton instance by
>> default.
>> I also just tested it using a System.out.println().
>> Here is my log:
>>
>> System.out.println("----> precomputation done!") is called before returning
>> the GenericItemSimilarity.
>>
>> The first two recommendations are item-based -> Pearson similarity.
>> The third and fourth logs are also item-based, using the precomputed similarity.
>> The last log is the user-based recommender using Pearson.
>>
>> Look at the huge time difference!
>>
>> On 17.04.2014 at 11:23, Sebastian Schelter <s...@apache.org> wrote:
>>
>> Najum,
>>
>> this is really strange; feeding an item-based recommender with precomputed
>> similarities should give you superfast recommendations.
>>
>> Are you sure that the precomputation is done only once and not in every
>> request?
>>
>> --sebastian
>>
>> On 04/17/2014 11:17 AM, Najum Ali wrote:
>>
>> Hi guys,
>>
>> I have created a precomputed item-item similarity collection for a
>> GenericItemBasedRecommender.
>> Using the 1M MovieLens data, my item-based recommender is only 40-50%
>> faster than without precomputation (about 589.5 ms instead of 1222.9 ms).
>> But the user-based recommender is really fast, about 24.2 ms. How can
>> this happen?
>>
>> Here are more details on my implementation:
>>
>> CSV file: 1M prefs, 6040 users, 3706 items
>>
>> For my implementation I'm using screenshots, because of the nice
>> syntax highlighting.
>> My recommender runs inside a web server (Jetty) using Spring 4 and Java 8. I
>> receive recommendations as a web service (JSON).
>>
>> For the DataModel, I'm using FileDataModel.
>>
>>
>> The code below creates a precomputed ItemSimilarity when I start the
>> web server and the property isItemPreComputationEnabled is set to true:
>>
>>
>> For time measurement I'm using AOP. I measure the whole time from
>> entering my controller to sending the response, based on
>> System.nanoTime() and taking the difference. It's the same time
>> measurement for user-based.
>>
>> I have tried caching the recommender and the similarity, with no big
>> difference. I also tried using CandidateItemsStrategy and
>> MostSimilarItemsCandidateItemsStrategy, but also got no performance boost.
>>
>> public RecommenderBuilder createRecommenderBuilder(ItemSimilarity similarity)
>>     throws TasteException {
>>   final int numberOfUsers = dataModel.getNumUsers();
>>   final int numberOfItems = dataModel.getNumItems();
>>   CandidateItemsStrategy candidateItemsStrategy =
>>       new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
>>   MostSimilarItemsCandidateItemsStrategy mostSimilarStrategy =
>>       new SamplingCandidateItemsStrategy(numberOfUsers, numberOfItems);
>>   return model -> new GenericItemBasedRecommender(model,
>>       similarity, candidateItemsStrategy, mostSimilarStrategy);
>> }
>>
>> I don't know why item-based is taking so much longer than user-based.
>> User-based is fast as hell. I even tried datasets with 100k prefs and
>> 10 million (MovieLens). Every time, user-based is so much faster for any
>> similarity.
>>
>> I hope anyone can help me understand this. Maybe I'm doing something
>> wrong.
>>
>> Thanks!! :))