Re: Spark 2.x/scala 2.11.x release

2018-02-28 Thread Pat Ferrel
big +1

If you are planning to branch off the 0.13.0 tag let me know, I have a
speedup that is in my scala 2.11 fork of 0.13.0 that needs to be released


From: Andrew Palumbo  
Reply: dev@mahout.apache.org  
Date: February 28, 2018 at 11:16:12 AM
To: dev@mahout.apache.org  
Subject:  Spark 2.x/scala 2.11.x release

After some offline discussion regarding people's needs for Spark and 2.x
and Scala 2.11.x, I am wondering If we should just consider a release for
2.x and 2.11.x as the default. We could release from the current master, or
branch back off of the 0.13.0 tag, and release that with the upgraded
defaults, and branch our current multi-artifact build off as a feature. Any
thoughts on this?


--andy


Re: Dynamically change parameter list

2018-02-12 Thread Pat Ferrel
That would be fine since the model can contain anything. But the real question 
is where you want to use those params. If you need to use them the next time 
you train, you’ll have to persist them to a place read during training. That is 
usually only the metadata store (obviously input events too), which has the 
contents of engine.json. So to get them into the metadata store you may have to 
alter engine.json. 

Unless someone else knows how to alter the metadata directly after `pio train`

One problem is that you will never know what the new params are without putting 
them in a file or logging them. We keep them in a separate place and merge them 
with engine.json explicitly so we can see what is happening. They are 
calculated parameters, not hand made tunings. It seems important to me to keep 
those separate unless you are talking about some type of expected reinforcement 
learning, not really params but an evolving model.
 

On Feb 12, 2018, at 2:48 PM, Tihomir Lolić <tihomir.lo...@gmail.com> wrote:

Thank you very much for the answer. I'll try with customizing workflow. There 
is a step where Seq of models is returned. My idea is to return model and model 
parameters in this step. I'll let you know if it works.

Thanks,
Tihomie

On Feb 12, 2018 23:34, "Pat Ferrel" <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
This is an interesting question. As we make more mature full featured engines 
they will begin to employ hyper parameter search techniques or reinforcement 
params. This means that there is a new stage in the workflow or a feedback loop 
not already accounted for.

Short answer is no, unless you want to re-write your engine.json after every 
train and probably keep the old one for safety. You must re-train to get the 
new params put into the metastore and therefor available to your engine.

What we do for the Universal Recommender is have a special new workflow phase, 
call it a self-tuning phase, where we search for the right tuning of 
parameters. This it done with code that runs outside of pio and creates 
parameters that go into the engine.json. This can be done periodically to make 
sure the tuning is still optimal.

Not sure whether feedback or hyper parameter search is the best architecture 
for you.


From: Tihomir Lolić <tihomir.lo...@gmail.com> <mailto:tihomir.lo...@gmail.com>
Reply: user@predictionio.apache.org <mailto:user@predictionio.apache.org> 
<user@predictionio.apache.org> <mailto:user@predictionio.apache.org>
Date: February 12, 2018 at 2:02:48 PM
To: user@predictionio.apache.org <mailto:user@predictionio.apache.org> 
<user@predictionio.apache.org> <mailto:user@predictionio.apache.org>
Subject:  Dynamically change parameter list 

> Hi,
> 
> I am trying to figure out how to dynamically update algorithm parameter list. 
> After the train is finished only model is updated. The reason why I need this 
> data to be updated is that I am creating data mapping based on the training 
> data. Is there a way to update this data after the train is done?
> 
> Here is the code that I am using. The variable that and should be updated 
> after the train is marked bold red.
> 
> import io.prediction.controller.{EmptyParams, EngineParams}
> import io.prediction.data.storage.EngineInstance
> import io.prediction.workflow.CreateWorkflow.WorkflowConfig
> import io.prediction.workflow._
> import org.apache.spark.ml.linalg.SparseVector
> import org.joda.time.DateTime
> import org.json4s.JsonAST._
> 
> import scala.collection.mutable
> 
> object TrainApp extends App {
> 
>   val envs = Map("FOO" -> "BAR")
> 
>   val sparkEnv = Map("spark.master" -> "local")
> 
>   val sparkConf = Map("spark.executor.extraClassPath" -> ".")
> 
>   val engineFactoryName = "LogisticRegressionEngine"
> 
>   val workflowConfig = WorkflowConfig(
> engineId = EngineConfig.engineId,
> engineVersion = EngineConfig.engineVersion,
> engineVariant = EngineConfig.engineVariantId,
> engineFactory = engineFactoryName
>   )
> 
>   val workflowParams = WorkflowParams(
> verbose = workflowConfig.verbosity,
> skipSanityCheck = workflowConfig.skipSanityCheck,
> stopAfterRead = workflowConfig.stopAfterRead,
> stopAfterPrepare = workflowConfig.stopAfterPrepare,
> sparkEnv = WorkflowParams().sparkEnv ++ sparkEnv
>   )
> 
>   WorkflowUtils.modifyLogging(workflowConfig.verbose)
> 
>   val dataSourceParams = DataSourceParams(sys.env.get("APP_NAME").get)
>   val preparatorParams = EmptyParams()
> 
>   val algorithmParamsList = Seq("Logistic" -> LogisticParams(columns = 
> Array[String](),
>

Re: Dynamically change parameter list

2018-02-12 Thread Pat Ferrel
This is an interesting question. As we make more mature full featured
engines they will begin to employ hyper parameter search techniques or
reinforcement params. This means that there is a new stage in the workflow
or a feedback loop not already accounted for.

Short answer is no, unless you want to re-write your engine.json after
every train and probably keep the old one for safety. You must re-train to
get the new params put into the metastore and therefor available to your
engine.

What we do for the Universal Recommender is have a special new workflow
phase, call it a self-tuning phase, where we search for the right tuning of
parameters. This it done with code that runs outside of pio and creates
parameters that go into the engine.json. This can be done periodically to
make sure the tuning is still optimal.

Not sure whether feedback or hyper parameter search is the best
architecture for you.


From: Tihomir Lolić  
Reply: user@predictionio.apache.org 

Date: February 12, 2018 at 2:02:48 PM
To: user@predictionio.apache.org 

Subject:  Dynamically change parameter list

Hi,

I am trying to figure out how to dynamically update algorithm parameter
list. After the train is finished only model is updated. The reason why I
need this data to be updated is that I am creating data mapping based on
the training data. Is there a way to update this data after the train is
done?

Here is the code that I am using. The variable that and should be updated
after the train is marked *bold red.*

import io.prediction.controller.{EmptyParams, EngineParams}
import io.prediction.data.storage.EngineInstance
import io.prediction.workflow.CreateWorkflow.WorkflowConfig
import io.prediction.workflow._
import org.apache.spark.ml.linalg.SparseVector
import org.joda.time.DateTime
import org.json4s.JsonAST._

import scala.collection.mutable

object TrainApp extends App {

  val envs = Map("FOO" -> "BAR")

  val sparkEnv = Map("spark.master" -> "local")

  val sparkConf = Map("spark.executor.extraClassPath" -> ".")

  val engineFactoryName = "LogisticRegressionEngine"

  val workflowConfig = WorkflowConfig(
engineId = EngineConfig.engineId,
engineVersion = EngineConfig.engineVersion,
engineVariant = EngineConfig.engineVariantId,
engineFactory = engineFactoryName
  )

  val workflowParams = WorkflowParams(
verbose = workflowConfig.verbosity,
skipSanityCheck = workflowConfig.skipSanityCheck,
stopAfterRead = workflowConfig.stopAfterRead,
stopAfterPrepare = workflowConfig.stopAfterPrepare,
sparkEnv = WorkflowParams().sparkEnv ++ sparkEnv
  )

  WorkflowUtils.modifyLogging(workflowConfig.verbose)

  val dataSourceParams = DataSourceParams(sys.env.get("APP_NAME").get)
  val preparatorParams = EmptyParams()

  *val algorithmParamsList = Seq("Logistic" -> LogisticParams(columns =
Array[String](),*
*  dataMapping
= Map[String, Map[String, SparseVector]]()))*
  val servingParams = EmptyParams()

  val engineInstance = EngineInstance(
id = "",
status = "INIT",
startTime = DateTime.now,
endTime = DateTime.now,
engineId = workflowConfig.engineId,
engineVersion = workflowConfig.engineVersion,
engineVariant = workflowConfig.engineVariant,
engineFactory = workflowConfig.engineFactory,
batch = workflowConfig.batch,
env = envs,
sparkConf = sparkConf,
dataSourceParams =
JsonExtractor.paramToJson(workflowConfig.jsonExtractor,
workflowConfig.engineParamsKey -> dataSourceParams),
preparatorParams =
JsonExtractor.paramToJson(workflowConfig.jsonExtractor,
workflowConfig.engineParamsKey -> preparatorParams),
algorithmsParams =
JsonExtractor.paramsToJson(workflowConfig.jsonExtractor,
algorithmParamsList),
servingParams = JsonExtractor.paramToJson(workflowConfig.jsonExtractor,
workflowConfig.engineParamsKey -> servingParams)
  )

  val (engineLanguage, engineFactory) =
WorkflowUtils.getEngine(engineInstance.engineFactory,
getClass.getClassLoader)

  val engine = engineFactory()

  val engineParams = EngineParams(
dataSourceParams = dataSourceParams,
preparatorParams = preparatorParams,
algorithmParamsList = algorithmParamsList,
servingParams = servingParams
  )

  val engineInstanceId = CreateServer.engineInstances.insert(engineInstance)

  CoreWorkflow.runTrain(
env = envs,
params = workflowParams,
engine = engine,
engineParams = engineParams,
engineInstance = engineInstance.copy(id = engineInstanceId)
  )

  CreateServer.actorSystem.shutdown()
}


Thank you,
Tihomir


Re: pio train on Amazon EMR

2018-02-05 Thread Pat Ferrel
I agree, we looked at using EMR and found that we liked some custom Terraform + 
Docker much better. The existing EMR defined by AWS requires refactoring PIO or 
using it in yarn’s cluster mode. EMR is not meant to host any application code 
except what is sent into Spark in serialized form. However PIO expects to run 
the Spark “Driver” in the PIO process, which means on the PIO server machine. 

It is possible to make PIO use yarn’s cluster mode to serialize the “Driver” 
too but this is fairly complicated. I think I’ve seen Donald explain it before 
but we chose not to do this. For one thing optimizing and tuning yarn managed 
Spark changes the meaning of some tuning parameters.

Spark is moving to Kubernetes as a replacement for Yarn so we are quite 
interested in following that line of development.

One last thought on EMR: It was designed originally for Hadoop’s MapReduce. 
That meant that for a long time you couldn’t get big memory machines in EMR 
(you can now). So the EMR team in AWS does not seem to target Spark or other 
clustered services as well as they could. This is another reason we decided it 
wasn’t worth the trouble.


From: Mars Hall 
Reply: user@predictionio.apache.org 
Date: February 5, 2018 at 11:45:46 AM
To: user@predictionio.apache.org 
Subject:  Re: pio train on Amazon EMR  

Hi Malik,

This is a topic I've been investigating as well.

Given how EMR manages its clusters & their runtime, I don't think hacking 
configs to make the PredictionIO host act like a cluster member will be a 
simple or sustainable approach.

PredictionIO already operates Spark by building `spark-submit` commands.
  
https://github.com/apache/predictionio/blob/df406bf92463da4a79c8d84ec0ca439feaa0ec7f/tools/src/main/scala/org/apache/predictionio/tools/Runner.scala#L313

Implementing a new AWS EMR command runner in PredictionIO, so that we can 
switch `pio train` from the existing, plain `spark-submit` command to using the 
AWS CLI, `aws emr add-steps --steps Args=spark-submit` would likely solve a big 
part of this problem.
  https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html

Also, uploading the engine assembly JARs (the job code to run on Spark) to the 
cluster members or S3 for access from the EMR Spark runtime will be another 
part of this challenge.

On Mon, Feb 5, 2018 at 5:29 AM, Malik Twain  wrote:
I'm trying to run pio train with Amazon EMR. I copied core-site.xml and 
yarn-site.xml from EMR to my training machine, and configured HADOOP_CONF_DIR 
in pio-env.sh accordingly.

I'm running pio train as below:

pio train -- --master yarn --deploy-mode cluster

It's failing with the following errors:

18/02/05 11:56:15 INFO Client: 
   client token: N/A
   diagnostics: Application application_1517819705059_0007 failed 2 times due 
to AM Container for appattempt_1517819705059_0007_02 exited with  exitCode: 
1
Diagnostics: Exception from container-launch.

And below are the errors from EMR stdout and stderr respectively:

java.io.FileNotFoundException: /root/pio.log (Permission denied)
[ERROR] [CreateWorkflow$] Error reading from file: File 
file:/quickstartapp/MyExample/engine.json does not exist. Aborting workflow.

Thank you.



--
*Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California

Re: Frequent Pattern Mining - No engine found. Your build might have failed. Aborting.

2018-02-01 Thread Pat Ferrel
This list is for support of ActionML products, not general PIO support. You can 
get that on the Apache PIO user mailing list, where I have forwarded this 
question.

Several uses of FPM are supported by the Universal Recommender, such as 
Shopping cart recommendations. That is a template we support.


From: dee...@infosoftni.com 
Date: February 1, 2018 at 2:51:01 AM
To: actionml-user 
Subject:  Frequent Pattern Mining - No engine found. Your build might have 
failed. Aborting.  

I am using Frequent pattern mining template and got following error. No engine 
found. 

Please advice. 


s5@AMOL-PATIL:~/Documents/DataSheet/Templates/pio-template-fpm$ pio build 
--verbose
[INFO] [Engine$] Using command 
'/home/s5/Documents/DataSheet/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/sbt/sbt'
 at /home/s5/Documents/DataSheet/Templates/pio-template-fpm to build.
[INFO] [Engine$] If the path above is incorrect, this process will fail.
[INFO] [Engine$] Uber JAR disabled. Making sure 
lib/pio-assembly-0.12.0-incubating.jar is absent.
[INFO] [Engine$] Going to run: 
/home/s5/Documents/DataSheet/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/sbt/sbt
  package assemblyPackageDependency in 
/home/s5/Documents/DataSheet/Templates/pio-template-fpm
[INFO] [Engine$] [info] Loading project definition from 
/home/s5/Documents/DataSheet/Templates/pio-template-fpm/project
[INFO] [Engine$] [info] Set current project to pio-template-text-clustering (in 
build file:/home/s5/Documents/DataSheet/Templates/pio-template-fpm/)
[INFO] [Engine$] [success] Total time: 1 s, completed 1 Feb, 2018 4:13:41 PM
[INFO] [Engine$] [info] Including from cache: scala-library.jar
[INFO] [Engine$] [info] Checking every *.class/*.jar file's SHA-1.
[INFO] [Engine$] [info] Merging files...
[INFO] [Engine$] [warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[INFO] [Engine$] [warn] Strategy 'discard' was applied to a file
[INFO] [Engine$] [info] Assembly up to date: 
/home/s5/Documents/DataSheet/Templates/pio-template-fpm/target/scala-2.10/pio-template-text-clustering-assembly-0.1-SNAPSHOT-deps.jar
[INFO] [Engine$] [success] Total time: 1 s, completed 1 Feb, 2018 4:13:42 PM
[INFO] [Engine$] Compilation finished successfully.
[INFO] [Engine$] Looking for an engine...
[ERROR] [Engine$] No engine found. Your build might have failed. Aborting.
s5@AMOL-PATIL:~/Documents/DataSheet/Templates/pio-template-fpm$


--
You received this message because you are subscribed to the Google Groups 
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/actionml-user/f193dd54-85a7-4598-88fe-fb7c74644f11%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: PIO error

2018-01-23 Thread Pat Ferrel
Unfortunately I can’t possibly guess without more information.

What do the logs say when pio cannot be started? Are all these pio
instances separate, not in a cluster? In other words does each pio server
have all necessary services running on them? I assume none is sleeping like
a laptop does?

I you are worrying, when properly configured PIO is quite stable on servers
that do not sleep. I have never seen a bug that would cause this and have
installed it hundreds of time so lets look through logs and check your
pio-env.sh on a particular machine that is having this problem.


From: bala vivek  
Date: January 22, 2018 at 11:32:17 PM
To: actionml-user 

Subject:  Re: PIO error

Hi Pat,

The PIO has installed on the Ubuntu server, the Dev server and
production servers are hosted in other countries and we are connecting
through VPN from my laptop.
And yes if I do a pio-start-all and pio-stop-all resolves the issue always,
but this issue is re-occurring often and sometimes the PIO service is not
coming up even after multiple Pio restart.

Not sure with the core reason why the service is often getting down.

Regards,
Bala

On Tuesday, January 23, 2018 at 2:47:26 AM UTC+5:30, pat wrote:
>
> If you are using a laptop for a dev machine, when it sleeps it can
> interfere with Zookeeper, which is started and used by HBase. So
> pio-stop-all then pio-start-all restarts HBase and therefor Zookeeper
> gracefully to solve this.
>
> Does the stop/start always solve this?
>
>
>
> From: bala vivek 
> Date: January 21, 2018 at 10:39:31 PM
> To: actionml-user 
> Subject:  PIO error
>
> Hi,
>
> I'm getting the following error in pio.
>
> pio status gives me the below result,
>
> [INFO] [Console$] Inspecting PredictionIO...
> [INFO] [Console$] PredictionIO 0.10.0-incubating is installed at
> /opt/tools/PredictionIO-0.10.0-incubating
> [INFO] [Console$] Inspecting Apache Spark...
> [INFO] [Console$] Apache Spark is installed at
> /opt/tools/PredictionIO-0.10.0-incubating/vendors/spark-1.
> 6.3-bin-hadoop2.6
> [INFO] [Console$] Apache Spark 1.6.3 detected (meets minimum requirement
> of 1.3.0)
> [INFO] [Console$] Inspecting storage backend connections...
> [INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
> [INFO] [Storage$] Verifying Model Data Backend (Source: LOCALFS)...
> [INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
> [INFO] [Storage$] Test writing to Event Store (App Id 0)...
> [ERROR] [Console$] Unable to connect to all storage backends successfully.
> The following shows the error message from the storage backend.
> [ERROR] [Console$] Failed after attempts=1, exceptions:
> Mon Jan 22 01:00:02 EST 2018, org.apache.hadoop.hbase.
> client.RpcRetryingCaller@5c5d6175, org.apache.hadoop.hbase.ipc.
> RemoteWithExtrasException(org.apache.hadoop.hbase.PleaseHoldException):
> org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
>at org.apache.hadoop.hbase.master.HMaster.
> checkInitialized(HMaster.java:2293)
>at org.apache.hadoop.hbase.master.HMaster.checkNamespaceManagerReady(
> HMaster.java:2298)
>at org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(
> HMaster.java:2536)
>at org.apache.hadoop.hbase.master.MasterRpcServices.
> listNamespaceDescriptors(MasterRpcServices.java:1100)
>at org.apache.hadoop.hbase.protobuf.generated.
> MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:55734)
>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2180)
>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
>at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
> RpcExecutor.java:133)
>at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
>at java.lang.Thread.run(Thread.java:748)
>
>  (org.apache.hadoop.hbase.client.RetriesExhaustedException)
> [ERROR] [Console$] Dumping configuration of initialized storage backend
> sources. Please make sure they are correct.
> [ERROR] [Console$] Source Name: ELASTICSEARCH; Type: elasticsearch;
> Configuration: TYPE -> elasticsearch, HOME -> /opt/tools/PredictionIO-0.10.
> 0-incubating/vendors/elasticsearch-1.7.3
> [ERROR] [Console$] Source Name: LOCALFS; Type: localfs; Configuration:
> PATH -> /root/.pio_store/models, TYPE -> localfs
> [ERROR] [Console$] Source Name: HBASE; Type: hbase; Configuration: TYPE ->
> hbase, HOME -> /opt/tools/PredictionIO-0.10.0-incubating/vendors/hbase-1.
> 2.4
>
>
> This setup is running in our production and this is not a new setup. Often
> I get this error and if do a pio-stop-all and pio-start-all, pio will work
> fine.
> But why often the pio status is showing error. There was no new
> configuration changes made in the pio-envi.sh file
> --
> You received this message because you are subscribed to the Google Groups
> "actionml-user" group.
> To 

Re: Prediction IO install failed in Linux

2018-01-23 Thread Pat Ferrel
This would be very difficult to do. Even if you used a machine connected to
the internet to download things like pio, spark, etc. the very build tools
used (sbt) expect to be able to get code from various repositories on the
internet. To build templates would further complicate this since each
template may have different needs.

Perhaps you can take a laptop home, install and build, take it back to work
with all needed code installed. In order to use open source software it is
virtually impossible to work without access to the internet.


From: Praveen Prasannakumar 

Reply: user@predictionio.apache.org 

Date: January 23, 2018 at 7:03:27 AM
To: user@predictionio.apache.org 

Subject:  Re: Prediction IO install failed in Linux

Team - Is there a way to install predictio io offline ? If yes , Can
someone provide some documents for it ?

Thanks
Praveen

On Fri, Jan 19, 2018 at 11:05 AM, Praveen Prasannakumar <
praveen2399wo...@gmail.com> wrote:

> Hello Team
>
> I am trying to install prediction IO in one of our linux box with in
> office network. My company network have firewall and sometimes it wont
> connect to outside servers. I am not sure whether that is the reason on
> failure while executing make-distribution.sh script. Can you please help me
> to figure out how can I install prediction IO with in my office network ?
>
> Attaching the screenshot with error.
>
> ​
>
> Thanks
> Praveen
>


ii_jclhr7su0_1610ce9f32410c38
Description: Binary data


Re: Need Help Setting up prediction IO

2018-01-17 Thread Pat Ferrel
PIO uses Postgres, MySQL or other JDBC database from the SQL DBs or (and I 
always use this) HBase. Hbase is a high performance NoSQL DB that scales 
indefinitely.

It is possible to use any DB if you write an EventStore class for it, wrapping 
the DB calls with a virtualization API that is DB independent.

Memory is completely algorithm and data dependent but expect PIO, which uses 
Sparkm which in turn gets it’s speed from keeping data in-memory, to use a lot 
compared to a web server. PIO apps are often in the big data category and many 
deployments require Spark clusters with many G per machine. It is rare to be 
able to run PIO in production on a single machine.

Welcome to big data.


On Jan 11, 2018, at 6:23 PM, Rajesh Jangid <raje...@grazitti.com> wrote:

Hi, 
Well with version PIO 10 I think some dependency is causing trouble in 
linux, we have figured out a way using Pio for now, and everything is working 
great. 
  Thanks for the support though. 

Few question-
1.Does Pio latest support Mongodb or NoSQL?
2.Memory uses by Pio, Is there any max memory limit set, If need be can it be 
set? 


Thanks
Rajesh 


On Jan 11, 2018 10:25 PM, "Pat Ferrel" <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
The version in the artifact built by Scala should only have the major version 
number so 2.10 or 2.11. PIO 0.10.0 needs 2.10.  Where, and what variable did 
you set to 2.10.4? That is the problem. There will never be a lib built for 
2.10.4, it will always be 2.10.



On Jan 11, 2018, at 5:15 AM, Daniel O' Shaughnessy <danieljamesda...@gmail.com 
<mailto:danieljamesda...@gmail.com>> wrote:

Basically you need to make sure all your lib dependencies in build.sbt work 
together. 

On Thu, 11 Jan 2018 at 13:14 Daniel O' Shaughnessy <danieljamesda...@gmail.com 
<mailto:danieljamesda...@gmail.com>> wrote:
Maybe try v2.10.4 based on this line:

[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4

I'm unfamiliar with the ubuntu setup for pio so can't help you there I'm afraid.

On Thu, 11 Jan 2018 at 05:08 Rajesh Jangid <raje...@grazitti.com 
<mailto:raje...@grazitti.com>> wrote:
I am trying to run this on ubuntu 16.04

On Thu, Jan 11, 2018 at 10:36 AM, Rajesh Jangid <raje...@grazitti.com 
<mailto:raje...@grazitti.com>> wrote:
Hi, 
  I have tried once again with 2.10 as well but getting following dependency 
error

[INFO] [Console$] [error] Modules were resolved with conflicting cross-version 
suffixes in 
{file:/home/integration/client/PredictionIO-0.10/Engines/MyRecommendation/}myrecommendation:
[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4
[INFO] [Console$] java.lang.RuntimeException: Conflicting cross-version 
suffixes in: com.chuusai:shapeless
[INFO] [Console$] at scala.sys.package$.error(package.scala:27)
[INFO] [Console$] at 
sbt.ConflictWarning$.processCrossVersioned(ConflictWarning.scala:46)
[INFO] [Console$] at sbt.ConflictWarning$.apply(ConflictWarning.scala:32)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1300)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1297)
[INFO] [Console$] at 
scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
[INFO] [Console$] at 
sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
[INFO] [Console$] at sbt.std.Transform$$anon$4.work 
<http://4.work/>(System.scala:63)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
[INFO] [Console$] at sbt.Execute.work 
<http://sbt.execute.work/>(Execute.scala:237)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
[INFO] [Console$] at 
sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[INFO] [Console$] at java.lang.Thread.run(Thread.java:745)
[INFO] [Console$] [error] (*:update) Conflicting cross-version suffixes in: 
com.chuusai:shapeless
[INFO] [Console$] [error] Total time: 6 s, completed Jan 11, 2018 5:03:51 AM
[ERROR] [Console$] Return code of previous step i

The Universal Recommender v0.7.0

2018-01-17 Thread Pat Ferrel
We have been waiting to release the UR v0.7.0 for testing (done) and the 
release of Mahout v0.13.1 (not done) Today we have released the UR v0.7.0 
anyway. This comes with:
Support for PIO v0.12.0
Requires Scala 2.11 (can be converted to use Scala 2.10 but it’s a manual 
process)
Requires Elasticsearch 5.X, and uses the REST client exclusively. This enables 
Elasticsearch authentication if needed.
Speed improvements for queries (ES 5.x is faster) and model building (a 
snapshot build of Mahout includes speedups)
Requires a source build of Mahout from a version forked by ActionML. This 
requirement will be removed as soon as Mahout releases v0.13.1, which will be 
incorporated in UR v0.7.1 asap. Follow special build instructions in the UR’s 
README.md.
Fixes a bug in the business rules for excluding items with certain properties

Report issues on the GitHub repo here: 
https://github.com/actionml/universal-recommender 
 get tag v0.7.0 for `pio 
build` and be sure to read the instructions and warnings on the README.md there.

Ask questions on the Google Group here: 
https://groups.google.com/forum/#!forum/actionml-user 
 or on the PIO user list.

Re: Need Help Setting up prediction IO

2018-01-11 Thread Pat Ferrel
The version in the artifact built by Scala should only have the major version 
number so 2.10 or 2.11. PIO 0.10.0 needs 2.10.  Where, and what variable did 
you set to 2.10.4? That is the problem. There will never be a lib built for 
2.10.4, it will always be 2.10.



On Jan 11, 2018, at 5:15 AM, Daniel O' Shaughnessy  
wrote:

Basically you need to make sure all your lib dependencies in build.sbt work 
together. 

On Thu, 11 Jan 2018 at 13:14 Daniel O' Shaughnessy > wrote:
Maybe try v2.10.4 based on this line:

[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4

I'm unfamiliar with the ubuntu setup for pio so can't help you there I'm afraid.

On Thu, 11 Jan 2018 at 05:08 Rajesh Jangid > wrote:
I am trying to run this on ubuntu 16.04

On Thu, Jan 11, 2018 at 10:36 AM, Rajesh Jangid > wrote:
Hi, 
  I have tried once again with 2.10 as well but getting following dependency 
error

[INFO] [Console$] [error] Modules were resolved with conflicting cross-version 
suffixes in 
{file:/home/integration/client/PredictionIO-0.10/Engines/MyRecommendation/}myrecommendation:
[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4
[INFO] [Console$] java.lang.RuntimeException: Conflicting cross-version 
suffixes in: com.chuusai:shapeless
[INFO] [Console$] at scala.sys.package$.error(package.scala:27)
[INFO] [Console$] at 
sbt.ConflictWarning$.processCrossVersioned(ConflictWarning.scala:46)
[INFO] [Console$] at sbt.ConflictWarning$.apply(ConflictWarning.scala:32)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1300)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1297)
[INFO] [Console$] at 
scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
[INFO] [Console$] at 
sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
[INFO] [Console$] at sbt.std.Transform$$anon$4.work 
(System.scala:63)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
[INFO] [Console$] at sbt.Execute.work 
(Execute.scala:237)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
[INFO] [Console$] at 
sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[INFO] [Console$] at java.lang.Thread.run(Thread.java:745)
[INFO] [Console$] [error] (*:update) Conflicting cross-version suffixes in: 
com.chuusai:shapeless
[INFO] [Console$] [error] Total time: 6 s, completed Jan 11, 2018 5:03:51 AM
[ERROR] [Console$] Return code of previous step is 1. Aborting.


On Wed, Jan 10, 2018 at 10:03 PM, Daniel O' Shaughnessy 
> wrote:
I've pulled down this version without any modifications and run with pio v0.10 
on a mac and it builds with no issues.

However, when I add in scalaVersion := "2.11.8" to build.sbt I get a dependency 
error.

pio v0.10 supports scala 2.10 so you need to switch to this to run! 

On Wed, 10 Jan 2018 at 13:47 Rajesh Jangid > wrote:
Yes, v0.5.0

On Jan 10, 2018 7:07 PM, "Daniel O' Shaughnessy" > wrote:
Is this the template you're using? 

https://github.com/apache/predictionio-template-ecom-recommender 


On Wed, 10 Jan 2018 at 13:16 Rajesh Jangid > wrote:
Yes, 
We have dependency with elastic and we have elastic 1.4.4 already running. 
We Do not want to run another elastic instance.
Latest prediction IO does not support elastic 1.4.4


On Wed, Jan 10, 2018 at 6:25 PM, Daniel O' Shaughnessy 
> wrote:
Strangedo you absolutely need to run this with pio v0.10? 

On Wed, 10 Jan 2018 at 12:50 Rajesh Jangid 

Re: Using Dataframe API vs. RDD API?

2018-01-05 Thread Pat Ferrel
Yes and I do not recommend that because the EventServer schema is not a 
developer contract. It may change at any time. Use the conversion method and go 
through the PIO API to get the RDD then convert to DF for now.

I’m not sure what PIO uses to get an RDD from Postgres but if they do not use 
something like the lib you mention, a PR would be nice. Also if you have an 
interest in adding the DF APIs to the EventServer contributions are encouraged. 
Committers will give some guidance I’m sure—once that know more than me on the 
subject.

If you want to donate some DF code, create a Jira and we’ll easily find a 
mentor to make suggestions. There are many benefits to this including not 
having to support a fork of PIO through subsequent versions. Also others are 
interested in this too.

 

On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy <danieljamesda...@gmail.com> 
wrote:

Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to read in 
the RDD from a postgres DB initially.

This was you don't need to use an EventServer!

On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <danieljamesda...@gmail.com 
<mailto:danieljamesda...@gmail.com>> wrote:
Hi Shane, 

I've successfully used : 

import org.apache.spark.ml.classification.{ RandomForestClassificationModel, 
RandomForestClassifier }

with pio. You can access feature importance through the RandomForestClassifier 
also.

Very simple to convert RDDs to DFs as Pat mentioned, something like:

val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")



On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Actually there are libs that will read DFs from HBase 
https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
 
<https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html>

This is out of band with PIO and should not be used IMO because the schema of 
the EventStore is not guaranteed to remain as-is. The safest way is to 
translate or get DFs integrated to PIO. I think there is an existing Jira that 
request Spark ML support, which assumes DFs. 


On Jan 4, 2018, at 12:25 PM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:

Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from 
them that read out DFs though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohn...@gmail.com 
<mailto:shanewaldenjohn...@gmail.com>> wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters 
and calling new classes within DASE or is it much more involved development.

Thank You!
Shane Johnson | 801.360.3350 <tel:(801)%20360-3350>
LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook 
<https://www.facebook.com/shane.johnson.71653>




Re: Using Dataframe API vs. RDD API?

2018-01-04 Thread Pat Ferrel
Actually there are libs that will read DFs from HBase 
https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
 
<https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html>

This is out of band with PIO and should not be used IMO because the schema of 
the EventStore is not guaranteed to remain as-is. The safest way is to 
translate or get DFs integrated to PIO. I think there is an existing Jira that 
request Spark ML support, which assumes DFs. 


On Jan 4, 2018, at 12:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from 
them that read out DFs though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohn...@gmail.com 
<mailto:shanewaldenjohn...@gmail.com>> wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters 
and calling new classes within DASE or is it much more involved development.

Thank You!
Shane Johnson | 801.360.3350

LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook 
<https://www.facebook.com/shane.johnson.71653>



Re: Using Dataframe API vs. RDD API?

2018-01-04 Thread Pat Ferrel
Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from 
them that read out DFs though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson  wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters 
and calling new classes within DASE or is it much more involved development.

Thank You!
Shane Johnson | 801.360.3350

LinkedIn  | Facebook 



Re: Error: "unable to undeploy"

2018-01-03 Thread Pat Ferrel
The UR does not require more than one deploy (assuming the server runs 
forever). Retraining the UR automatically re-deploys the new model. 

All other Engines afaik do require retrain-redeploy.

Users should be aware that PIO is a framework that provides no ML function 
whatsoever. It supports a workflow but Engines are free to simplify or use it 
in different ways so always preface a question with what Engine you are using 
or asking about.



On Jan 3, 2018, at 4:33 AM, Noelia Osés Fernández  wrote:

Hi lokotochek,

You mentioned that it wasn't necessary to redeploy after retraining. However, 
today I have come across a PIO wepage that I hadn't seen before that tells me 
to redeploy after retraining (section 'Update Model with New Data'):

http://predictionio.incubator.apache.org/deploy/ 


Particularly, this page suggests adding the following line to the crontab to 
retrain every day:

0 0 * * *   $PIO_HOME/bin/pio train; $PIO_HOME/bin/pio deploy


Here it is clear that it is redeploying after retraining. So does it not 
actually hot-swap the model? Or the UR does but this page is more general for 
other templates 
that might not do that?

Thank for your help!



On 14 December 2017 at 15:57, Александр Лактионов > wrote:
Hi Noelia,
you dont have to redeploy your app after train. It will be hot-swapped and the 
previous procces (ran by pio deploy) will change recommendations automatically
> 14 дек. 2017 г., в 17:56, Noelia Osés Fernández  > написал(а):
> 
> Hi,
> 
> The first time after reboot that I train and deploy my PIO app everything 
> works well. However, if I then retrain and deploy again, I get the following 
> error: 
> 
> [INFO] [MasterActor] Undeploying any existing engine instance at 
> http://0.0.0.0:8000 
> [ERROR] [MasterActor] Another process might be occupying 0.0.0.0:8000 
> . Unable to undeploy.
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000 
> ]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000  failed
> [ERROR] [MasterActor] Bind failed. Retrying... (2 more trial(s))
> [WARN] [HttpListener] Bind to /0.0.0.0:8000  failed
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000 
> ]
> [ERROR] [MasterActor] Bind failed. Retrying... (1 more trial(s))
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000 
> ]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000  failed
> [ERROR] [MasterActor] Bind failed. Retrying... (0 more trial(s))
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000 
> ]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000  failed
> [ERROR] [MasterActor] Bind failed. Shutting down.
> 
> I thought it was possible to retrain an app that was running and then deploy 
> again.
> Is this not possible?
> 
> How can I kill the running instance?
> I've tried the trick in handmade's integration test but it doesn't work:
> 
> deploy_pid=`jps -lm | grep "onsole deploy" | cut -f 1 -d ' '`
> echo "Killing the deployed test PredictionServer"
> kill "$deploy_pid"
> 
> I still get the same error after doing this.
> 
> Any help is much appreciated.
> Best regards,
> Noelia
> 
> 
> 
> 
> 
> 
> 





Re: App still returns results after pio app data-delete

2018-01-02 Thread Pat Ferrel
BTW there is a new Chrome extension that lets you browse ES and create any JSON 
query. Just found it myself after Sense stopped working in Chrome. Try 
ElasticSearch Head, found in the Chrome store.


On Jan 2, 2018, at 9:53 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Have a look at the ES docs on their site. There are several ways, from sending 
a JSON command to deleting the data directory depending on how clean you want 
ES to be.

In general my opinion is that PIO is an integration framework for several 
services and for the majority of applications you will not need to deal 
directly with the services except for setup. This may be an exception. In all 
cases you may find it best to seek guidance from the support communities or 
docs of those services.

If you are sending a REST JSON command directive it would be as shown here: 
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html
 
<https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html>

$ curl -XDELETE 'http://localhost:9200// 
<http://localhost:9200/%3Cindex_name%3E/>'

The Index name is named in the UR engine.json or in pio-env depending on which 
index you want to delete.


On Jan 2, 2018, at 12:22 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Thanks for the explanation!

How do I delete the ES index? is it just DELETE /my_index_name?

Happy New Year!

On 22 December 2017 at 19:42, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
With PIO the model is managed by the user, not PIO. The input is separate and 
can be deleted without affecting the model.

Each Engine handles model’s it’s own way but most use the model storage in 
pio-env. So deleting those will get rid of the model. The UR keeps the model in 
ES under the “indexName” and “typeName” in engine.json. So you need to delete 
the index if you want to stop queries from working. The UR maintain’s one live 
copy of the model and removes old ones after a new one is made live so there 
will only ever be one model (unless you have changed your indexName often)


On Dec 21, 2017, at 4:58 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Hi all!

I have executed a pio app data-delete MyApp. The command has outputted the 
following:

[INFO] [HBLEvents] Removing table pio_event:events_4...
[INFO] [App$] Removed Event Store for the app ID: 4
[INFO] [HBLEvents] The table pio_event:events_4 doesn't exist yet. Creating 
now...
[INFO] [App$] Initialized Event Store for the app ID: 4


However, I executed

curl -H "Content-Type: application/json" -d '
{
}' http://localhost:8000/queries.json <http://localhost:8000/queries.json>

after deleting the data and I still get the same results as before deleting the 
data. Why is this happening?

I expected to get either an error message or an empty result like 
{"itemScores":[]}.

Any help is much appreciated.
Best regards,
Noelia




-- 
 <http://www.vicomtech.org/>

Noelia Osés Fernández, PhD
Senior Researcher | Investigadora Senior

no...@vicomtech.org <mailto:no...@vicomtech.org>
+[34] 943 30 92 30
Data Intelligence for Energy and Industrial Processes | Inteligencia de Datos 
para Energía y Procesos Industriales

 <https://www.linkedin.com/company/vicomtech>  
<https://www.youtube.com/user/VICOMTech>  <https://twitter.com/@Vicomtech_IK4>

member of:  <http://www.graphicsmedia.net/>

Legal Notice - Privacy policy <http://www.vicomtech.org/en/proteccion-datos>



Re: App still returns results after pio app data-delete

2018-01-02 Thread Pat Ferrel
Have a look at the ES docs on their site. There are several ways, from sending 
a JSON command to deleting the data directory depending on how clean you want 
ES to be.

In general my opinion is that PIO is an integration framework for several 
services and for the majority of applications you will not need to deal 
directly with the services except for setup. This may be an exception. In all 
cases you may find it best to seek guidance from the support communities or 
docs of those services.

If you are sending a REST JSON command directive it would be as shown here: 
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html
 
<https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html>

$ curl -XDELETE 'http://localhost:9200//'

The Index name is named in the UR engine.json or in pio-env depending on which 
index you want to delete.


On Jan 2, 2018, at 12:22 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thanks for the explanation!

How do I delete the ES index? is it just DELETE /my_index_name?

Happy New Year!

On 22 December 2017 at 19:42, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
With PIO the model is managed by the user, not PIO. The input is separate and 
can be deleted without affecting the model.

Each Engine handles model’s it’s own way but most use the model storage in 
pio-env. So deleting those will get rid of the model. The UR keeps the model in 
ES under the “indexName” and “typeName” in engine.json. So you need to delete 
the index if you want to stop queries from working. The UR maintain’s one live 
copy of the model and removes old ones after a new one is made live so there 
will only ever be one model (unless you have changed your indexName often)


On Dec 21, 2017, at 4:58 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Hi all!

I have executed a pio app data-delete MyApp. The command has outputted the 
following:

[INFO] [HBLEvents] Removing table pio_event:events_4...
[INFO] [App$] Removed Event Store for the app ID: 4
[INFO] [HBLEvents] The table pio_event:events_4 doesn't exist yet. Creating 
now...
[INFO] [App$] Initialized Event Store for the app ID: 4


However, I executed

curl -H "Content-Type: application/json" -d '
{
}' http://localhost:8000/queries.json <http://localhost:8000/queries.json>

after deleting the data and I still get the same results as before deleting the 
data. Why is this happening?

I expected to get either an error message or an empty result like 
{"itemScores":[]}.

Any help is much appreciated.
Best regards,
Noelia




-- 
 <http://www.vicomtech.org/>

Noelia Osés Fernández, PhD
Senior Researcher | Investigadora Senior

no...@vicomtech.org <mailto:no...@vicomtech.org>
+[34] 943 30 92 30
Data Intelligence for Energy and Industrial Processes | Inteligencia de Datos 
para Energía y Procesos Industriales

 <https://www.linkedin.com/company/vicomtech>  
<https://www.youtube.com/user/VICOMTech>  <https://twitter.com/@Vicomtech_IK4>

member of:  <http://www.graphicsmedia.net/>

Legal Notice - Privacy policy <http://www.vicomtech.org/en/proteccion-datos>


Re: Recommendation return score more than 5

2017-12-22 Thread Pat Ferrel
I did not write the template you are using. I am trying to explain what the 
template should be doing and how ALS works. I’m sure that with exactly the same 
data you should get the same results but in real life you will need to 
understand the algorithm a little deeper and so the pointer to the code that is 
being executed by the template from Spark MLlib.  If this is not helpful please 
ignore the advice.


On Dec 22, 2017, at 11:16 AM, GMAIL <babaevka...@gmail.com> wrote:

But I strictly followed the instructions from the site and did not change 
anything even. Everything I did was steps from this page. I did not perform any 
additional operations, including editing the source code.

Instruction (Quick Start - Recommendation Engine Template): 
http://predictionio.incubator.apache.org/templates/recommendation/quickstart/ 
<http://predictionio.incubator.apache.org/templates/recommendation/quickstart/>

2017-12-22 22:12 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
Implicit means you assign a score to the event based on your own guess. 
Explicit uses ratings the user makes. One score is a guess by you (like a 4 for 
buy) and the other is a rating made by the user. ALS comes in 2 flavors, one 
for explicit scoring, used to predict rating and the other for implicit scoring 
used to predict something the user will prefer. 

Make sure your template is using the explicit version of ALS. 
https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback
 
<https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback>


On Dec 21, 2017, at 11:09 PM, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:

I wanted to use the Recomender because I expected that it could predict the 
scores as it is done by MovieLens. And it seems to be doing so, but for some 
reason the input and output scale is different. In imported scores, from 1 to 
5, and in the predicted from 1 to 10.

If by implicit scores you mean events without parameters, then I am aware that 
in essence there is also an score. I watched the DataSource in Recommender and 
there were only two events: rate and buy. Rate takes an score, and the buy 
implicitly puts the rating at 4 (out of 5, as I think).

And I still did not understand exactly where to look for me and what to 
correct, so that incoming and predicted estimates were on the same scale.

2017-12-19 4:10 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
There are 2 types of MLlib ALS recommenders last I checked, implicit and 
explicit. Implicit ones you give any arbitrary score, like a 1 for purchase. 
The explicit one you can input ratings and it is expected to predict ratings 
for an individual. But both iirc also have a regularization parameter that 
affects the scoring and is a param so you have to experiment with it using 
cross-validation to see where you get the best results.

There is an old metric used for this type of thing called RMSE 
(root-mean-square error) which, when minimized will give you scores that most 
closely match actual scores in the hold-out set (see wikipedia on 
cross-validation and RMSE). You may have to use explicit ALS and tweak the 
regularization param, to get the lowest RMSE. I doubt anything will guarantee 
them to be in exactly the range of ratings so you’ll then need to pick the 
closest rating.


On Dec 18, 2017, at 10:42 AM, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:

That is, the predicted scores that the Recommender returns can not just be 
multiplied by two, but may be completely wrong? 
I can not, say, just divide the predictions by 2 and pretend that everything is 
fine?

2017-12-18 21:35 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
The UR and the Recommendations Template use very different technology 
underneath. 

In general the scores you get from recommenders are meaningless on their own. 
When using ratings as numerical values with a ”Matrix Factorization” 
recommender like the ones in MLlib, upon which the Recommendations Template is 
based need to have a regularization parameter. I don’t know for sure but maybe 
this is why the results don’t come in the range of input ratings. I haven’t 
looked at the code in a long while.

If you are asking about the UR it would not take numeric ratings and the scores 
cannot be compared to them.

For many reasons that I have written about before I always warn people about 
using ratings, which have been discontinued as a source of input for Netflix 
(who have removed them from their UX) and many other top recommender users. 
There are many reasons for this, not the least of which is that they are 
ambiguous and don’t directly relate to whether a user might like an item. For 
instance most video sources now use something 

Re: How to import item properties dynamically?

2017-12-22 Thread Pat Ferrel
The properties go into the Event Store immediately but you have to train to get 
them into the model, this assuming your template support item properties. If yo 
uare using the UR, the properties will not get into the model until the next 
`pio train…`


On Dec 22, 2017, at 3:37 AM, Noelia Osés Fernández  wrote:


Hi all,

I have a pio app and I need to update item properties regularly. However, not 
all items will have all properties always. So I want to update the properties 
dynamically doing something similiar to the following:

# create properties json
propertiesjson = '{'
if "tiempo" in dfcolumns:
    propertiesjson = propertiesjson + '"tiempo": ' + str(int(plan.tiempo))
if "duracion" in dfcolumns:
    propertiesjson = propertiesjson + ', "duracion": ' + str(plan.duracion)
propertiesjson = propertiesjson + '}'

# add event
client.create_event(
    event="$set",
    entity_type="item",
    entity_id=plan.id_product,
    properties=json.dumps(propertiesjson)
)


However, this results in an error message:


Traceback (most recent call last):
  File "import_itemproperties.py", line 110, in 
import_events(client, args.dbuser, args.dbpasswd, args.dbhost, args.dbname)
  File "import_itemproperties.py", line 73, in import_events
properties=json.dumps(propertiesjson)
  File 
"/home/ubuntu/.local/lib/python2.7/site-packages/predictionio/__init__.py", 
line 255, in create_event
event_time).get_response()
  File 
"/home/ubuntu/.local/lib/python2.7/site-packages/predictionio/connection.py", 
line 111, in get_response
self._response = self.rfunc(tmp_response)
  File 
"/home/ubuntu/.local/lib/python2.7/site-packages/predictionio/__init__.py", 
line 130, in _acreate_resp
response.body))
predictionio.NotCreatedError: request: POST 
/events.json?accessKey=0Hys1qwfgo3vF16jElBDJJnSLmrkN5Tg86qAPqepYPK_-lXMqI4NMjLXaBGgQJ4U
 {'entityId': 8, 'entityType': 'item', 'properties': '"{\\"tiempo\\": 2, 
\\"duracion\\": 60}"', 'event': '$set', 'eventTime': 
'2017-12-22T11:29:59.762+'} 
/events.json?accessKey=0Hys1qwfgo3vF16jElBDJJnSLmrkN5Tg86qAPqepYPK_-lXMqI4NMjLXaBGgQJ4U?entityId=8=item=%22%7B%5C%22tiempo%5C%22%3A+2%2C+%5C%22duracion%5C%22%3A+60%2C=%24set=2017-12-22T11%3A29%3A59.762%2B
 status: 400 body: {"message":"org.json4s.package$MappingException: Expected 
object but got JString(\"{\\\"tiempo\\\": 2, \\\"duracion\\\": 60}\")"}


Any help is much appreciated!
Season's greetings!
Noelia
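
A side note on the traceback above: "Expected object but got JString" typically 
means the properties were JSON-encoded twice, once when hand-building the string 
and again by json.dumps. Below is a minimal sketch of the same update with the 
properties built as a plain dict; the access key and row values are placeholders, 
and the dfcolumns check mirrors the snippet above.

import predictionio

# Placeholders; substitute the real access key and event server URL.
client = predictionio.EventClient(
    access_key="YOUR_ACCESS_KEY",
    url="http://localhost:7070")

dfcolumns = ["tiempo", "duracion"]            # columns present for this item
row = {"id_product": 8, "tiempo": 2, "duracion": 60}

# Build the properties as a dict and let the SDK serialize it exactly once.
properties = {}
if "tiempo" in dfcolumns:
    properties["tiempo"] = int(row["tiempo"])
if "duracion" in dfcolumns:
    properties["duracion"] = row["duracion"]

client.create_event(
    event="$set",
    entity_type="item",
    entity_id=str(row["id_product"]),
    properties=properties)                    # a dict, not json.dumps(a JSON string)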






Re: Recommendation return score more than 5

2017-12-22 Thread Pat Ferrel
Implicit means you assign a score to the event based on your own guess. 
Explicit uses ratings the user makes. One score is a guess by you (like a 4 for 
buy) and the other is a rating made by the user. ALS comes in 2 flavors, one 
for explicit scoring, used to predict ratings, and the other for implicit scoring, 
used to predict something the user will prefer. 

Make sure your template is using the explicit version of ALS. 
https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback
 
<https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback>
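
For illustration only, here is a minimal PySpark sketch of that distinction (not 
taken from any template; the user, item and rating values are made up). With 
implicitPrefs=False the predictions approximate the input rating scale; with 
implicitPrefs=True they are preference strengths and should not be read as 
ratings. The RMSE check mentioned further down the thread is included.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("als-explicit-sketch").getOrCreate()
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 3.0), (2, 11, 2.0), (2, 12, 4.0)],
    ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          implicitPrefs=False,                 # explicit feedback: predict ratings
          rank=10, maxIter=10, regParam=0.1,   # regParam is the regularization to tune
          coldStartStrategy="drop")
model = als.fit(ratings)

# RMSE: how closely predicted ratings match the (here, training) ratings.
# In practice evaluate on a held-out set via cross-validation.
predictions = model.transform(ratings)
rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(predictions)
print("RMSE:", rmse)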

On Dec 21, 2017, at 11:09 PM, GMAIL <babaevka...@gmail.com> wrote:

I wanted to use the Recomender because I expected that it could predict the 
scores as it is done by MovieLens. And it seems to be doing so, but for some 
reason the input and output scale is different. In imported scores, from 1 to 
5, and in the predicted from 1 to 10.

If by implicit scores you mean events without parameters, then I am aware that 
in essence there is also a score. I looked at the DataSource in the Recommender and 
there were only two events: rate and buy. Rate takes a score, and buy 
implicitly sets the rating to 4 (out of 5, I think).

And I still do not understand exactly where I should look and what to 
correct, so that the incoming and predicted scores are on the same scale.

2017-12-19 4:10 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
There are 2 types of MLlib ALS recommenders last I checked, implicit and 
explicit. Implicit ones you give any arbitrary score, like a 1 for purchase. 
The explicit one you can input ratings and it is expected to predict ratings 
for an individual. But both iirc also have a regularization parameter that 
affects the scoring and is a param so you have to experiment with it using 
cross-validation to see where you get the best results.

There is an old metric used for this type of thing called RMSE 
(root-mean-square error) which, when minimized will give you scores that most 
closely match actual scores in the hold-out set (see wikipedia on 
cross-validation and RMSE). You may have to use explicit ALS and tweak the 
regularization param, to get the lowest RMSE. I doubt anything will guarantee 
them to be in exactly the range of ratings so you’ll then need to pick the 
closest rating.


On Dec 18, 2017, at 10:42 AM, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:

That is, the predicted scores that the Recommender returns can not just be 
multiplied by two, but may be completely wrong? 
I can not, say, just divide the predictions by 2 and pretend that everything is 
fine?

2017-12-18 21:35 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
The UR and the Recommendations Template use very different technology 
underneath. 

In general the scores you get from recommenders are meaningless on their own. 
”Matrix Factorization” recommenders that use ratings as numerical values, like 
the ones in MLlib upon which the Recommendations Template is based, need a 
regularization parameter. I don’t know for sure, but maybe this is why the 
results don’t come out in the range of the input ratings. I haven’t 
looked at the code in a long while.

If you are asking about the UR it would not take numeric ratings and the scores 
cannot be compared to them.

For many reasons that I have written about before I always warn people about 
using ratings, which have been discontinued as a source of input for Netflix 
(who have removed them from their UX) and many other top recommender users. 
There are many reasons for this, not the least of which is that they are 
ambiguous and don’t directly relate to whether a user might like an item. For 
instance most video sources now use something like the length of time a user 
watches a video, and review sites prefer “like” and “dislike”. The first is 
implicit and the second is quite unambiguous. 


On Dec 18, 2017, at 12:32 AM, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:

Does it seem to me or UR strongly differs from Recommender?
At least I can't find method getRatings in class DataSource, which contains all 
events, in particular, "rate", that I needed.

2017-12-18 11:14 GMT+03:00 Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>>:
I didn't solve the problem :(

Now I use the universal recommender

On 18 December 2017 at 09:12, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:
And how did you solve this problem? Did you divide prediction score by 2?

2017-12-18 10:40 GMT+03:00 Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>>:
I got the same problem. I still don't know the answer to your question :(

On 17 December 2017 at 14:07, GMAIL <babaevka...@gmail.

Re: Recommended Configuration

2017-12-15 Thread Pat Ferrel
That is enough for a development machine and may work if your data is relatively 
small, but for big data, clusters of CPUs with a fair amount of RAM and storage 
are required. The telling factor is partly how big your data is but also how it 
combines to form models, which will depend on which recommender you are using. 

We usually build big clusters to analyze the data, then downsize them when we 
see how much is needed. If you have small data, < 1m events, you may try a 
single machine. 


On Dec 15, 2017, at 3:59 AM, GMAIL  wrote:

Hi. 
Could you tell me recommended configuration for comfort work PredictionIO 
Recommender Template. 
I read that I need 16Gb RAM, but what about the rest (CPU/Storage/GPU(?))? 

P.S. sorry for my English.



Re: New Website

2017-12-13 Thread Pat Ferrel
I guess I’m ok with that since the overall site is such a huge improvement but 
please don’t go back to the old logo for the launch, the color schemes don’t 
match and that will ruin the effect of the new design. If you ask 
startbootstrap I bet they agree.

Ship it, there will be lots of changes later, if only content updates.
+1 

On Dec 13, 2017, at 12:38 PM, Andrew Palumbo <ap@outlook.com> wrote:

I apologize.. I've been in back-to-back meetings all week, so it's hectic.. but 
as far as separating the vote, my thinking is to just ship the site as is and then 
swap out the logo if we have -1s on it.



Sent from my Verizon Wireless 4G LTE smartphone


 Original message 
From: Andrew Palumbo <ap@outlook.com>
Date: 12/13/2017 12:35 (GMT-08:00)
To: dev@mahout.apache.org
Subject: RE: New Website

I am +1 on the site absolutely. I suggest that we separate the vote on the 
logo and the site.


Sent from my Verizon Wireless 4G LTE smartphone


 Original message --------
From: Pat Ferrel <p...@occamsmachete.com>
Date: 12/13/2017 09:47 (GMT-08:00)
To: dev@mahout.apache.org
Subject: Re: New Website

Due to 8 years of Ruby cruft I can’t get the Jekyll site running without some 
major jackhammering. I can’t post a screenshot but here is the proposed logo.

https://github.com/apache/mahout/blob/mahout-1981/website/assets/mahout-logo.svg
 
<https://github.com/apache/mahout/blob/mahout-1981/website/assets/mahout-logo.svg>

I encourage people to look at all of this and be judicious with -1s. This has 
been a lot of work, much of the design volunteered by folks at 
startbootstrap.com. IMO the design is awesome. It will put a good, modern, 
clean face on the new Mahout.

The logo is a simple cube, not my favorite, but I’m not going to -1 it; my favorite 
was the M/infinity symbol. If the logo is meant to be a hypercube there are 
simple ways to illustrate it, like some form of this:

https://sarcasticresonance.files.wordpress.com/2017/01/cubes1.png?w=721=2 
<https://sarcasticresonance.files.wordpress.com/2017/01/cubes1.png?w=721=2>


On Dec 6, 2017, at 11:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Since you’ve already built it can you share a screen shot? The mockup I saw on 
Slack looked awesome.

Also a logo change is a lot more far reaching so can we have at least a little 
discussion?


On Dec 6, 2017, at 10:18 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
wrote:

+1, looks great

On Wed, Dec 6, 2017 at 7:43 AM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

> Hey all,
> 
> The new website is available by checking out the mahout-1981 branch.
> 
> If anyone interested wants to help do QA on it-
> 
> 
> Follow these instructions
> https://github.com/apache/mahout/blob/mahout-1981/
> website/developers/how-to-update-the-website.md
> 
> The only difference is, until we merge- you need to checkout mahout-1981 to
> see the new site.
> 
> I've been working on getting all of the links working /etc.
> 
> Would like to plan on launching Monday, if no objections. That gives
> everyone a chance have a look.
> 
> Also, even if a typo or broken link slips through- updating the website is
> easier than ever for committers and contributors alike (after we launch the
> new site).  One simply opens a PR against master, and then when merged, the
> site automatically updates!
> 
> Thanks,
> tg
> 





Re: New Website

2017-12-13 Thread Pat Ferrel
Due to 8 years of Ruby cruft I can’t get the Jekyll site running without some 
major jackhammering. I can’t post a screenshot but here is the proposed logo.

https://github.com/apache/mahout/blob/mahout-1981/website/assets/mahout-logo.svg
 
<https://github.com/apache/mahout/blob/mahout-1981/website/assets/mahout-logo.svg>

I encourage people to look at all of this and be judicious with -1s. This has 
been a lot of work, much of the design volunteered by folks at 
startbootstrap.com. IMO the design is awesome. It will put a good, modern, 
clean face on the new Mahout.

The logo is a simple cube, not my favorite, but I’m not going to -1 it; my favorite 
was the M/infinity symbol. If the logo is meant to be a hypercube there are 
simple ways to illustrate it, like some form of this:

https://sarcasticresonance.files.wordpress.com/2017/01/cubes1.png?w=721=2 
<https://sarcasticresonance.files.wordpress.com/2017/01/cubes1.png?w=721=2>


On Dec 6, 2017, at 11:27 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Since you’ve already built it can you share a screen shot? The mockup I saw on 
Slack looked awesome.

Also a logo change is a lot more far reaching so can we have at least a little 
discussion?


On Dec 6, 2017, at 10:18 AM, Andrew Musselman <andrew.mussel...@gmail.com> 
wrote:

+1, looks great

On Wed, Dec 6, 2017 at 7:43 AM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

> Hey all,
> 
> The new website is available by checking out the mahout-1981 branch.
> 
> If anyone interested wants to help do QA on it-
> 
> 
> Follow these instructions
> https://github.com/apache/mahout/blob/mahout-1981/
> website/developers/how-to-update-the-website.md
> 
> The only difference is, until we merge- you need to checkout mahout-1981 to
> see the new site.
> 
> I've been working on getting all of the links working /etc.
> 
> Would like to plan on launching Monday, if no objections. That gives
> everyone a chance have a look.
> 
> Also, even if a typo or broken link slips through- updating the website is
> easier than ever for committers and contributors alike (after we launch the
> new site).  One simply opens a PR against master, and then when merged, the
> site automatically updates!
> 
> Thanks,
> tg
> 




Re: New Website

2017-12-06 Thread Pat Ferrel
Since you’ve already built it can you share a screen shot? The mockup I saw on 
Slack looked awesome.

Also a logo change is a lot more far reaching so can we have at least a little 
discussion?


On Dec 6, 2017, at 10:18 AM, Andrew Musselman  
wrote:

+1, looks great

On Wed, Dec 6, 2017 at 7:43 AM, Trevor Grant 
wrote:

> Hey all,
> 
> The new website is available by checking out the mahout-1981 branch.
> 
> If anyone interested wants to help do QA on it-
> 
> 
> Follow these instructions
> https://github.com/apache/mahout/blob/mahout-1981/
> website/developers/how-to-update-the-website.md
> 
> The only difference is, until we merge- you need to checkout mahout-1981 to
> see the new site.
> 
> I've been working on getting all of the links working /etc.
> 
> Would like to plan on launching Monday, if no objections. That gives
> everyone a chance have a look.
> 
> Also, even if a typo or broken link slips through- updating the website is
> easier than ever for committers and contributors alike (after we launch the
> new site).  One simply opens a PR against master, and then when merged, the
> site automatically updates!
> 
> Thanks,
> tg
> 



Re: User features to tailor recs in UR queries?

2017-12-05 Thread Pat Ferrel
The user’s possible indicators of taste are encoded in the usage data. Gender 
and other “profile" type data can be encoded as (user-id, gender, gender-id) but 
this is used as a secondary indicator, not as a filter. Only item properties 
are used as filters, for some very practical reasons. For one thing, items are 
what you are recommending, so you would have to establish some relationship 
between items and the gender of buyers. The UR does this with user data in 
secondary indicators but does not filter by these because they are calculated 
properties, not ones assigned by humans like “in-stock” or “language”.

Location is an easy secondary indicator but needs to be encoded with “areas”, 
not lat/lon, so something like (user-id, location-of-purchase, 
country-code+postal-code). This would be triggered when a primary event happens, 
such as a purchase. This way location is accounted for in making 
recommendations without your having to do anything but feed in the data.

Lat/lon proximity filters are not implemented but possible.

One thing to note is that fields used to filter or boost are very different 
from user taste indicators. For one thing, they are never tested for correlation 
with the primary event (purchase, read, watch,…) so they can be very dangerous 
to use unwisely. They are best used for business rules like “only show 
in-stock items” or “in this video carousel show only videos of the mystery genre”. 
But if you use user profile data to filter recommendations you can distort what 
is returned and get bad results. We once had a client that wanted to do this 
against our warnings, filtering by location, gender, and several other things 
known about the user, and got 0 lift in sales. We convinced them to try without 
the “business rules” and got good lift in sales. User taste indicators are best 
left to the correlation test by inputting them as user indicator data—except 
where you purposely want to reduce the recommendations to a subset for a 
business reason.

Put more simply, business rules can kill the value of a recommender; let it 
figure out whether an indicator matters. And always remember that indicators 
apply to users; filters and boosts apply to items and known properties of 
items. It may seem like genre is both a user taste indicator and an item 
property, but if you input it in 2 ways it can be used in 2 ways: 1) to make 
better recommendations, 2) in business rules. They are stored and used in 
completely different ways.
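
As a rough sketch of what the (user-id, indicator, indicator-value) encoding 
above could look like as input events, using the Python SDK; the event names, 
ids, and the convention of sending the value as a target entity are illustrative 
and must match whatever the engine's eventNames expects.

import predictionio

client = predictionio.EventClient(
    access_key="YOUR_ACCESS_KEY",          # placeholder
    url="http://localhost:7070")

# (user-id, gender, gender-id) as a secondary indicator, not a filter.
client.create_event(
    event="gender",                        # must be listed in the engine's eventNames
    entity_type="user",
    entity_id="u-123",
    target_entity_type="item",
    target_entity_id="male")

# (user-id, location-of-purchase, country-code+postal-code), sent when a purchase happens.
client.create_event(
    event="location-of-purchase",
    entity_type="user",
    entity_id="u-123",
    target_entity_type="item",
    target_entity_id="US-98101")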



On Dec 5, 2017, at 7:59 AM, Noelia Osés Fernández  wrote:

Hi all,

I have seen how to use item properties in queries to tailor the recommendations 
returned by the UR.

But I was wondering whether it is possible to use user characteristics to do 
the same. For example, I want to query for recs from the UR but only taking 
into account the history of users that are female (or only using the history of 
users in the same county). Is this possible to do?

I've been reading the UR docs but couldn't find info about this.

Thank you very much!

Best regards,
Noelia




[jira] [Assigned] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-27 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel reassigned MAHOUT-2023:
--

Assignee: Trevor Grant  (was: Pat Ferrel)

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-27 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267877#comment-16267877
 ] 

Pat Ferrel commented on MAHOUT-2023:


This is a big issue. It shows up when you run a Spark CLI but also seems to 
affect GPU bindings written in Scala, disabling both. The fix is somewhere in 
the build system afaict.

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Prepping Release

2017-11-27 Thread Pat Ferrel
https://issues.apache.org/jira/browse/MAHOUT-2023 
 is the only blocker I see. 
It’s a big one since it makes drivers and GPU bindings not work in clusters (I 
think). But the fix is probably easy.


On Nov 27, 2017, at 8:06 AM, Jim Jagielski  wrote:

Looks good to me! Thx!

> On Nov 26, 2017, at 11:59 AM, Trevor Grant  wrote:
> 
> Hey all-
> 
> Making another run at prepping a 0.13.1 Release.
> 
> Please see
> https://issues.apache.org/jira/projects/MAHOUT/versions/12339149
> 
> If anyone has any other issues they think need to be addressed before
> 0.13.1 please make sure the "Affects Version" On the JIRA ticket is
> correctly set, and list "type" as a blocker.
> 
> I think most of the issues on that list are more or less taken care of,
> assuming no other blockers sneak up, will be calling "code freeze" mid
> week.
> 
> Thanks!
> tg




[jira] [Resolved] (MAHOUT-2020) Maven repo structure malformed

2017-11-27 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-2020.

Resolution: Fixed

Trevor found a script in Spark that seems to fix this when used during a build. 
Marking as fixed but we need to document this for source builds.

> Maven repo structure malformed
> --
>
> Key: MAHOUT-2020
> URL: https://issues.apache.org/jira/browse/MAHOUT-2020
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: Creating a project from maven built Mahout using sbt. 
> Made critical since it seems to block using Mahout with sbt. At least I have 
> found no way to do it.
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The maven repo is built with scala 2.10 always in the parent pom's 
> {scala.compat.version} even when you only ask for Scala 2.11, this leads to 
> the 2.11 jars never being found. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Log-likelihood based correlation test?

2017-11-23 Thread Pat Ferrel
Use the default. Tuning with a threshold is only for atypical data and unless 
you have a harness for cross-validation you would not know if you were making 
things worse or better. We have our own tools for this but have never had the 
need for threshold tuning. 

Yes, llrDownsampled(PtP) is the “model”, each doc put into Elasticsearch is a 
sparse representation of a row from it, along with those from PtV, PtC,… Each 
gets a “field” in the doc.


On Nov 22, 2017, at 6:16 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thanks Pat!

How can I tune the threshold?

And when you say "compare to each item in the model", do you mean each row in 
PtP?

On 21 November 2017 at 19:56, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
No. PtP non-zero elements have LLR calculated. The highest scores in the row are 
kept, or ones above some threshold; the rest are removed as “noise". These are 
put into the Elasticsearch model without scores. 

Elasticsearch compares the similarity of the user history to each item in the 
model to find the KNN similar ones. This uses OKAPI BM25 from Lucene, which has 
several benefits over pure cosines (it actually consists of adjustments to 
cosine) and we also use norms. With ES 5 we should see quality improvements due 
to this. 
https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html
 
<https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html>



On Nov 21, 2017, at 1:28 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Pat,

If I understood your explanation correctly, you say that some elements of PtP 
are removed by the LLR (set to zero, to be precise). But the elements that 
survive are calculated by matrix multiplication. The final PtP is put into 
EleasticSearc and when we query for user recommendations ES uses KNN to find 
the items (the rows in PtP) that are most similar to the user's history.

If the non-zero elements of PtP have been calculated by straight matrix 
multiplication, and I'm assuming that the P matrix only has 0s and 1s to 
indicate which items have been purchased by which user, then the elements of 
PtP are either 0 or greater to or equal than 1. However, the scores I get are 
below 1.

So is the KNN using cosine similarity as a metric to calculate the closest 
neighbours? And is the results of this cosine similarity metric what is 
returned as a 'score'?

If it is, when it is greater than 1, is this because the different cosine 
similarities are added together i.e. PtP, PtL... ?

Thank you for all your valuable help!

On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Mahout builds the model by doing matrix multiplication (PtP) then calculating 
the LLR score for every non-zero value. We then keep the top K or use a 
threshold to decide whether to keep or not (both are supported in the UR). LLR 
is a metric for seeing how likely 2 events in a large group are correlated. 
Therefore LLR is only used to remove weak data from the model.
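
A standalone sketch of that test: the 2x2 log-likelihood ratio (G-test) on 
cooccurrence counts, written to mirror the entropy form commonly used in Mahout; 
the counts below are made up.

import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # xlogx(total) minus the sum of xlogx(count): the un-normalized entropy used in the G-test
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """2x2 log-likelihood ratio.
    k11: users who did both A and B, k12: B without A,
    k21: A without B, k22: users who did neither."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))

# Two items each bought by 100 of 10,000 users, bought together by 20 of them:
# far more cooccurrence than the ~1 expected by chance, so the LLR is large.
print(llr(20, 80, 80, 9820))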

So Mahout builds the model then it is put into Elasticsearch which is used as a 
KNN (K-nearest Neighbors) engine. The LLR score is not put into the model only 
an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query and finding items the 
most closely match it. Since PtP will have items in rows and the row will have 
correlating items, this “search” methods work quite well to find items that had 
very similar items purchased with it as are in the user’s history.

=== that is the simple explanation 


Item-based recs take the model items (correlated items by the LLR test) as the 
query and the results are the most similar items—the items with most similar 
correlating items.

The model is items in rows and items in columns if you are only using one 
event. PtP. If you think it through, it is all purchased items in as the row 
key and other items purchased along with the row key. LLR filters out the 
weakly correlating non-zero values (0 mean no evidence of correlation anyway). 
If we didn’t do this it would be purely a “Cooccurrence” recommender, one of 
the first useful ones. But filtering based on cooccurrence strength (PtP values 
without LLR applied to them) produces much worse results than using LLR to 
filter for most highly correlated cooccurrences. You get a similar effect with 
Matrix Factorization but you can only use one type of event for various reasons.

Since LLR is a probabilistic metric that only looks at counts, it can be 
applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC 
(purchase, category-preferences). We did an experiment using Mean Average 
Precision for the UR using video “Likes” vs “Likes” and “Dislikes

[jira] [Commented] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259971#comment-16259971
 ] 

Pat Ferrel commented on MAHOUT-2023:


Yep, the mahout...dependency-reduced.jar excludes anything with 
{{scala.compat.version}} in the name

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258255#comment-16258255
 ] 

Pat Ferrel edited comment on MAHOUT-2023 at 11/20/17 10:40 PM:
---

OK, now that MAHOUT-2020 is resolved, I looked at the scopt issue and found:
* all the correct scopt artifacts exist in remote repos for all scala versions 
and they are being found by the mahout build.
* the ids for artifact etc are correct as per ^^^
* I checked all the tagged versions of Mahout back to 12.0. Not sure when the 
drivers stopped working but there has been no change to any reference to scopt 
in any POM. And since people have been using it and asking questions on the 
mailing list I will assume that up till the last build changes the drivers 
worked.
* The vienna-cl and java to c bindings are in the assembly pom so these classes 
are getting to the Spark Executors.
* I've checked compute-classpath.sh and the mahout script where changes were 
small and not relevant. 
* I've looked at the contents of the mahout*dependency-reduced.jar, which 
should have the things listed below and it does not; it only had guava, apache 
commons and fastutil. It is supposed to have:

{code:xml}

  

  true
  
  
  
 META-INF/LICENSE
  
  
  runtime
  /
  true
  

com.google.guava:guava
com.github.scopt_${scala.compat.version}
com.tdunning:t-digest
org.apache.commons:commons-math3
it.unimi.dsi:fastutil

org.apache.mahout:mahout-native-viennacl_${scala.compat.version}

org.apache.mahout:mahout-native-viennacl-omp_${scala.compat.version}
org.bytedeco:javacpp
  




{code}  

This all leads me to believe that something in the build no longer makes that 
dependency-reduced.jar available to the Java Driver code since those other libs 
in the assembly are probably all hadoop or Spark Executor code, not needed in 
the Mahout driver. This is likely to have been a side effect of the build 
refactoring

[~rawkintrevo_apache] does "dependencies-reduced.jar" which contains Scopt get 
its scala.compat.version fixed? It seems like the jar is missing anything with 
scala.compat.version but this may be a red herring.




was (Author: pferrel):
ok not that MAHOUT-2020 is resolved, I looked at the scopt issue and found:
* all the correct scopt artifact exist in remote repos for all scala versions 
and they are being found by the mahout build.
* the ids for artifact etc are correct as per ^^^
* I checked all the tagged versions of Mahout back to 12.0. Not sure when the 
drivers stopped working but there has been no change to any reference to scopt 
in any POM. And since people have been using it and asking questions on the 
mailing list I will assume that up till the last build changes the drivers 
worked.
* The vienna-cl and java to c bindings are in the assembly pom so these classes 
are getting to the Spark Executors.
* I've checked compute-classpath.sh and the mahout script where changes were 
small and not relevant. 
* I've looked at the contents of the mahout*dependency-reduced.jar, which 
should have the things listed below and it does not, in only had guava, apache 
commons and fastutils. It is supposed to have:

  

  true
  
  
  
 META-INF/LICENSE
  
  
  runtime
  /
  true
  

com.google.guava:guava
com.github.scopt_${scala.compat.version}
com.tdunning:t-digest
org.apache.commons:commons-math3
it.unimi.dsi:fastutil

org.apache.mahout:mahout-native-viennacl_${scala.compat.version}

org.apache.mahout:mahout-native-viennacl-omp_${scala.compat.version}
org.bytedeco:javacpp
  

  

This all leads me to believe that something in the build no longer makes that 
dependency-reduced.jar available to the Java Driver code since those other libs 
in the assembly are probably all hadoop or Spark Executor code, not needed in 
the Mahout driver. This is likely to have been a side effect of the build 
refactoring

[~rawkintrevo_apache] does "dependencies-reduced.jar" which contains Scopt get 
its scala.compat.version fixed? It seems like the jar is missing anything with 
scala.compat.version but this may be a red herring.



> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
>     Environment: any
>        Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-it

Re: Log-likelihood based correlation test?

2017-11-20 Thread Pat Ferrel
Yes, this will show the model. But if you do this a lot there are tools like 
Restlet that you plug in to Chrome. They will allow you to build queries of all 
sorts. For instance 
GET http://localhost:9200/urindex/_search?pretty 

will show the item rows of the UR model put into the index for the integration 
test data. The UI is a bit obtuse but you can scroll down in the right pane 
expanding bits of JSON as you go to see this:

"hits":{
"total": 7,
"max_score": 1,
"hits":[
{
"_index": "urindex_1511033890025",
"_type": "items",
"_id": "Nexus",
"_score": 1,
"_source":{
"defaultRank": 4,
"expires": "2017-11-04T19:01:23.655-07:00",
"countries":["United States", "Canada"],
"id": "Nexus",
"date": "2017-11-02T19:01:23.655-07:00",
"category-pref":["tablets"],
"categories":["Tablets", "Electronics", "Google"],
"available": "2017-10-31T19:01:23.655-07:00",
"purchase":[],
"popRank": 2,
"view":["Tablets"]
}
},

As you can see no purchased items survived the correlation test, one survived 
the view and category-pref correlation tests. The other fields are item 
properties set using $set events and are used with business rules.

 With something like this tool you can even take the query logged in the 
deployed PIO server and send it to see how the query is constructed and what 
the results are (same as you get from the SDK I’ll wager :-)



On Nov 20, 2017, at 7:07 AM, Daniel Gabrieli <dgabri...@salesforce.com> wrote:

There is a REST client for Elasticsearch and bindings in many popular languages 
but to get started quickly I found this commands helpful:

List Indices:

curl -XGET 'localhost:9200/_cat/indices?v'

Get some documents from an index:

curl -XGET 'localhost:9200/<index-name>/_search?q=*'

Then look at the "_source" in the document to see what values are associated 
with the document.

More info here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#_source
 
<https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#_source>

this might also be helpful to work through a single specific query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html
 
<https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html>





On Mon, Nov 20, 2017 at 9:49 AM Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:
Thanks Daniel!

And excuse my ignorance but... how do you inspect the ES index?

On 20 November 2017 at 15:29, Daniel Gabrieli <dgabri...@salesforce.com 
<mailto:dgabri...@salesforce.com>> wrote:
There is this cli tool and article with more information that does produce 
scores:

https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html 
<https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html>

But I don't know of any commands that return diagnostics about LLR from the PIO 
framework / UR engine. That would be a nice feature if it doesn't exist. The 
way I've gotten some insight into what the model is doing when using PIO 
/ UR is by inspecting the Elasticsearch index that gets created, because it 
has the "significant" values populated in the documents (though not the actual 
LLR scores).

On Mon, Nov 20, 2017 at 7:22 AM Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:
This thread is very enlightening, thank you very much!

Is there a way I can see what the P, PtP, and PtL matrices of an app are? In 
the handmade case, for example?

Are there any pio calls I can use to get these?

On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Mahout builds the model by doing matrix multiplication (PtP) then calculating 
the LLR score for every non-zero value. We then keep the top K or use a 
threshold to decide whether to keep or not (both are supported in the UR). LLR 
is a metric for seeing how likely 2 events in a large group are correlated. 
Therefore LLR is only used to remove weak data from the model.

So Mahout builds the model then it is put into Elasticsearch which is used as a 
KNN (K-nearest Neighbors) engine. The LLR score is not put into the model only 
an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query and finding the items 
that most closely match it. Since PtP will have items in rows and the row will 
have correlating items, this “search” method works quite well to find items 
whose co-purchased items are very similar to those in the user’s history.

=== th

[jira] [Commented] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-18 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258258#comment-16258258
 ] 

Pat Ferrel commented on MAHOUT-2023:


Whoa, that is a big clue I think. Everything without a scala.compat.version is 
included in the file mahout-spark_2.10-0.13.1-SNAPSHOT-dependency-reduced.jar 
or whatever is generated for the Scala version but none of the classes that use 
scala.compat.version to resolve the classname.

Big clue but not sure where it leads [~rawkintrevo_apache] Any idea where to 
look from here?

> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: any
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
> get a fatal exception due to missing scopt classes.
> Probably a build issue related to incorrect versions of scopt being looked 
> for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-11-18 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258255#comment-16258255
 ] 

Pat Ferrel edited comment on MAHOUT-2023 at 11/18/17 11:32 PM:
---

OK, now that MAHOUT-2020 is resolved, I looked at the scopt issue and found:
* all the correct scopt artifacts exist in remote repos for all scala versions 
and they are being found by the mahout build.
* the ids for artifact etc are correct as per ^^^
* I checked all the tagged versions of Mahout back to 12.0. Not sure when the 
drivers stopped working but there has been no change to any reference to scopt 
in any POM. And since people have been using it and asking questions on the 
mailing list I will assume that up till the last build changes the drivers 
worked.
* The vienna-cl and java to c bindings are in the assembly pom so these classes 
are getting to the Spark Executors.
* I've checked compute-classpath.sh and the mahout script where changes were 
small and not relevant. 
* I've looked at the contents of the mahout*dependency-reduced.jar, which 
should have the things listed below and it does not; it only had guava, apache 
commons and fastutil. It is supposed to have:

  

  true
  
  
  
 META-INF/LICENSE
  
  
  runtime
  /
  true
  

com.google.guava:guava
com.github.scopt_${scala.compat.version}
com.tdunning:t-digest
org.apache.commons:commons-math3
it.unimi.dsi:fastutil

org.apache.mahout:mahout-native-viennacl_${scala.compat.version}

org.apache.mahout:mahout-native-viennacl-omp_${scala.compat.version}
org.bytedeco:javacpp
  

  

This all leads me to believe that something in the build no longer makes that 
dependency-reduced.jar available to the Java Driver code since those other libs 
in the assembly are probably all hadoop or Spark Executor code, not needed in 
the Mahout driver. This is likely to have been a side effect of the build 
refactoring

[~rawkintrevo_apache] does "dependencies-reduced.jar" which contains Scopt get 
its scala.compat.version fixed? It seems like the jar is missing anything with 
scala.compat.version but this may be a red herring.




was (Author: pferrel):
ok not that MAHOUT-2020 is resolved, I looked at the scopt issue and found:
* all the correct scopt artifact exist in remote repos for all scala versions 
and they are being found by the mahout build.
* the ids for artifact etc are correct as per ^^^
* I checked all the tagged versions of Mahout back to 12.0. Not sure when the 
drivers stopped working but there has been no change to any reference to scopt 
in any POM. And since people have been using it and asking questions on the 
mailing list I will assume that up till the last build changes the drivers 
worked.
* The vienna-cl and java to c bindings are in the assembly pom so these classes 
are getting to the Spark Executors.
* I've checked compute-classpath.sh and the mahout script where changes were 
small and not relevant. 
* I've looked at the contents of the mahout*dependency-reduced.jar, which 
should have the things listed below and it does not, in only had guava, apache 
commons and fastutils. It is supposed to have:

 {{ 

  true
  
  
  
 META-INF/LICENSE
  
  
  runtime
  /
  true
  

com.google.guava:guava
com.github.scopt_${scala.compat.version}
com.tdunning:t-digest
org.apache.commons:commons-math3
it.unimi.dsi:fastutil

org.apache.mahout:mahout-native-viennacl_${scala.compat.version}

org.apache.mahout:mahout-native-viennacl-omp_${scala.compat.version}
org.bytedeco:javacpp
  

  }}

This all leads me to believe that something in the build no longer makes that 
dependency-reduced.jar available to the Java Driver code since those other libs 
in the assembly are probably all hadoop or Spark Executor code, not needed in 
the Mahout driver. This is likely to have been a side effect of the build 
refactoring

[~rawkintrevo_apache] does "dependencies-reduced.jar" which contains Scopt get 
its scala.compat.version fixed? This doesn't seem to be the problem but is aa 
question nonetheless.



> Drivers broken, scopt classes not found
> ---
>
> Key: MAHOUT-2023
> URL: https://issues.apache.org/jira/browse/MAHOUT-2023
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
>     Environment: any
>        Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.1
>
>
> Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
&

Re: Error in getting Total Events in a predictionIo App

2017-11-14 Thread Pat Ferrel
You should use pio 0.12.0 if you need Elasticsearch 5.x


On Nov 14, 2017, at 6:39 AM, Abhimanyu Nagrath  
wrote:

Hi, I am new to PredictionIO, using version 0.11-incubating (Spark 2.6.1, 
HBase 1.2.6, Elasticsearch 5.2.1). I started the prediction server with 
./pio-start-all and checked pio status; these are working fine. Then I created 
an app 'testApp' and imported some events into that PredictionIO app. Now, in 
order to verify the count of imported events, I ran the following commands:

 1. pio-shell --with-spark
 2. import org.apache.predictionio.data.store.PEventStore
 3. val eventsRDD = PEventStore.find(appName="testApp")(sc)

I got the error:

ERROR Storage$: Error initializing storage client for source ELASTICSEARCH
java.lang.ClassNotFoundException: elasticsearch.StorageClient
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at 
org.apache.predictionio.data.storage.Storage$.getClient(Storage.scala:228)
at 
org.apache.predictionio.data.storage.Storage$.org$apache$predictionio$data$storage$Storage$$updateS2CM(Storage.scala:254)
at 
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
at 
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
at 
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
at 
org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:215)
at 
org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:284)
at 
org.apache.predictionio.data.storage.Storage$.getDataObjectFromRepo(Storage.scala:269)
at 
org.apache.predictionio.data.storage.Storage$.getMetaDataApps(Storage.scala:387)
at 
org.apache.predictionio.data.store.Common$.appsDb$lzycompute(Common.scala:27)
at org.apache.predictionio.data.store.Common$.appsDb(Common.scala:27)
at 
org.apache.predictionio.data.store.Common$.appNameToId(Common.scala:32)
at 
org.apache.predictionio.data.store.PEventStore$.find(PEventStore.scala:71)
at 
$line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
at $line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
at $line19.$read$$iwC$$iwC$$iwC$$iwC.(:39)
at $line19.$read$$iwC$$iwC$$iwC.(:41)
at $line19.$read$$iwC$$iwC.(:43)
at $line19.$read$$iwC.(:45)
at $line19.$read.(:47)
at $line19.$read$.(:51)
at $line19.$read$.()
at $line19.$eval$.(:7)
at $line19.$eval$.()
at $line19.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org 
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org 

Re: Which template for predicting ratings?

2017-11-13 Thread Pat Ferrel
What I was saying is the UR can use ratings, but not predict them. Use MLlib 
ALS recommenders if you want to predict them for all items.


On Nov 13, 2017, at 9:32 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

What we did in the article I attached is assume 1-2 is dislike, and 4-5 is like.

These are treated as indicators and will produce a score from the recommender 
but these do not relate to 1-5 scores.

If you need to predict what the user would score an item MLlib ALS templates 
will do it.
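
A tiny sketch of that 1-2 / 4-5 split as a mapping from star ratings to 
indicator events; the field names follow the PIO event API convention, and 
treating 3-star ratings as dropped is an assumption for illustration, not 
something taken from the article.

def rating_to_indicator(user_id, item_id, stars):
    """Map a 1-5 star rating to a categorical indicator event:
    1-2 -> dislike, 4-5 -> like, 3 is dropped as ambiguous."""
    if stars <= 2:
        event = "dislike"
    elif stars >= 4:
        event = "like"
    else:
        return None
    return {"event": event,
            "entityType": "user", "entityId": str(user_id),
            "targetEntityType": "item", "targetEntityId": str(item_id)}

print(rating_to_indicator("u-1", "i-42", 5))   # sent to the event server as a "like"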



On Nov 13, 2017, at 2:42 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Hi Pat,

I truly appreciate your advice.

However, what to do with a client that is adamant that they want to display the 
predicted ratings in the form of 1 to 5-stars? That's my case right now. 

I will pose a more concrete question. Is there any template for which the 
scores predicted by the algorithm are in the same range as the ratings in the 
training set?

Thank you very much for your help!
Noelia

On 10 November 2017 at 17:57, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Any of the Spark MLlib ALS recommenders in the PIO template gallery support 
ratings.

However I must warn that ratings are not very good for recommendations and none 
of the big players use ratings anymore, Netflix doesn’t even display them. The 
reason is that your 2 may be my 3 or 4 and that people rate different 
categories differently. For instance Netflix found Comedies were rated lower 
than Independent films. There have been many solutions proposed and tried but 
none have proven very helpful.

There is another more fundamental problem, why would you want to recommend the 
highest rated item? What do you buy on Amazon or watch on Netflix? Are they 
only your highest rated items? Research has shown that they are not. There was 
a whole misguided movement around ratings that affected academic papers and 
cross-validation metrics that has fairly well been discredited. It all came 
from the Netflix prize that used both. Netflix has since led the way in 
dropping ratings as they saw the things I have mentioned.

What do you do? Categorical indicators work best (like, dislike) or implicit 
indicators (buy) that are unambiguous. If a person buys something, they like 
it; if they rate it 3, do they like it? I buy many 3-rated items on Amazon if I 
need them. 

My advice is drop ratings and use thumbs up or down. These are unambiguous and 
the thumbs down can be used in some cases to predict thumbs up: 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/ 
<https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/>
 This uses data from a public web site to show significant lift by using “like” 
and “dislike” in recommendations. This used the Universal Recommender.


On Nov 10, 2017, at 5:02 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:


Hi all,

I'm new to PredictionIO so I apologise if this question is silly.

I have an application in which users are rating different items in a scale of 1 
to 5 stars. I want to recommend items to a new user and give her the predicted 
rating in number of stars. Which template should I use to do this? Note that I 
need the predicted rating to be in the same range of 1 to 5 stars.

Is it possible to do this with the ecommerce recommendation engine?

Thank you very much for your help!
Noelia









-- 
Noelia Osés Fernández, PhD
Senior Researcher | Vicomtech



Re: Does PIO support [ --master yarn --deploy-mode cluster ]?

2017-11-13 Thread Pat Ferrel
yarn-cluster mode is supported but extra config needs to be set so the driver 
can be run on a remote machine.

I have seen instructions for this on the PIO mailing list.



On Nov 12, 2017, at 7:30 PM, wei li  wrote:

Hi Pat
Thanks a lot for your advice.

We are using [yarn-client] mode now, UR trains well, and we can monitor the 
output log at the pio application console.

I tried to find a way to use [yarn-cluster] mode, to submit a train job and 
shut down the pio application (in docker) immediately
(and monitor the job process at the hadoop cluster website instead of the pio 
application console).
But then I met errors like this: file path [file://xxx.jar] cannot be found.

Maybe [yarn-cluster] mode is not supported now. I will keep looking for the 
explanation.


On Saturday, November 11, 2017 at 12:41:33 AM UTC+8, pat wrote:
Yes PIO support Yarn but you may have more luck getting an explanation on the 
PredictionIO mailing list.
Subscribe here: http://predictionio.incubator.apache.org/support/ 


On Nov 9, 2017, at 11:33 PM, wei li  wrote:

Hi, all

Any one have any idea about this?






Re: "LLR with time"

2017-11-12 Thread Pat Ferrel
oid interesting content being
> overwhelmed. You might also be able to spot content that has intense
> interest from a sub-population as opposed to diffuse interest from a mass
> population.
> 
> You can also use novelty and trending boosts for content in the normal
> recommendation engine. I have avoided this in the past because I felt it
> was better to have specialized pages for what's new and hot rather than
> because I had data saying it was bad to do. I have put a very weak
> recommendation effect on the what's hot pages so that people tend to see
> trending material that they like. That doesn't help on what's new pages for
> obvious reasons unless you use a touch of second order recommendation.
> 
> 
> 
> 
> 
> On Sat, Nov 11, 2017 at 11:00 PM, Johannes Schulte <
> johannes.schu...@gmail.com> wrote:
> 
>> Well the greece thing was just an example for a thing you don't know
>> upfront - it could be any of the modeled feature on the cross recommender
>> input side (user segment, country, city, previous buys), some
> subpopulation
>> getting active, so the current approach, probably with sampling that
>> favours newer events, will be the best here. Luckily a sampling strategy
> is
>> a big topic anyway since we're trying to go for the near real time way -
>> pat, you talked about it some while ago on this list and i still have to
>> look at the flink talk from trevor grant but I'm really eager to attack
>> this after years of batch :)
>> 
>> Thanks for your thoughts, I am happy I can rule something out given the
>> domain (poisson llr). Luckily the domain I'm working on is event
>> recommendations, so there is a natural deterministic item expiry (as
>> compared to christmas like stuff).
>> 
>> Again,
>> thanks!
>> 
>> 
>> On Sat, Nov 11, 2017 at 7:00 PM, Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>> 
>>> Inline.
>>> 
>>> On Sat, Nov 11, 2017 at 6:31 PM, Pat Ferrel <p...@occamsmachete.com>
>> wrote:
>>> 
>>>> If Mahout were to use http://bit.ly/poisson-llr it would tend to
> favor
>>>> new events in calculating the LLR score for later use in the
> threshold
>>> for
>>>> whether a co or cross-occurrence iss incorporated in the model.
>>> 
>>> 
>>> I don't think that this would actually help for most recommendation
>>> purposes.
>>> 
>>> It might help to determine that some item or other has broken out of
>>> historical rates. Thus, we might have "hotness" as a detected feature
>> that
>>> could be used as a boost at recommendation time. We might also have
> "not
>>> hotness" as a negative boost feature.
>>> 
>>> Since we have a pretty good handle on the "other" counts, I don't think
>>> that the Poisson test would help much with the cooccurrence stuff
> itself.
>>> 
>>> Changing the sampling rule could make a difference to temporality and
>> would
>>> be more like what Johannes is asking about.
>>> 
>>> 
>>>> But it doesn’t relate to popularity as I think Ted is saying.
>>>> 
>>>> Are you looking for 1) personal recommendations biased by hotness in
>>>> Greece or 2) things hot in Greece?
>>>> 
>>>> 1) create a secondary indicator for “watched in some locale” the
>> local-id
>>>> uses a country-code+postal-code maybe but not lat-lon. Something that
>>>> includes a good number of people/events. Then the query would be
>> user-id,
>>>> and user-locale. This would yield personal recs preferred in the
> user’s
>>>> locale. Athens-west-side in this case.
>>>> 
>>> 
>>> And this works in the current regime. Simply add location tags to the
>> user
>>> histories and do cooccurrence against content. Locations will pop out
> as
>>> indicators for some content and not for others. Then when somebody
>> appears
>>> in some location, their tags will retrieve localized content.
>>> 
>>> For localization based on strict geography, say for restaurant search,
> we
>>> can just add business rules based on geo-search. A very large bank
>> customer
>>> of ours does that, for instance.
>>> 
>>> 
>>>> 2) split the data into locales and do the hot calc I mention. The
> query
>>>> would have no user-id since it is not personalized but would yield
> “hot
>>> in
>>>> Greece”
>>>> 
>>> 
>>> I t

Re: "LLR with time"

2017-11-11 Thread Pat Ferrel
If Mahout were to use http://bit.ly/poisson-llr it would tend to favor new 
events in calculating the LLR score for later use in the threshold for whether 
a co- or cross-occurrence is incorporated in the model. This is very 
interesting and would be useful in cases where you can keep a lot of data or 
where recent data is far more important, like news. This is the time-aware 
G-test you are referencing, as I understand it.

But it doesn’t relate to popularity as I think Ted is saying.

Are you looking for 1) personal recommendations biased by hotness in Greece or 
2) things hot in Greece?

1) create a secondary indicator for “watched in some locale”. The locale-id uses 
a country-code+postal-code maybe, but not lat-lon: something that includes a 
good number of people/events. Then the query would be user-id and user-locale. 
This would yield personal recs preferred in the user’s locale, Athens-west-side 
in this case (see the sketch after this list).
2) split the data into locales and do the hot calc I mention. The query would 
have no user-id since it is not personalized but would yield “hot in Greece”
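
A minimal sketch of what option 1 looks like on the wire, assuming a standard 
PredictionIO event server on port 7070; the event name "watch-locale", the access 
key, and the "GR-11851" locale-id are made-up placeholders, not part of any 
released template:

    # primary event: the user watched an item
    curl -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
      -H "Content-Type: application/json" -d '{
      "event": "watch",
      "entityType": "user", "entityId": "u-123",
      "targetEntityType": "item", "targetEntityId": "video-456"
    }'

    # secondary indicator: the same user watched something in a locale
    # (country-code+postal-code, coarse enough to cover many people/events)
    curl -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
      -H "Content-Type: application/json" -d '{
      "event": "watch-locale",
      "entityType": "user", "entityId": "u-123",
      "targetEntityType": "item", "targetEntityId": "GR-11851"
    }'

The query then carries the user-id plus the user’s current locale-id so the model 
can prefer items watched in that locale.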

Ted’s “Christmas video” tag is what I was calling a business rule and can be 
added to either of the above techniques.

On Nov 11, 2017, at 4:01 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

So ... there are a few different threads here.

1) LLR but with time. Quite possible, but not really what Johannes is
talking about, I think. See http://bit.ly/poisson-llr for a quick
discussion.

2) time varying recommendation. As Johannes notes, this can make use of
windowed counts. The problem is that rarely accessed items should probably
have longer windows so that we use longer term trends when we have less
data.

The good news here is that this some part of this is nearly already in the
code. The trick is that the down-sampling used in the system can be adapted
to favor recent events over older ones. That means that if the meaning of
something changes over time, the system will catch on. Likewise, if
something appears out of nowhere, it will quickly train up. This handles
the popular in Greece right now problem.

But this isn't the whole story of changing recommendations. Another problem
that we commonly face is what I call the christmas music issue. The idea is
that there are lots of recommendations for music that are highly seasonal.
Thus, Bing Crosby fans want to hear White Christmas
<https://www.youtube.com/watch?v=P8Ozdqzjigg> until the day after christmas
at which point this becomes a really bad recommendation. To some degree,
this can be partially dealt with by using temporal tags as indicators, but
that doesn't really allow a recommendation to be completely shut down.

The only way that I have seen to deal with this in the past is with a
manually designed kill switch. As much as possible, we would tag the
obviously seasonal content and then add a filter to kill or downgrade that
content the moment it went out of fashion.
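
A minimal sketch of the tagging half of that kill switch, assuming a 
PredictionIO-style event server as used elsewhere in this thread; the property 
name "seasonal", the item id, and the access key are assumptions for 
illustration. The query-time filter on that property is the manually designed 
business rule:

    # mark obviously seasonal content up front with a $set event
    curl -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
      -H "Content-Type: application/json" -d '{
      "event": "$set",
      "entityType": "item", "entityId": "white-christmas-1954",
      "properties": { "seasonal": ["christmas"] }
    }'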



On Sat, Nov 11, 2017 at 9:43 AM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> Pat, thanks for your help. especially the insights on how you handle the
> system in production and the tips for multiple acyclic buckets.
> Doing the combination of signals when querying sounds okay but as you say,
> it's always hard to find the right boosts without setting up some ltr
> system. If there would be a way to use the hotness when calculating the
> indicators for subpopulations it would be great., especially for a cross
> recommender.
> 
> e.g. people in greece _now_ are viewing this show/product  whatever
> 
> And here the popularity of the recommended item in this subpopulation could
> be overlooked when just looking at the overall derivatives of activity.
> 
> Maybe one could do multiple G-Tests using sliding windows
> * itemA  vs population (classic)
> * itemA(t) vs itemA(t-1)
> ..
> 
> and derive multiple indicators per item to be indexed.
> 
> But this all relies on discretizing time into buckets and not looking at
> the distribution of time between events like in presentation above - maybe
> there is  something way smarter
> 
> Johannes
> 
> On Sat, Nov 11, 2017 at 2:50 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
>> BTW you should take time buckets that are relatively free of daily cycles
>> like 3 day, week, or month buckets for “hot”. This is to remove cyclical
>> effects from the frequencies as much as possible since you need 3 buckets
>> to see the change in change, 2 for the change, and 1 for the event
> volume.
>> 
>> 
>> On Nov 10, 2017, at 4:12 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> 
>> So your idea is to find anomalies in event frequencies to detect “hot”
>> items?
>> 
>> Interesting, maybe Ted will chime in.
>> 
>> What I do is take the frequency, first, and second, derivatives as
>>

Re: "LLR with time"

2017-11-10 Thread Pat Ferrel
BTW you should take time buckets that are relatively free of daily cycles like 
3 day, week, or month buckets for “hot”. This is to remove cyclical effects 
from the frequencies as much as possible since you need 3 buckets to see the 
change in change, 2 for the change, and 1 for the event volume.


On Nov 10, 2017, at 4:12 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

So your idea is to find anomalies in event frequencies to detect “hot” items?

Interesting, maybe Ted will chime in.

What I do is take the frequency, first, and second, derivatives as measures of 
popularity, increasing popularity, and increasingly increasing popularity. Put 
another way popular, trending, and hot. This is simple to do by taking 1, 2, or 
3 time buckets and looking at the number of events, derivative (difference), 
and second derivative. Ranking all items by these values gives various measures 
of popularity or its increase. 
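
A minimal sketch of that bucket arithmetic in shell (the bucket counts are 
made-up numbers for illustration):

    # event counts for one item in the 3 most recent buckets, newest first
    c0=120; c1=90; c2=80
    popular=$c0                          # raw event volume in the newest bucket
    trending=$(( c0 - c1 ))              # first derivative: the change
    hot=$(( (c0 - c1) - (c1 - c2) ))     # second derivative: the change in change
    echo "popular=$popular trending=$trending hot=$hot"

Rank all items by whichever of these you want and store it as the ranking field 
mentioned below.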

If your use is in a recommender you can add a ranking field to all items and 
query for “hot” by using the ranking you calculated. 

If you want to bias recommendations by hotness, query with user history and 
boost by your hot field. I suspect the hot field will tend to overwhelm your 
user history in this case, as it would if you used anomalies, so you’d also have 
to normalize the hotness to some range closer to the one created by the user 
history matching score. I haven’t found a very good way to mix these in a model 
so use hot as a method of backfill if you cannot return enough recommendations 
or in places where you may want to show just hot items. There are several 
benefits to this method of using hot to rank all items including the fact that 
you can apply business rules to them just as normal recommendations—so you can 
ask for hot in “electronics” if you know categories, or hot "in-stock" items, 
or ...

Still anomaly detection does sound like an interesting approach.


On Nov 10, 2017, at 3:13 PM, Johannes Schulte <johannes.schu...@gmail.com> 
wrote:

Hi "all",

I am wondering what would be the best way to incorporate event time
information into the calculation of the G-Test.

There is a claim here
https://de.slideshare.net/tdunning/finding-changes-in-real-data

saying "Time aware variant of G-Test is possible"

I remember i experimented with exponentially decayed counts some years ago
and this involved changing the counts to doubles, but I suspect there is
some smarter way. What I don't get is the relation to a data structure like
T-Digest when working with a lot of counts / cells for every combination of
items. Keeping a t-digest for every combination seems unfeasible.

How would one incorporate event time into recommendations to detect
"hotness" of certain relations? Glad if someone has an idea...

Cheers,

Johannes




Re: "LLR with time"

2017-11-10 Thread Pat Ferrel
So your idea is to find anomalies in event frequencies to detect “hot” items?

Interesting, maybe Ted will chime in.

What I do is take the frequency, first, and second, derivatives as measures of 
popularity, increasing popularity, and increasingly increasing popularity. Put 
another way popular, trending, and hot. This is simple to do by taking 1, 2, or 
3 time buckets and looking at the number of events, derivative (difference), 
and second derivative. Ranking all items by these values gives various measures 
of popularity or its increase. 

If your use is in a recommender you can add a ranking field to all items and 
query for “hot” by using the ranking you calculated. 

If you want to bias recommendations by hotness, query with user history and 
boost by your hot field. I suspect the hot field will tend to overwhelm your 
user history in this case, as it would if you used anomalies, so you’d also have 
to normalize the hotness to some range closer to the one created by the user 
history matching score. I haven’t found a very good way to mix these in a model 
so use hot as a method of backfill if you cannot return enough recommendations 
or in places where you may want to show just hot items. There are several 
benefits to this method of using hot to rank all items including the fact that 
you can apply business rules to them just as normal recommendations—so you can 
ask for hot in “electronics” if you know categories, or hot "in-stock" items, 
or ...

Still anomaly detection does sound like an interesting approach.

 
On Nov 10, 2017, at 3:13 PM, Johannes Schulte  
wrote:

Hi "all",

I am wondering what would be the best way to incorporate event time
information into the calculation of the G-Test.

There is a claim here
https://de.slideshare.net/tdunning/finding-changes-in-real-data

saying "Time aware variant of G-Test is possible"

I remember i experimented with exponentially decayed counts some years ago
and this involved changing the counts to doubles, but I suspect there is
some smarter way. What I don't get is the relation to a data structure like
T-Digest when working with a lot of counts / cells for every combination of
items. Keeping a t-digest for every combination seems unfeasible.

How would one incorporate event time into recommendations to detect
"hotness" of certain relations? Glad if someone has an idea...

Cheers,

Johannes



Re: PIO + ES5 + Universal Recommender

2017-11-08 Thread Pat Ferrel
“mvn not found”, install mvn. 

This step will go away with the next Mahout release.
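
On the Ubuntu setup described below that is just the stock package (no special 
version is required for this step):

    sudo apt-get update
    sudo apt-get install maven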


On Nov 8, 2017, at 2:41 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thanks Pat!

I have followed the instructions on the README.md file of the mahout folder:


You will need to build this using Scala 2.11. Follow these instructions

 - install Scala 2.11 as your default version

I've done this with the following commands:

# scala install
wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
sudo dpkg -i scala-2.11.7.deb
# sbt installation
echo "deb https://dl.bintray.com/sbt/debian <https://dl.bintray.com/sbt/debian> 
/" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 
<http://keyserver.ubuntu.com/> --recv 642AC823
sudo apt-get update
sudo apt-get install sbt

 - download this repo: `git clone https://github.com/actionml/mahout.git`
 - checkout the speedup branch: `git checkout sparse-speedup-13.0`
 - edit the build script `build-scala-2.11.sh` to put the custom repo where you want it

This file is now:

#!/usr/bin/env bash

git checkout sparse-speedup-13.0

mvn clean package -DskipTests -Phadoop2 -Dspark.version=2.1.1 
-Dscala.version=2.11.11 -Dscala.compat.version=2.11

echo "Make sure to put the custom repo in the right place for your machine!"
echo "This location will have to be put into the Universal Recommenders 
build.sbt"

mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/hdfs/target/mahout-hdfs-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-hdfs -Dversion=0.13.0
mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/math/target/mahout-math-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-math -Dversion=0.13.0
mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/math-scala/target/mahout-math-scala_2.11-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-math-scala_2.11 
-Dversion=0.13.0
mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/spark/target/mahout-spark_2.11-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-spark_2.11 -Dversion=0.13.0

 - execute the build script `build-scala-2.11.sh`

This outputed the following:

$ ./build-scala-2.11.sh
M build-scala-2.11.sh
Already on 'sparse-speedup-13.0'
Your branch is up-to-date with 'origin/sparse-speedup-13.0'.
./build-scala-2.11.sh: line 5: mvn: command not found
Make sure to put the custom repo in the right place for your machine!
This location will have to be put into the Universal Recommenders build.sbt
./build-scala-2.11.sh: line 10: mvn: command not found
./build-scala-2.11.sh: line 11: mvn: command not found
./build-scala-2.11.sh: line 12: mvn: command not found
./build-scala-2.11.sh: line 13: mvn: command not found


Do I need to install Maven? If so, it is not said in the PredictionIO 
installation instructions nor in the Mahout instructions. 

I apologise if this is an obvious question for those familiar with the Apache 
projects, but for an outsider like me it helps when everything (even the most 
silly details) is spelled out. Thanks a lot for all your invaluable help!!
 

On 7 November 2017 at 20:58, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Very sorry, it was incorrectly set to private. Try it again.




On Nov 7, 2017, at 7:26 AM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:

https://github.com/actionml/mahout <https://github.com/actionml/mahout>






Re: PIO + ES5 + Universal Recommender

2017-11-07 Thread Pat Ferrel
Very sorry, it was incorrectly set to private. Try it again.




On Nov 7, 2017, at 7:26 AM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:

https://github.com/actionml/mahout <https://github.com/actionml/mahout>




Re: PIO + ES5 + Universal Recommender

2017-11-07 Thread Pat Ferrel
Very sorry, it was incorrectly set to private. Try it again.



On Nov 7, 2017, at 12:52 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thank you, Pat!

I have a problem with the Mahout repo, though. I get the following error 
message:

remote: Repository not found.
fatal: repository 'https://github.com/actionml/mahout.git/' not found


On 3 November 2017 at 22:27, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
The exclusion rules are working now along with the integration-test. We have 
some cleanup but please feel free to try it.

Please note the upgrade issues mentioned below before you start, fresh installs 
should have no such issues.


On Nov 1, 2017, at 4:30 PM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:

Ack, I hate this &^%&%^&  touchbar!

What I meant to say was:


We have a version of the universal recommender working with PIO-0.12.0 that is 
ready for brave souls to test. This includes some speedups and quality of 
recommendation improvements, not yet documented. 

Known bugs: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the integration test, Lucene and ES have changed their 
scoring method and so you cannot compare the old scores to the new ones. The 
test will be fixed before release but do trust it to populate PIO with some 
sample data you can play with.

You must build PredictionIO with the default parameters, so just run 
`./make-distribution`. This will require you to install Scala 2.11, Spark 2.1.1 
or greater, ES 5.5.2 or greater, and Hadoop 2.6 or greater. If you have issues 
getting pio to build and run, send questions to the PIO mailing list. Once PIO 
is running, test with `pio status` and `pio app list`. You will need to create 
an app and import your data; run the integration test to get some sample data 
installed in the “handmade” app.

*Backup your data*: moving from ES 1 to ES 5 will delete all data. Actually it is 
even worse: the data is still in HBase but you can’t get at it, so to upgrade do 
the following
`pio export` with pio < 0.12.0 =*Before upgrade!*=
`pio data-delete` all your old apps =*Before upgrade!*=
build and install pio 0.12.0 including all the services =*The point of no 
return!*=
`pio app new …` and `pio import…` any needed datasets
download and build Mahout for Scala 2.11 from this repo: 
https://github.com/actionml/mahout.git and follow the instructions in the README.md
download the UR from here: https://github.com/actionml/universal-recommender.git 
and checkout branch 0.7.0-SNAPSHOT
replace the line: `resolvers += "Local Repository" at 
"file:///Users/pat/.custom-scala-m2/repo"` with your path to the local mahout build
build the UR with `pio build` or run the integration test to get sample data 
put into PIO `./examples/integration-test`

This will use the released PIO and alpha UR

This will be much easier when it’s released







Re: Implementing cart and wishlist item events into Ecommerce recommendation template

2017-11-04 Thread Pat Ferrel
Oh, forgot to say the most important part. The ECom recommender does not 
support shopping carts unless you train on (cart-id, item-id of the item 
added to cart). And even then I’m not sure you can query with the current cart’s 
contents since the item-based query is for a single item. The cart-id takes the 
place of user-id in this method of training, and there may be a way to do this 
in the MLlib implementation, but it isn’t surfaced in the PIO interface. It 
would be treated as an anonymous user (one not in the training data) and would 
take an item list in the query. Look into the MLlib ALS library and expect to 
modify the template code.

There is also the Complimentary Purchase template, which does shopping carts 
but, from my rather prejudiced viewpoint, if you need to switch templates use 
one that supports every use-case you are likely to need.


On Nov 4, 2017, at 9:34 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

The Universal Recommender supports several types of “item-set” recommendations:
1) Complimentary Purchases. which are things bought with what you have in the 
shopping cart. This is done by training on (cart-id, “add-to-cart”, item-id) 
and querying with the current items in the user’s cart. 
2) Similar items to those in the cart, this is done by training with the 
typical events like purchase, detail-view, add-to-cart, etc. for each user, 
then the query is the contents of the shopping cart as an “item-set”. This gives 
things similar to what is in the cart, which is usually not the precise semantics for 
a shopping cart but fits other cases of using an item-set, like wish-lists
3) take the last n items viewed and query with them and you have 
“recommendations based on your recent views” In this case you need purchases as 
the primary event because you want to recommend purchases but using only 
“detail-views” to do so. 
4) some other combinations like favorites, watch-lists, etc.

These work slightly differently and I could give examples of how they are used in 
Amazon but #1 is typically used for the “shopping cart"
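
A minimal sketch of the training data for option 1, assuming a standard 
PredictionIO event server on port 7070; the access key and the cart/item ids are 
placeholders. The cart-id simply stands in for the user-id:

    curl -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
      -H "Content-Type: application/json" -d '{
      "event": "add-to-cart",
      "entityType": "user", "entityId": "cart-8812",
      "targetEntityType": "item", "targetEntityId": "sku-1234"
    }'

At query time you would then send the contents of the current cart as the 
item-set, as described in option 1 above.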


On Nov 3, 2017, at 7:13 PM, ilker burak <ilkerbu...@gmail.com 
<mailto:ilkerbu...@gmail.com>> wrote:

Hi Vaghan,
I will check that. Thanks for your help and quick answer about this.

On Fri, Nov 3, 2017 at 8:02 AM, Vaghawan Ojha <vaghawan...@gmail.com 
<mailto:vaghawan...@gmail.com>> wrote:
Hey there, 

did you consider seeing this: 
https://predictionio.incubator.apache.org/templates/ecommercerecommendation/train-with-rate-event/
 
<https://predictionio.incubator.apache.org/templates/ecommercerecommendation/train-with-rate-event/>

for considering such events you may want to use the $set events as shown in the 
template documentation. I use universal recommender though since already 
supports these requirements. 


Hope this helps. 

On Fri, Nov 3, 2017 at 10:37 AM, ilker burak <ilkerbu...@gmail.com 
<mailto:ilkerbu...@gmail.com>> wrote:
Hello,
I am using Ecommerce recommendation template. Currently i imported view and buy 
events and it works. To improve results accuracy, how can i modify code to 
import and use events like 'user added item to cart' and 'user added item to 
wishlist'? I know this template supports adding new events but there is only an 
example on the site about how to implement the rate event, and I am not using rate 
data.
Thank you

Ilker







Re: Implementing cart and wishlist item events into Ecommerce recommendation template

2017-11-04 Thread Pat Ferrel
The Universal Recommender supports several types of “item-set” recommendations:
1) Complimentary Purchases. which are things bought with what you have in the 
shopping cart. This is done by training on (cart-id, “add-to-cart”, item-id) 
and querying with the current items in the user’s cart. 
2) Similar items to those in the cart, this is done by training with the 
typical events like purchase, detail-view, add-to-cart, etc. for each user, 
then the query is the contents of the shopping cart as an “item-set”. This gives 
things similar to what is in the cart, which is usually not the precise semantics for 
a shopping cart but fits other cases of using an item-set, like wish-lists
3) take the last n items viewed and query with them and you have 
“recommendations based on your recent views” In this case you need purchases as 
the primary event because you want to recommend purchases but using only 
“detail-views” to do so. 
4) some other combinations like favorites, watch-lists, etc.

These work slightly differently and I could give examples of how they are used in 
Amazon but #1 is typically used for the “shopping cart"


On Nov 3, 2017, at 7:13 PM, ilker burak  wrote:

Hi Vaghan,
I will check that. Thanks for your help and quick answer about this.

On Fri, Nov 3, 2017 at 8:02 AM, Vaghawan Ojha > wrote:
Hey there, 

did you consider seeing this: 
https://predictionio.incubator.apache.org/templates/ecommercerecommendation/train-with-rate-event/
 


for considering such events you may want to use the $set events as shown in the 
template documentation. I use universal recommender though since already 
supports these requirements. 


Hope this helps. 

On Fri, Nov 3, 2017 at 10:37 AM, ilker burak > wrote:
Hello,
I am using Ecommerce recommendation template. Currently i imported view and buy 
events and it works. To improve results accuracy, how can i modify code to 
import and use events like 'user added item to cart' and 'user added item to 
wishlist'? I know this template supports adding new events but there is only an 
example on the site about how to implement the rate event, and I am not using rate 
data.
Thank you

Ilker





Re: PIO + ES5 + Universal Recommender

2017-11-01 Thread Pat Ferrel
Ack, I hate this &^%&%^&  touchbar!

What I meant to say was:


We have a version of the universal recommender working with PIO-0.12.0 that is 
ready for brave souls to test. This includes some speedups and quality of 
recommendation improvements, not yet documented. 

Known bugs: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the integration test, Lucene and ES have changed their 
scoring method and so you cannot compare the old scores to the new ones. The 
test will be fixed before release but do trust it to populate PIO with some 
sample data you can play with.

You must build PredictionIO with the default parameters, so just run 
`./make-distribution`. This will require you to install Scala 2.11, Spark 2.1.1 
or greater, ES 5.5.2 or greater, and Hadoop 2.6 or greater. If you have issues 
getting pio to build and run, send questions to the PIO mailing list. Once PIO 
is running, test with `pio status` and `pio app list`. You will need to create 
an app and import your data; run the integration test to get some sample data 
installed in the “handmade” app.

*Backup your data*: moving from ES 1 to ES 5 will delete all data. Actually it is 
even worse: the data is still in HBase but you can’t get at it, so to upgrade do 
the following
`pio export` with pio < 0.12.0 =*Before upgrade!*=
`pio data-delete` all your old apps =*Before upgrade!*=
build and install pio 0.12.0 including all the services =*The point of no 
return!*=
`pio app new …` and `pio import…` any needed datasets
download and build Mahout for Scala 2.11 from this repo: 
https://github.com/actionml/mahout.git  
follow the instructions in the README.md
download the UR from here: 
https://github.com/actionml/universal-recommender.git 
 and checkout branch 
0.7.0-SNAPSHOT
replace the line: `resolvers += "Local Repository" at 
"file:///Users/pat/.custom-scala-m2/repo"` with your path to the local mahout build
build the UR with `pio build` or run the integration test to get sample data 
put into PIO `./examples/integration-test`

This will use the released PIO and alpha UR

This will be much easier when it’s released

PIO + ES5 + Universal Recommender

2017-11-01 Thread Pat Ferrel
We have a version working here: 
https://github.com/actionml/universal-recommender.git 

checkout 0.7.0-SNAPSHOT once you pull the repo. 

Known bug: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the integration test, Lucene and ES have changed their 
scoring method and so you cannot compare the old scores to the new ones. the 
test will be fixed before release.

You must build the Template with pio v0.12.0 using Scala 2.11, Spark 2.2.1, ES 
5.

[jira] [Comment Edited] (MAHOUT-2020) Maven repo structure malformed

2017-11-01 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234411#comment-16234411
 ] 

Pat Ferrel edited comment on MAHOUT-2020 at 11/1/17 5:22 PM:
-

Nothing to do with SBT, look in the parent POM, it always has 
{scala.compat.version} = 2.10; the rest of the child poms inherit that even if 
they were built for scala 2.11


was (Author: pferrel):
Nothing to do with SBT, look in the parent POM, it always has 
{scala.compat.version} = 2.10; the rest of the child poms inherit that and it is 
wrong when building for scala 2.11

> Maven repo structure malformed
> --
>
> Key: MAHOUT-2020
> URL: https://issues.apache.org/jira/browse/MAHOUT-2020
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: Creating a project from maven built Mahout using sbt. 
> Made critical since it seems to block using Mahout with sbt. At least I have 
> found no way to do it.
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The maven repo is built with scala 2.10 always in the parent pom's 
> {scala.compat.version} even when you only ask for Scala 2.11, this leads to 
> the 2.11 jars never being found. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (MAHOUT-1951) Drivers don't run with remote Spark

2017-10-31 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1951:
---
Comment: was deleted

(was: odd, I did not resolve this as history says, at least did not mean to.)

> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, CLI, Collaborative Filtering
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MAHOUT-2020) Maven repo structure malformed

2017-10-31 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190340#comment-16190340
 ] 

Pat Ferrel edited comment on MAHOUT-2020 at 11/1/17 1:23 AM:
-

This may be a non-issue. Trevor said in email:

{quote}The spark is included via maven classifier-

the sbt line should be

libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" %
"0.13.1-SNAPSHOT" classifier "spark_2.1"
{quote}

This is a red herring, the real issue is scala 2.11 never builds correctly 
because poms end up with 2.10 as the scala.compat.version




was (Author: pferrel):
This may be a non-issue. Trevor said in email:

{quote}The spark is included via maven classifier-

the sbt line should be

libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" %
"0.13.1-SNAPSHOT" classifier "spark_2.1"
{quote}



> Maven repo structure malformed
> --
>
> Key: MAHOUT-2020
> URL: https://issues.apache.org/jira/browse/MAHOUT-2020
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: Creating a project from maven built Mahout using sbt. 
> Made critical since it seems to block using Mahout with sbt. At least I have 
> found no way to do it.
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The maven repo is built with scala 2.10 always in the parent pom's 
> {scala.compat.version} even when you only ask for Scala 2.11, this leads to 
> the 2.11 jars never being found. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MAHOUT-2020) Maven repo structure malformed

2017-10-31 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-2020:
---
Description: The maven repo is built with scala 2.10 always in the parent 
pom's {scala.compat.version} even when you only ask for Scala 2.11, this leads 
to the 2.11 jars never being found.   (was: The maven repo should build:
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

substitute Spark version for -2.1, so -1.6 etc.

The build.sbt  `libraryDependencies` line then will be:
`"org.apache.mahout" %% "mahout-spark-2.1" % “0.13.1-SNAPSHOT`

This is parsed by sbt to yield the path of :
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

The outcome of `mvn clean install` currently is something like:
org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar

This has no effect on the package structure, only artifact naming and maven 
repo structure.)

> Maven repo structure malformed
> --
>
> Key: MAHOUT-2020
> URL: https://issues.apache.org/jira/browse/MAHOUT-2020
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: Creating a project from maven built Mahout using sbt. 
> Made critical since it seems to block using Mahout with sbt. At least I have 
> found no way to do it.
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The maven repo is built with scala 2.10 always in the parent pom's 
> {scala.compat.version} even when you only ask for Scala 2.11, this leads to 
> the 2.11 jars never being found. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MAHOUT-2020) Maven repo structure compatibility with SBT

2017-10-26 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221263#comment-16221263
 ] 

Pat Ferrel edited comment on MAHOUT-2020 at 10/26/17 9:28 PM:
--

my script for creating a repo:

{
mvn deploy:deploy-file -Durl=file:///Users/pat/.custom-scala-m2/repo/ 
-Dfile=//Users/pat/mahout/hdfs/target/mahout-hdfs-0.13.0.jar 
-DgroupId=org.apache.mahout -DartifactId=mahout-hdfs -Dversion=0.13.0
mvn deploy:deploy-file -Durl=file:///Users/pat/.custom-scala-m2/repo/ 
-Dfile=//Users/pat/mahout/math/target/mahout-math-0.13.0.jar 
-DgroupId=org.apache.mahout -DartifactId=mahout-math -Dversion=0.13.0
mvn deploy:deploy-file -Durl=file:///Users/pat/.custom-scala-m2/repo/ 
-Dfile=//Users/pat/mahout/math-scala/target/mahout-math-scala_2.11-0.13.0.jar 
-DgroupId=org.apache.mahout -DartifactId=mahout-math-scala_2.11 -Dversion=0.13.0
mvn deploy:deploy-file -Durl=file:///Users/pat/.custom-scala-m2/repo/ 
-Dfile=//Users/pat/mahout/spark/target/mahout-spark_2.11-0.13.0.jar 
-DgroupId=org.apache.mahout -DartifactId=mahout-spark_2.11 -Dversion=0.13.0
}

then in the build.sbt I get the artifacts using:

{
resolvers += "Local Repository" at "file:///Users/pat/.custom-scala-m2/repo"
}


was (Author: pferrel):
my script for creating a repo:

{{mvn deploy:deploy-file -Durl=file:///Users/pat/.custom-scala-m2/repo/ 
-Dfile=//Users/pat/mahout/hdfs/target/mahout-hdfs-0.13.0.jar 
-DgroupId=org.apache.mahout -DartifactId=mahout-hdfs -Dversion=0.13.0
mvn deploy:deploy-file -Durl=file:///Users/pat/.custom-scala-m2/repo/ 
-Dfile=//Users/pat/mahout/math/target/mahout-math-0.13.0.jar 
-DgroupId=org.apache.mahout -DartifactId=mahout-math -Dversion=0.13.0
mvn deploy:deploy-file -Durl=file:///Users/pat/.custom-scala-m2/repo/ 
-Dfile=//Users/pat/mahout/math-scala/target/mahout-math-scala_2.11-0.13.0.jar 
-DgroupId=org.apache.mahout -DartifactId=mahout-math-scala_2.11 -Dversion=0.13.0
mvn deploy:deploy-file -Durl=file:///Users/pat/.custom-scala-m2/repo/ 
-Dfile=//Users/pat/mahout/spark/target/mahout-spark_2.11-0.13.0.jar 
-DgroupId=org.apache.mahout -DartifactId=mahout-spark_2.11 -Dversion=0.13.0
}}

then in the build.sbt I get the artifacts using:

{{resolvers += "Local Repository" at "file:///Users/pat/.custom-scala-m2/repo"
}}

> Maven repo structure compatibility with SBT
> ---
>
> Key: MAHOUT-2020
> URL: https://issues.apache.org/jira/browse/MAHOUT-2020
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: Creating a project from maven built Mahout using sbt. 
> Made critical since it seems to block using Mahout with sbt. At least I have 
> found no way to do it.
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The maven repo should build:
> org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
> substitute Spark version for -2.1, so -1.6 etc.
> The build.sbt  `libraryDependencies` line then will be:
> `"org.apache.mahout" %% "mahout-spark-2.1" % “0.13.1-SNAPSHOT`
> This is parsed by sbt to yield the path of :
> org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
> The outcome of `mvn clean install` currently is something like:
> org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar
> This has no effect on the package structure, only artifact naming and maven 
> repo structure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MAHOUT-2020) Maven repo structure compatibility with SBT

2017-10-26 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221258#comment-16221258
 ] 

Pat Ferrel commented on MAHOUT-2020:


The problem here is with the multi-artifact build. The last version I tried 
always builds scala 2.10 version(s) of modules and makes the 
{scala.compat.version} = 2.10 in the master POM of the mvn repo. This trickles 
down to the child POMs in the repo and therefore the math-scala_2.11 is not 
found because the POM is telling the resolver to look for the scala 2.10 jar. 

This is very subtle because the jars are there for scala 2.10 and 2.11 but 
their poms use the master pom's {scala.compat.version} which seems to always be 
set to 2.10.

The solution may be to not set the {scala.compat.version} in the master pom but 
set it in each scala module's pom with the correct scala version. Whatever the 
solution is must account for the possible use of more than one 
{scala.compat.version} that gets applied where the jar is stored in the maven 
directory structure.

I know this works because I have created by hand, using `mvn deploy`, a maven 
type repo that excludes the master pom and things are resolved correctly. This 
seems to mean the {scala.compat.version} is set somewhere outside of the master 
pom, not sure where.

> Maven repo structure compatibility with SBT
> ---
>
> Key: MAHOUT-2020
> URL: https://issues.apache.org/jira/browse/MAHOUT-2020
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: Creating a project from maven built Mahout using sbt. 
> Made critical since it seems to block using Mahout with sbt. At least I have 
> found no way to do it.
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Blocker
> Fix For: 0.13.1
>
>
> The maven repo should build:
> org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
> substitute Spark version for -2.1, so -1.6 etc.
> The build.sbt  `libraryDependencies` line then will be:
> `"org.apache.mahout" %% "mahout-spark-2.1" % “0.13.1-SNAPSHOT`
> This is parsed by sbt to yield the path of :
> org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
> The outcome of `mvn clean install` currently is something like:
> org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar
> This has no effect on the package structure, only artifact naming and maven 
> repo structure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Templates First

2017-10-20 Thread Pat Ferrel
PredictionIO is completely useless without a Template yet we seem as a group 
too focused on releasing PIO without regard for Templates. This IMO must 
change. 90% of users will never touch the code of a template and only 1% will 
actually create a template. These guesses come from list questions. If this is 
true we need to switch our mindset to "templates first” not “pio first”. Before 
any upgrade vote, every committer should make sure their favorite template 
works with the new build. I will be doing so from now on.

We have one fairly significant problem that I see from a template supporter's 
side. PIO has several new build directives that change dependencies like Spark 
version and tools like Scala version. These are unknown to templates and there 
is no PIO supported way to communicate these to the template's build.sbt. This 
leaves us with templates that will not work with most combinations of PIO 
builds. If we are lucky they may be updated to work with the *default* pio 
config. But this did not happen when PIO-0.12.0 was released, only shortly 
afterwards. This must change, the Apache templates at least must have some 
support for PIO before release and here is one idea that might help...

How do we solve this?

1) .make-distribution modifies or creates a script that can be imported by the 
template's build.sbt. This might be pio-env if we use `pio build` to build 
templates because it is available to the template’s build.sbt, or something 
else when we move to using sbt to build templates directly. This script defines 
values used to build PIO.
2) update some or all of the Apache templates to use this mechanism to build 
with the right scala version, etc. taken from the PIO build.

I had a user do this for the UR to support many different pio build directives, 
and some that are new. The result was a build.sbt that includes such things as 

val pioVersion = sys.env.getOrElse("PIO_VERSION", "0.12.0-incubating")
val scalaVersion = sys.env.getOrElse("PIO_SCALA_VERSION", "2.10.0")
val elasticsearch1Version = sys.env.getOrElse("PIO_ELASTIC_VERSION", "1.7.5")
val sparkVersion = sys.env.getOrElse("PIO_SPARK_VERSION", "1.4.0")

these are then used in the lib dependencies lists to pull in the right versions 
of artifacts.
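
For example (a sketch; the variable names are the ones shown above and the 
values are placeholders for whatever PIO was built with), a template using this 
mechanism would be built like:

    export PIO_SCALA_VERSION=2.11.11
    export PIO_SPARK_VERSION=2.1.1
    export PIO_ELASTIC_VERSION=5.5.2
    pio build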

This in some form would allow templates to move along in lock step with changes 
in the way pio is built on any given machine. Without something like this, 
users even less expert at sbt than myself (hard to imagine) will have a 
significant problem dumped on them.

Since this is only partially baked it may not be ready for a Jira and so 
warrants discussion.  


Re: installing environment (stops when compiling "compiler-interface" for Scala)

2017-10-18 Thread Pat Ferrel
Memory depends on your data and the engine you are using. Spark puts all data 
into memory across the Spark cluster so if that is one machine, 4g will not 
allow more than toy or example data. Remember that PIO and Machine Learning in 
general works best with big data. 

BTW my laptop has 16g and I can only process limited data in an all-in-one 
configuration. Unless you are connecting to external services like Spark, or 
HDFS, etc, 4g will not get you very far.
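
If you do have more memory, or an external Spark cluster, you can pass the sizes 
through to spark-submit when training (a sketch; the values are placeholders):

    pio train -- --driver-memory 4g --executor-memory 4g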

 
On Oct 18, 2017, at 10:43 AM, Donald Szeto  wrote:

I would usually use at least 4GB just to be safe. If you are not looking to 
customize PIO itself, you may download pre-built binaries (which is available 
starting 0.12.0-incubating) from 
https://www.apache.org/dyn/closer.cgi/incubator/predictionio/0.12.0-incubating/apache-predictionio-0.12.0-incubating-bin.tar.gz
 
.
 That way you don't have to spin up a big machine to build it if you're just 
trying it out.

We are in the process of updating our web site to include installation using 
binaries. Apologies for the delay.

On Wed, Oct 18, 2017 at 10:38 AM, Luciano Andino > wrote:


2017-10-18 19:21 GMT+02:00 Donald Szeto >:
Looks like an out-of-memory issue here. How much memory does the build 
environment has?

A virtual server with 1GB in DigitalOcean. Would a system with 2GB of RAM be 
enough?



 

On Wed, Oct 18, 2017 at 10:08 AM, Luciano Andino > wrote:
Hello, this is my first post in email list. I am trying to install environment 
but I have some problem in "compiler-interface" for Scala.
I have KEYS file and source. Java version is "1.8.0_151".


luciano@localhost:~/predic$ gpg --import KEYS
gpg: key D3541808: "Suneel Marthi (CODE SIGNING KEY) >" not changed
gpg: key 8BF4ABEB: "Donald Szeto (CODE SIGNING KEY) >" not changed
gpg: key 4719A8F4: "Chan Lee >" 
not changed
gpg: Total number processed: 3
gpg:  unchanged: 3
luciano@baturite:~/predic$ 

luciano@localhost:~/predic$ gpg --verify 
apache-predictionio-0.12.0-incubating-bin.tar.gz.asc 
apache-predictionio-0.12.0-incubating.tar.gz 
gpg: Signature made Sun 17 Sep 2017 05:30:49 PM UTC using RSA key ID 4719A8F4
gpg: BAD signature from "Chan Lee >"
luciano@localhost:~/predic$ 

luciano@localhost:~/predic$ ./make-distribution.sh 
Building binary distribution for PredictionIO 0.12.0-incubating...
+ sbt/sbt clean
[warn] Executing in batch mode.
[warn]   For better performance, hit [ENTER] to switch to interactive mode, or
[warn]   consider launching sbt without any commands, or explicitly passing 
'shell'

[...]

[info] downloading 
https://repo1.maven.org/maven2/org/scala-lang/jline/2.10.6/jline-2.10.6.jar 
 
...
[info]  [SUCCESSFUL ] org.scala-lang#jline;2.10.6!jline.jar (54ms)
[info] downloading 
https://repo1.maven.org/maven2/org/fusesource/jansi/jansi/1.4/jansi-1.4.jar 
 
...
[info]  [SUCCESSFUL ] org.fusesource.jansi#jansi;1.4!jansi.jar (50ms)
[info] Done updating.
[info] Compiling 1 Scala source to 
/home/luciano/predic/project/target/scala-2.10/sbt-0.13/classes...
[info] 'compiler-interface' not yet compiled for Scala 2.10.6. Compiling...
./make-distribution.sh: line 70: 20560 Killed  sbt/sbt 
"${JAVA_PROPS[@]}" clean
luciano@localhost:~/predic$ 

what is missing?

thanks in advance)


-- 
Luciano Andino
Ing. en Sistemas de Información
UTN FRSF - BMSTU






-- 
Luciano Andino
Ing. en Sistemas de Información
UTN FRSF - BMSTU






Upgrading to PredictionIO 0.12.0

2017-10-18 Thread Pat Ferrel
PIO-0.12.0 by default, compiles and runs expecting ES5. If you are upgrading 
(not installing from clean) you will have an issue because ES1 indexes are not 
upgradable in any simple way. The simplest way to upgrade to pio-0.12.0 and ES5 
is to do `pio export` to backup BEFORE upgrading—so export with PIO-0.11.0. 
Then do a full clean upgrade wiping all data and `pio import` the data back 
into the app names you were using. This should restore indexes in ES5 and data 
in HBase or Postgres. This will change the keys you were using with PIO-0.11.0 
unless you explicitly set them back to the same values in `pio app new…`. You 
will also need to train to get models re-created.
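
A sketch of that sequence with the pio CLI; the app name, app id, and paths are 
placeholders, so check `pio app list` for your own ids:

    # with PIO 0.11.0 still installed
    pio export --appid 1 --output /tmp/pio-backup-myapp
    pio app data-delete MyApp

    # install PIO 0.12.0 and its services, then:
    pio app new MyApp
    pio import --appid 1 --input /tmp/pio-backup-myapp
    pio train   # run from the engine directory to re-create the model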

Upgrading is far more complicated with this release so think through what you 
are doing and be sure to backup your data.

Again, these issues do not impact clean installs. 


Re: PredictionIO Universal Recommender user rating

2017-10-09 Thread Pat Ferrel
Yes, this is a very important point. We have found that the % of video viewed 
is indeed a very important factor, but rather than sending some fraction to 
indicate the length viewed we have in the past taken the approach of determining 
the % that indicates the user liked the video.

This we do by triggering a “view-10”, “view-25”, “view-95” etc. for different 
viewing times. We found that for different content types there were different % 
of viewing that best predict what the user will like. We found that for 
“newsy” videos “view-10” was the best indicator. This makes sense because people 
often do not need all the details to understand a video's content. But for 
movies a “view-10” indicated a dislike. The user started a movie, hated it and 
stopped it. We used “view-95” as the best indicator.

1) You know your content, do you think you have multiple types of content like 
“newsy” and “stories/movies”? You may need different indicators of a user 
“like” corresponding to different % of watch based on the type
2) Gather the viewing experience as % and create categories like “view-10”, 
“view-25”, “view-95” etc. Ingest each event for any given user. Run 
cross-validation tests to see which gives the best results for each type of 
content you have. If you have only one type you will find the best % to gather.
3) The problem with simply sending in the % is that for one type of content 10% 
is a like (newsy) and for another type 10% alone is a dislike (long-form 
movies). This leads us to use the categorical method for defining indicators 
to give the best result instead of using the raw % of video watched, which may 
yield confusing or wrong results.

The extra step of testing the indicators in #2 can make a significant 
difference in performance. 
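
For reference, a sketch of what one of these categorical events looks like when 
sent to the standard PIO event server; the access key and ids are placeholders:

    # the player reported roughly 25% of the video watched
    curl -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
      -H "Content-Type: application/json" -d '{
      "event": "view-25",
      "entityType": "user", "entityId": "visitor-42",
      "targetEntityType": "item", "targetEntityId": "video-789"
    }'

Each of view-10/view-25/view-95 then becomes its own indicator in the UR's 
eventNames list, and the cross-validation in #2 decides which ones to keep.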

BTW if you are able to find an indicator of dislike, this may be useful  to 
predict likes: 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/ 



On Oct 9, 2017, at 10:23 AM, Daniel Tirdea  wrote:

Hi, 


I know there were a lot of question on this matter, I've looked everywhere but 
didn't find a good answer.

I'm using the Universal Recommender to make a recommendation system for a video 
sharing website.
I have a lot of details in terms of user behavior but the most important one (at 
least that's what I think now) is the amount of seconds consumed by a visitor: 
a ratio between the video length in seconds and the seconds the visitor 
actually has seen of it.

Let's say that a visitor reached a landing page with a video with total length 
of 60 seconds. If the user actually sees 60 seconds ( the video player reports 
that the video played the entire 60 seconds ) I think I can assume that the 
visitor gave an implicit score of 10 out of 10 for this video.

Is there a way I can include this value in the prediction system ? Or, order 
the returned items by this value?

Thanks for reading this, any thought will be greatly appreciated.


Thanks,
Dan



Re: Universal Recommender and PredictionIO 0.12.0 incompatibility

2017-10-06 Thread Pat Ferrel
Yes, easy enough to do but rather annoying. Rather than supporting 1.x and 5.x 
I think the UR 0.6.1 will be EOL for ES 1.x and UR 0.7.0 will be ES 5.x from 
then on. We have a major speedup in 0.7.0 and an even greater speedup in the 
next after that, which will start using Spark dataframes instead of Mahout 
Matrices and BiMaps. We are also starting to design a Kappa style version of 
the UR that will be based on Harness instead of PredictionIO. It won’t require 
a training phase or Spark at all if you don’t need to bootstrap from old data.


On Oct 6, 2017, at 3:19 PM, Mars Hall <mars.h...@salesforce.com> wrote:

Hi Pat,

On 4 October 2017 at 22:04, Pat Ferrel <p...@actionml.com 
<mailto:p...@actionml.com>> wrote:
It looks like PIO 0.12.0 will require a code change in the UR. PIO changed ES1 
support drastically when ES5 support was added and broke the UR code.

We will do a quick fix to the template to address this. In the meantime stay on 
PIO 0.11.0 if you need the UR.

We dealt with these breaking Elasticsearch 5.x client changes in our fork of 
the UR. I've tried to merge these changes back with the main UR source tree, 
but the ES5 support was not already present, so it was very difficult to create 
a pull request. 
Anyway, we definitely slayed some dragons with the help of Donald Szeto.

What we ended up with is a UR/EsClient that generates its own Elasticsearch 
RestClient using the Storage config, instead of instantiating PredictionIO's 
ESStorageClient. This solved a tangled mess of Apache HTTP dependency version 
conflicts. If you'd like to see what is working well for us (in production, 
under load), check out this [merged] PR to our fork:

https://github.com/heroku/predictionio-engine-ur/pull/7/files 
<https://github.com/heroku/predictionio-engine-ur/pull/7/files>

It would be wonderful to have your main UR working with this ES5 capability. 
Let me know if you have questions about our approach,

-- 
*Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California




Re: [ERROR] Timeout when read recent events

2017-10-06 Thread Pat Ferrel
When you query for all users in batch, the system is easily overloaded. This is 
the worst case query situation where no caching applies (for instance). 

1) run batch queries at low input load time, because you are competing with 
input for access to HBase
2) throttle your query speed and/or number of connections. Hopefully you are 
making multiple connections in parallel (see the sketch below).
3) scale up your HBase & Elasticsearch deployments, which tend to reach query 
performance limits before other services.
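
On point 2, here is a minimal sketch of a throttled batch client (using the 
PredictionIO Python SDK; the PredictionServer URL, worker count, delay, and the 
store_fn writer are placeholders to tune for your deployment):

import time
from concurrent.futures import ThreadPoolExecutor

import predictionio

# placeholder URL -- `pio deploy` listens on port 8000 by default
engine = predictionio.EngineClient(url="http://localhost:8000")

def recommend(user_id):
    time.sleep(0.05)  # crude throttle so HBase/Elasticsearch are not flooded
    return user_id, engine.send_query({"user": user_id, "num": 10})

def run_batch(user_ids, store_fn):
    # a small, bounded pool caps the number of parallel connections
    with ThreadPoolExecutor(max_workers=4) as pool:
        for user_id, recs in pool.map(recommend, user_ids):
            store_fn(user_id, recs)  # store_fn is your own writer, e.g. to a table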

On Oct 6, 2017, at 12:13 AM, Mattz  wrote:

Hello,

We are using UR with PIO for generating recommendations. We run a batch program 
that concurrently queries the PIO server, generates recommendations and stores 
in a table. Currently, we are seeing the below error several times in our logs. 
Any ideas on what needs to be tuned to fix this?

ERROR com.hifx.URAlgorithm [ForkJoinPool-4-worker-5] - Timeout when read recent 
events. Empty list is used. java.util.concurrent.TimeoutException: Futures 
timed out after [200 milliseconds]

Thank you!



[jira] [Created] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-10-05 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-2023:
--

 Summary: Drivers broken, scopt classes not found
 Key: MAHOUT-2023
 URL: https://issues.apache.org/jira/browse/MAHOUT-2023
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.13.1
 Environment: any
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Priority: Blocker
 Fix For: 0.13.1


Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
get a fatal exception due to missing scopt classes.

Probably a build issue related to incorrect versions of scopt being looked for.





Re: [ERROR] [TaskSetManager] Task 2.0 in stage 10.0 had a not serializable result

2017-10-04 Thread Pat Ferrel
What versions of Scala, Spark, PIO, and UR are you using?


On Oct 4, 2017, at 6:10 AM, Noelia Osés Fernández  wrote:

Hi all,

I'm still trying to create a very simple app to learn to use PredictionIO and 
still having trouble. I have done pio build no problem. But when I do pio train 
I get a very long error message related to serialisation (error message copied 
below).

pio status reports system is all ready to go.

The app I'm trying to build is very simple, it only has 'view' events. Here's 
the engine.json:

===
{
  "comment":" This config file uses default settings for all but the required 
values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "com.actionml.RecommendationEngine",
  "datasource": {
"params" : {
  "name": "tiny_app_data.csv",
  "appName": "TinyApp",
  "eventNames": ["view"]
}
  },
  "algorithms": [
{
  "comment": "simplest setup where all values are default, popularity based 
backfill, must add eventsNames",
  "name": "ur",
  "params": {
"appName": "TinyApp",
"indexName": "urindex",
"typeName": "items",
"comment": "must have data for the first event or the model will not 
build, other events are optional",
"eventNames": ["view"]
  }
}
  ]
}
===

The data I'm using is:

"u1","i1"
"u2","i1"
"u2","i2"
"u3","i2"
"u3","i3"
"u4","i4"

meaning user u viewed item i.

The data has been added to the database with the following python code:

===
"""
Import sample data for recommendation engine
"""

import predictionio
import argparse
import random

RATE_ACTIONS_DELIMITER = ","
SEED = 1


def import_events(client, file):
  random.seed(SEED)
  count = 0
  print "Importing data..."

  items = []
  users = []
  f = open(file, 'r')
  for line in f:
data = line.rstrip('\r\n').split(RATE_ACTIONS_DELIMITER)
users.append(data[0])
items.append(data[1])
client.create_event(
  event="view",
  entity_type="user",
  entity_id=data[0],
  target_entity_type="item",
  target_entity_id=data[1]
)
print "Event: " + "view" + " entity_id: " + data[0] + " target_entity_id: " 
+ data[1]
count += 1
  f.close()

  users = set(users)
  items = set(items)
  print "All users: " + str(users)
  print "All items: " + str(items)
  for item in items:
client.create_event(
  event="$set",
  entity_type="item",
  entity_id=item
)
count += 1


  print "%s events are imported." % count


if __name__ == '__main__':
  parser = argparse.ArgumentParser(
description="Import sample data for recommendation engine")
  parser.add_argument('--access_key', default='invald_access_key')
  parser.add_argument('--url', default="http://localhost:7070 
")
  parser.add_argument('--file', default="./data/tiny_app_data.csv")

  args = parser.parse_args()
  print args

  client = predictionio.EventClient(
access_key=args.access_key,
url=args.url,
threads=5,
qsize=500)
  import_events(client, args.file)
===

My pio_env.sh is the following:

===
#!/usr/bin/env bash
#
# Copy this file as pio-env.sh and edit it for your site's configuration.
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#http://www.apache.org/licenses/LICENSE-2.0 

#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# PredictionIO Main Configuration
#
# This section controls core behavior of PredictionIO. It is very likely that
# you need to change these to fit your site.

# SPARK_HOME: Apache Spark is a hard dependency and must be configured.
# SPARK_HOME=$PIO_HOME/vendors/spark-2.0.2-bin-hadoop2.7
SPARK_HOME=$PIO_HOME/vendors/spark-1.6.3-bin-hadoop2.6

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.1.4.jar
MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

# ES_CONF_DIR: You must configure this if you have advanced configuration for
#  your Elasticsearch setup.
# ES_CONF_DIR=/opt/elasticsearch

Re: Running Mahout on a Spark cluster

2017-10-03 Thread Pat Ferrel
Thanks Trevor,

This encoding leaves the Scala version hard-coded. But this is an appreciated 
clue and will get me going. There may be a way to use the %% with this or just 
explicitly add the scala version string.

@Hoa, I plan to update that repo.


On Oct 3, 2017, at 1:26 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

The spark is included via maven classifier-

the sbt line should be

libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" %
"0.13.1-SNAPSHOT" classifier "spark_2.1"


On Tue, Oct 3, 2017 at 2:55 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I’m the aforementioned pferrel
> 
> @Hoa, thanks for that reference, I forgot I had that example. First don’t
> use the Hadoop part of Mahout, it is not supported and will be deprecated.
> The Spark version of cooccurrence will be supported. You find it in the
> SimilarityAnalysis object.
> 
> If you go back to the last release you should be able to make that
> https://github.com/pferrel/3-input-cooc <https://github.com/pferrel/3-
> input-cooc> work with version updates to Mahout-0.13.0 and dependencies.
> To use the latest master of Mahout, there are the problems listed below.
> 
> 
> I’m having a hard time building with sbt using the mahout-spark module
> when I build that latest mahout master with `mvn clean install`. This puts
> the mahout-spark module in the local ~/.m2 maven cache. The structure
> doesn’t match what SBT expects the path and filenames to be.
> 
> The build.sbt  `libraryDependencies` line *should* IMO be:
> `"org.apache.mahout" %% "mahout-spark-2.1" % “0.13.1-SNAPSHOT`
> 
> This is parsed by sbt to yield the path of :
> org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/
> mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
> 
> unfortunately the outcome of `mvn clean install` currently is (I think):
> org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-
> spark-0.13.1-SNAPSHOT-spark_2.1.jar
> 
> I can’t find a way to make SBT parse that structure and name.
> 
> 
> On Oct 2, 2017, at 11:02 PM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
> 
> Code pointer:
> https://github.com/rawkintrevo/cylons/tree/master/eigenfaces
> 
> However, I build Mahout (0.13.1-SNAPSHOT) locally with
> 
> mvn clean install -Pscala-2.11,spark-2.1,viennacl-omp -DskipTests
> 
> That's how maven was able to pick those up.
> 
> 
> On Fri, Sep 22, 2017 at 10:06 PM, Hoa Nguyen <h...@insightdatascience.com>
> wrote:
> 
>> Hey all,
>> 
>> Thanks for the offers of help. I've been able to narrow down some of the
>> problems to version incompatibility and I just wanted to give an update.
>> Just to back track a bit, my initial goal was to run Mahout on a
>> distributed cluster whether that was running Hadoop Map Reduce or Spark.
>> 
>> I started out trying to get it to run on Spark, which I have some
>> familiarity, but that didn't seem to work. While the error messages seem
> to
>> indicate there weren't enough resources on the workers ("WARN
>> scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
>> check your cluster UI to ensure that workers are registered and have
>> sufficient memory"), I'm pretty sure that wasn't the case, not only
> because
>> it's a 4 node cluster of m4.xlarges, I was able to run another, simpler
>> Spark batch job on that same distributed cluster.
>> 
>> After a bit of wrangling, I was able to narrow down some of the issues.
> It
>> turns out I was kind of blindly using this repo https://github.com/
>> pferrel/3-input-cooc as a guide without fully realizing that it was from
>> several years ago and based on Mahout 0.10.0, Scala 2.10 and Spark 1.1.1
>> That is significantly different from my environment, which has Mahout
>> 0.13.0 and Spark 2.1.1 installed, which also means I have to use Scala
>> 2.11. After modifying the build.sbt file to account for those versions, I
>> now have compile type mismatch issues that I'm just not that savvy to fix
>> (see attached screenshot if you're interested).
>> 
>> Anyway, the good news is that I was able to finally get Mahout code running
>> on Hadoop map-reduce, but also after a bit of wrangling. It turned out my
>> instances were running Ubuntu 14 and apparently that doesn't play well
> with
>> Hadoop 2.7.4, which prevented me from running any sample Mahout code
> (from
>> here: https://github.com/apache/mahout/tree/master/examples/bin) that
>> relied on map-reduce. Those problems went away after I installed Hadoop
>> 2.8.1 instead. Now I'm able to get the shell scripts running on a
>> distributed Hadoop cluster 

[jira] [Commented] (MAHOUT-2020) Maven repo structure compatibility with SBT

2017-10-03 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190340#comment-16190340
 ] 

Pat Ferrel commented on MAHOUT-2020:


This may be a non-issue. Trevor said in email:

{quote}The spark is included via maven classifier-

the sbt line should be

libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" %
"0.13.1-SNAPSHOT" classifier "spark_2.1"
{quote}



> Maven repo structure compatibility with SBT
> ---
>
> Key: MAHOUT-2020
> URL: https://issues.apache.org/jira/browse/MAHOUT-2020
> Project: Mahout
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.13.1
> Environment: Creating a project from maven built Mahout using sbt. 
> Made critical since it seems to block using Mahout with sbt. At least I have 
> found no way to do it.
>Reporter: Pat Ferrel
>Assignee: Trevor Grant
>Priority: Critical
> Fix For: 0.13.1
>
>
> The maven repo should build:
> org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
> substitute Spark version for -2.1, so -1.6 etc.
> The build.sbt  `libraryDependencies` line then will be:
> `"org.apache.mahout" %% "mahout-spark-2.1" % “0.13.1-SNAPSHOT`
> This is parsed by sbt to yield the path of :
> org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
> The outcome of `mvn clean install` currently is something like:
> org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar
> This has no effect on the package structure, only artifact naming and maven 
> repo structure.





[jira] [Commented] (MAHOUT-2019) SparseRowMatrix assign ops use for loops instead of iterateNonZero and so can be optimized

2017-10-03 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190338#comment-16190338
 ] 

Pat Ferrel commented on MAHOUT-2019:


This may be a non-issue: 

Trevor said in email:

{quote}The spark is included via maven classifier-

the sbt line should be

libraryDependencies += "org.apache.mahout" % "mahout-spark_2.11" %
"0.13.1-SNAPSHOT" classifier "spark_2.1"


{quote}

> SparseRowMatrix assign ops use for loops instead of iterateNonZero and so 
> can be optimized
> ---
>
> Key: MAHOUT-2019
> URL: https://issues.apache.org/jira/browse/MAHOUT-2019
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.13.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 0.13.1
>
>
> DRMs get blockified into SparseRowMatrix instances if the density is low. But 
> SRM inherits the implementation of methods like "assign" from AbstractMatrix, 
> which uses nested for loops to traverse rows. For multiplying 2 matrices that 
> are extremely sparse, the kind of data you see in collaborative filtering, 
> this is extremely wasteful of execution time. Better to use a sparse vector's 
> iterateNonZero Iterator for some function types.





[jira] [Updated] (MAHOUT-2019) SparseRowMatrix assign ops use for loops instead of iterateNonZero and so can be optimized

2017-10-03 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-2019:
---
Priority: Major  (was: Minor)

> SparseRowMatrix assign ops use for loops instead of iterateNonZero and so 
> can be optimized
> ---
>
> Key: MAHOUT-2019
> URL: https://issues.apache.org/jira/browse/MAHOUT-2019
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.13.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 0.13.1
>
>
> DRMs get blockified into SparseRowMatrix instances if the density is low. But 
> SRM inherits the implementation of methods like "assign" from AbstractMatrix, 
> which uses nested for loops to traverse rows. For multiplying 2 matrices that 
> are extremely sparse, the kind of data you see in collaborative filtering, 
> this is extremely wasteful of execution time. Better to use a sparse vector's 
> iterateNonZero Iterator for some function types.





[jira] [Updated] (MAHOUT-2019) SparseRowMatrix assign ops use for loops instead of iterateNonZero and so can be optimized

2017-10-03 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-2019:
---
Priority: Minor  (was: Major)

> SparseRowMatrix assign ops use for loops instead of iterateNonZero and so 
> can be optimized
> ---
>
> Key: MAHOUT-2019
> URL: https://issues.apache.org/jira/browse/MAHOUT-2019
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.13.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Minor
> Fix For: 0.13.1
>
>
> DRMs get blockified into SparseRowMatrix instances if the density is low. But 
> SRM inherits the implementation of methods like "assign" from AbstractMatrix, 
> which uses nested for loops to traverse rows. For multiplying 2 matrices that 
> are extremely sparse, the kind of data you see in collaborative filtering, 
> this is extremely wasteful of execution time. Better to use a sparse vector's 
> iterateNonZero Iterator for some function types.





[jira] [Created] (MAHOUT-2020) Maven repo structure compatibility with SBT

2017-10-03 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-2020:
--

 Summary: Maven repo structure compatibility with SBT
 Key: MAHOUT-2020
 URL: https://issues.apache.org/jira/browse/MAHOUT-2020
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.13.1
 Environment: Creating a project from maven built Mahout using sbt. 
Made critical since it seems to block using Mahout with sbt. At least I have 
found no way to do it.
Reporter: Pat Ferrel
Assignee: Trevor Grant
Priority: Critical
 Fix For: 0.13.1


The maven repo should build:
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

substitute Spark version for -2.1, so -1.6 etc.

The build.sbt  `libraryDependencies` line then will be:
`"org.apache.mahout" %% "mahout-spark-2.1" % “0.13.1-SNAPSHOT`

This is parsed by sbt to yield the path of :
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

The outcome of `mvn clean install` currently is something like:
org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar

This has no effect on the package structure, only artifact naming and maven 
repo structure.





Re: Running Mahout on a Spark cluster

2017-10-03 Thread Pat Ferrel
Actually if you require scala 2.11 and spark 2.1 you have to use the current 
master (0.13.0 does not support these) and also can’t use sbt, unless you have 
some trick I haven’t discovered.


On Oct 3, 2017, at 12:55 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

I’m the aforementioned pferrel

@Hoa, thanks for that reference, I forgot I had that example. First don’t use 
the Hadoop part of Mahout, it is not supported and will be deprecated. The 
Spark version of cooccurrence will be supported. You find it in the 
SimilarityAnalysis object.

If you go back to the last release you should be able to make that 
https://github.com/pferrel/3-input-cooc 
<https://github.com/pferrel/3-input-cooc> work with version updates to 
Mahout-0.13.0 and dependencies. To use the latest master of Mahout, there are 
the problems listed below.


I’m having a hard time building with sbt using the mahout-spark module when I 
build that latest mahout master with `mvn clean install`. This puts the 
mahout-spark module in the local ~/.m2 maven cache. The structure doesn’t match 
what SBT expects the path and filenames to be.

The build.sbt  `libraryDependencies` line *should* IMO be:
`"org.apache.mahout" %% "mahout-spark-2.1" % “0.13.1-SNAPSHOT`

This is parsed by sbt to yield the path of :
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

unfortunately the outcome of `mvn clean install` currently is (I think):
org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar

I can’t find a way to make SBT parse that structure and name.


On Oct 2, 2017, at 11:02 PM, Trevor Grant <trevor.d.gr...@gmail.com> wrote:

Code pointer:
https://github.com/rawkintrevo/cylons/tree/master/eigenfaces

However, I build Mahout (0.13.1-SNAPSHOT) locally with

mvn clean install -Pscala-2.11,spark-2.1,viennacl-omp -DskipTests

That's how maven was able to pick those up.


On Fri, Sep 22, 2017 at 10:06 PM, Hoa Nguyen <h...@insightdatascience.com>
wrote:

> Hey all,
> 
> Thanks for the offers of help. I've been able to narrow down some of the
> problems to version incompatibility and I just wanted to give an update.
> Just to back track a bit, my initial goal was to run Mahout on a
> distributed cluster whether that was running Hadoop Map Reduce or Spark.
> 
> I started out trying to get it to run on Spark, which I have some
> familiarity, but that didn't seem to work. While the error messages seem to
> indicate there weren't enough resources on the workers ("WARN
> scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
> check your cluster UI to ensure that workers are registered and have
> sufficient memory"), I'm pretty sure that wasn't the case, not only because
> it's a 4 node cluster of m4.xlarges, I was able to run another, simpler
> Spark batch job on that same distributed cluster.
> 
> After a bit of wrangling, I was able to narrow down some of the issues. It
> turns out I was kind of blindly using this repo https://github.com/
> pferrel/3-input-cooc as a guide without fully realizing that it was from
> several years ago and based on Mahout 0.10.0, Scala 2.10 and Spark 1.1.1
> That is significantly different from my environment, which has Mahout
> 0.13.0 and Spark 2.1.1 installed, which also means I have to use Scala
> 2.11. After modifying the build.sbt file to account for those versions, I
> now have compile type mismatch issues that I'm just not that savvy to fix
> (see attached screenshot if you're interested).
> 
> Anyway, the good news is that I was able to finally get Mahout code running
> on Hadoop map-reduce, but also after a bit of wrangling. It turned out my
> instances were running Ubuntu 14 and apparently that doesn't play well with
> Hadoop 2.7.4, which prevented me from running any sample Mahout code (from
> here: https://github.com/apache/mahout/tree/master/examples/bin) that
> relied on map-reduce. Those problems went away after I installed Hadoop
> 2.8.1 instead. Now I'm able to get the shell scripts running on a
> distributed Hadoop cluster (yay!).
> 
> Anyway, if anyone has more recent and working Spark Scala code that uses
> Mahout that they can point me to, I'd appreciate it.
> 
> Many thanks!
> Hoa
> 
> On Fri, Sep 22, 2017 at 1:09 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
> 
>> Hi Hoa,
>> 
>> A few things could be happening here, I haven't run across that specific
>> error.
>> 
>> 1) Spark 2.x - Mahout 0.13.0: Mahout 0.13.0 WILL run on Spark 2.x, however
>> you need to build from source (not the binaries).  You can do this by
>> downloading mahout source or cloning the repo and building with:
>> mvn clean install -Pspark-2.1,scala-2.11 -DskipTests
>> 
>> 2) Have you setup

Re: Running Mahout on a Spark cluster

2017-10-03 Thread Pat Ferrel
I’m the aforementioned pferrel

@Hoa, thanks for that reference, I forgot I had that example. First don’t use 
the Hadoop part of Mahout, it is not supported and will be deprecated. The 
Spark version of cooccurrence will be supported. You find it in the 
SimilarityAnalysis object.

If you go back to the last release you should be able to make that 
https://github.com/pferrel/3-input-cooc 
 work with version updates to 
Mahout-0.13.0 and dependencies. To use the latest master of Mahout, there are 
the problems listed below.


I’m having a hard time building with sbt using the mahout-spark module when I 
build that latest mahout master with `mvn clean install`. This puts the 
mahout-spark module in the local ~/.m2 maven cache. The structure doesn’t match 
what SBT expects the path and filenames to be.

The build.sbt  `libraryDependencies` line *should* IMO be:
`"org.apache.mahout" %% "mahout-spark-2.1" % “0.13.1-SNAPSHOT`

This is parsed by sbt to yield the path of :
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

unfortunately the outcome of `mvn clean install` currently is (I think):
org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar

I can’t find a way to make SBT parse that structure and name.


On Oct 2, 2017, at 11:02 PM, Trevor Grant  wrote:

Code pointer:
https://github.com/rawkintrevo/cylons/tree/master/eigenfaces

However, I build Mahout (0.13.1-SNAPSHOT) locally with

mvn clean install -Pscala-2.11,spark-2.1,viennacl-omp -DskipTests

That's how maven was able to pick those up.


On Fri, Sep 22, 2017 at 10:06 PM, Hoa Nguyen 
wrote:

> Hey all,
> 
> Thanks for the offers of help. I've been able to narrow down some of the
> problems to version incompatibility and I just wanted to give an update.
> Just to back track a bit, my initial goal was to run Mahout on a
> distributed cluster whether that was running Hadoop Map Reduce or Spark.
> 
> I started out trying to get it to run on Spark, which I have some
> familiarity, but that didn't seem to work. While the error messages seem to
> indicate there weren't enough resources on the workers ("WARN
> scheduler.TaskSchedulerImpl: Initial job has not accepted any resources;
> check your cluster UI to ensure that workers are registered and have
> sufficient memory"), I'm pretty sure that wasn't the case, not only because
> it's a 4 node cluster of m4.xlarges, I was able to run another, simpler
> Spark batch job on that same distributed cluster.
> 
> After a bit of wrangling, I was able to narrow down some of the issues. It
> turns out I was kind of blindly using this repo https://github.com/
> pferrel/3-input-cooc as a guide without fully realizing that it was from
> several years ago and based on Mahout 0.10.0, Scala 2.10 and Spark 1.1.1
> That is significantly different from my environment, which has Mahout
> 0.13.0 and Spark 2.1.1 installed, which also means I have to use Scala
> 2.11. After modifying the build.sbt file to account for those versions, I
> now have compile type mismatch issues that I'm just not that savvy to fix
> (see attached screenshot if you're interested).
> 
> Anyway, the good news is that I was able to finally get Mahout code running
> on Hadoop map-reduce, but also after a bit of wrangling. It turned out my
> instances were running Ubuntu 14 and apparently that doesn't play well with
> Hadoop 2.7.4, which prevented me from running any sample Mahout code (from
> here: https://github.com/apache/mahout/tree/master/examples/bin) that
> relied on map-reduce. Those problems went away after I installed Hadoop
> 2.8.1 instead. Now I'm able to get the shell scripts running on a
> distributed Hadoop cluster (yay!).
> 
> Anyway, if anyone has more recent and working Spark Scala code that uses
> Mahout that they can point me to, I'd appreciate it.
> 
> Many thanks!
> Hoa
> 
> On Fri, Sep 22, 2017 at 1:09 AM, Trevor Grant 
> wrote:
> 
>> Hi Hoa,
>> 
>> A few things could be happening here, I haven't run across that specific
>> error.
>> 
>> 1) Spark 2.x - Mahout 0.13.0: Mahout 0.13.0 WILL run on Spark 2.x, however
>> you need to build from source (not the binaries).  You can do this by
>> downloading mahout source or cloning the repo and building with:
>> mvn clean install -Pspark-2.1,scala-2.11 -DskipTests
>> 
>> 2) Have you setup spark with Kryo serialization? How you do this depends
>> on
>> if you're in the shell/zeppelin or using spark submit.
>> 
>> However, for both of these cases- it shouldn't have even run local afaik
>> so
>> the fact it did tells me you probably have gotten this far?
>> 
>> Assuming you've done 1 and 2, can you share some code? I'll see if I can
>> recreate on my end.
>> 
>> Thanks!
>> 
>> tg
>> 
>> On Thu, Sep 21, 2017 at 9:37 PM, Hoa Nguyen 
>> wrote:
>> 
>>> I apologize in advance if 

[jira] [Created] (MAHOUT-2019) SparseRowMatrix assign ops use for loops instead of iterateNonZero and so can be optimized

2017-10-02 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-2019:
--

 Summary: SparseRowMatrix assign ops use for loops instead of 
iterateNonZero and so can be optimized
 Key: MAHOUT-2019
 URL: https://issues.apache.org/jira/browse/MAHOUT-2019
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.13.0
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 0.13.1


DRMs get blockified into SparseRowMatrix instances if the density is low. But 
SRM inherits the implementation of methods like "assign" from AbstractMatrix, 
which uses nested for loops to traverse rows. For multiplying 2 matrices that are 
extremely sparse, the kind of data you see in collaborative filtering, this is 
extremely wasteful of execution time. Better to use a sparse vector's 
iterateNonZero Iterator for some function types.
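
To illustrate the difference outside of Mahout (a rough Python/scipy sketch 
only — scipy.sparse stands in for SparseRowMatrix here, and the shortcut is 
only valid for element-wise functions f with f(0) == 0):

from scipy import sparse

# a very sparse matrix, the shape of data you see in collaborative filtering
m = sparse.random(2000, 2000, density=0.001, format="csr", random_state=1)

def f(x):
    return x * x  # an element-wise function with f(0) == 0

def assign_dense(mat):
    # nested-for-loop style: visits all 4,000,000 cells
    d = mat.toarray()
    for i in range(d.shape[0]):
        for j in range(d.shape[1]):
            d[i, j] = f(d[i, j])
    return d

def assign_sparse(mat):
    # iterateNonZero style: visits only the ~4,000 stored entries
    out = mat.copy()
    out.data = f(out.data)
    return out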





Re: [DISCUSS] Proposed resolution to graduate the PredictionIO podling

2017-09-29 Thread Pat Ferrel
oh, nm I found it. Pasted below, there were no dissenters to Donald’s detailed 
assessment.


On Sep 29, 2017, at 3:27 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Actually we did go over the maturity checklist ourselves. Donald, maybe you can 
forward the thread here.

 pasted from the thread on 
d...@predictionio.incubator.apache.org 
<mailto:d...@predictionio.incubator.apache.org> ==



Begin forwarded message:

From: Donald Szeto <don...@apache.org>
Subject: Re: Graduation to TLP
Date: September 5, 2017 at 10:32:08 AM PDT
To: d...@predictionio.incubator.apache.org
Reply-To: d...@predictionio.incubator.apache.org

Thanks for the clarification Pat! It always helps to have Apache veterans
provide historical context to these processes.

As for me, I'd like to remain as PMC and committer.

I like the idea of polling the current committers and PMC, but like you
said, most of them got pretty busy and may not be reading mailing list in a
while. Maybe let me try a shout out here and see if anyone would
acknowledge it, so that we know whether a poll will be effective.

*>> If you're a PMC or committer who see this line but hasn't been replying
this thread, please acknowledge. <<*

Regarding the maturity model, this is my perception right now:
- CD10, CD20, CD30, CD40 (and we start to have CD50 as well)
- LC10, LC20, LC30, LC40, LC50
- RE10, RE20, RE30, RE50 (I think we hope to also do RE40 with 0.12)
- QU10, QU30, QU40, QU50 (we should put a bit of focus to QU20)
- CO10, CO20, CO30, CO40, CO60, CO70 (for CO50, I think we've been
operating under the assumption that PMC and contributors are pretty
standard definitions by ASF. We can call those out explicitly.)
- CS10, CS50 (We are also assuming implicitly CS20, CS30, and CS40 from
main ASF doc)
- IN10, IN20

Let me know what you think.

On Fri, Sep 1, 2017 at 10:32 AM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:

> The Chair, PMC, and Committers may be different after graduation.
> PMC/committers are sometimes not active committers but can have a valuable
> role as mentors, in non-technical roles, as support people on the mailing
> list, or as sometimes committers who don’t seem very active but come in
> every so often to make a key contribution. So I hope this doesn’t become a
> time to prune too deeply. I’d suggest we only do that if one of the
> committers has done something to lessen our project maturity or wants to be
> left out for their own reasons. An example of bad behavior is someone
> trying to exert corporate dominance (which is severely frowned on by the
> ASF). Another would be someone who is disruptive to the point of destroying
> team effectiveness. I personally haven’t seen any of this but purposely
> don’t read everything so chime in here.
> 
> It would be good to have people declare their interest-level. As for me,
> I’d like to remain on the PMC as a committer but have no interest in Chair.
> Since people can become busy periodically and not read @dev (me?) we could,
> maybe should, poll the current committers and PMC to get the lists ready
> for the graduation proposal.
> 
> 
> Don’t forget that we are not just asking for dev community opinion about
> graduation. We are also asking that people check things like the Maturity
> Checklist to see it we are ready. http://community.apache.org/
> apache-way/apache-project-maturity-model.html <
> http://community.apache.org/apache-way/apache-project-maturity-model.html 
> <http://community.apache.org/apache-way/apache-project-maturity-model.html>>
> People seem fairly enthusiastic about applying for graduation, but are
> there things we need to do before hand? The goal is to show that we do not
> require the second level check for decisions that the IPMC provides. The
> last release required no changes but had a proviso about content licenses.
> This next release should fly through without provisos IMHO. Are there other
> things we should do?
> 
> 
> On Sep 1, 2017, at 6:16 AM, takako shimamoto <chiboch...@gmail.com> wrote:
> 
> I entirely agree with everyone else.
> I hope the PIO community will become more active after graduation.
> 
>> 2. If we are to graduate, who should we include in the list of the
> initial
>> PMC?
> 
> Don't all present IPMC members are included in the list of the initial PMC?
> 
> Personally, I think we may as well check and see if present IPMC
> members intend to become an initial PMC for graduation.
> Members who make a declaration of intent to become it will surely
> contribute to the project.
> It is a great contribution not only to develop a program but also to
> respond to email aggressively or fix document.
> 
> 
> 2017-08-29 14:20 GMT+09:00 Donald Szeto <don...@apache.

Re: [DISCUSS] Proposed resolution to graduate the PredictionIO podling

2017-09-29 Thread Pat Ferrel
Actually we did go over the maturity checklist ourselves. Donald, maybe you can 
forward the thread here.


On Sep 29, 2017, at 2:04 PM, Bertrand Delacretaz  
wrote:

Hi John,

On Fri, Sep 29, 2017 at 2:59 PM, John D. Ament  wrote:
> ...I wouldn't conflate lack of mentor engagement with a project's readiness to
> graduate, though I do agree with just hearing from them...

Yes, what I was asking for is a clear statement from the mentors about
graduation readiness, as I think they are the best placed to judge
that. Andrew has provided that now.

-Bertrand

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [RESULT][VOTE] Resolution to create a TLP from graduating Incubator podling

2017-09-29 Thread Pat Ferrel
Oops, I belatedly add my +1 (binding).


On Sep 28, 2017, at 10:43 PM, Donald Szeto <don...@apache.org> wrote:

The vote passes, with 9 +1 votes (8 binding) and no -1 votes.

+1 Andrew Purtell (binding)
+1 Chan Lee (binding)
+1 Donald Szeto (binding)
+1 Jianhe Liao (non-PPMC, non-binding)
+1 Mars Hall (binding)
+1 Naoki Takezoe (binding)
+1 Shinsuke Sugaya (binding)
+1 Suneel Marthi (binding)
+1 Takako Shimamoto (binding)

Thanks all for voting. We will proceed to post our resolution to the
general incubator list for discussion.

On Wed, Sep 27, 2017 at 9:54 PM, takako shimamoto <chiboch...@gmail.com>
wrote:

> +1 binding
> 
> Thanks, Donald! I really appreciate that.
> 
> 
> 2017-09-26 12:50 GMT+09:00 Donald Szeto <don...@apache.org>:
>> Hi all,
>> 
>> Based on previous discussions (
>> https://lists.apache.org/thread.html/2b4ef7c394584988cf0c99920824af
> aa60ee4c648d5c0069b1bf55c0@%3Cdev.predictionio.apache.org%3E
>> and
>> https://lists.apache.org/thread.html/1b06e510773ee1d315728e0ce25f22
> 0c9cf7d9e8ad601ec9dba4fe1d@%3Cdev.predictionio.apache.org%3E),
>> I would like to start a formal vote on graduating PredictionIO from an
>> Incubator podling to a top level project with the following resolution.
>> This thread will be forwarded to the Incubator general mailing list.
>> 
>> Once again, Salesforce has already signed and executed an assignment
>> agreement to assign the PredictionIO mark to ASF.
>> 
>> The graduation process we are following is described here:
>> http://incubator.apache.org/guides/graduation.html
>> 
>> Once this vote passes, a discussion will be started on Incubator general,
>> followed by a vote when a consensus there would be arrived. The vote will
>> run for at least 72 hours before closing at 9PM PST on 9/28/2017.
>> 
>> Thank you all! Let's graduate.
>> 
>> +1 (binding) from me.
>> 
>> Regards,
>> Donald
>> 
>> -
>> 
>>X. Establish the Apache PredictionIO Project
>> 
>>   WHEREAS, the Board of Directors deems it to be in the best
>>   interests of the Foundation and consistent with the
>>   Foundation's purpose to establish a Project Management
>>   Committee charged with the creation and maintenance of
>>   open-source software, for distribution at no charge to
>>   the public, related to a machine learning server built on top of
>>   state-of-the-art open source stack, that enables developers to
> manage
>>   and deploy production-ready predictive services for various kinds
> of
>>   machine learning tasks.
>> 
>>   NOW, THEREFORE, BE IT RESOLVED, that a Project Management
>>   Committee (PMC), to be known as the "Apache PredictionIO Project",
>>   be and hereby is established pursuant to Bylaws of the
>>   Foundation; and be it further
>> 
>>   RESOLVED, that the Apache PredictionIO Project be and hereby is
>>   responsible for the creation and maintenance of software
>>   related to a machine learning server built on top of
>>   state-of-the-art open source stack, that enables developers to
> manage
>>   and deploy production-ready predictive services for various kinds
> of
>>   machine learning tasks;
>>   and be it further
>> 
>>   RESOLVED, that the office of "Vice President, Apache
> PredictionIO" be
>>   and hereby is created, the person holding such office to
>>   serve at the direction of the Board of Directors as the chair
>>   of the Apache PredictionIO Project, and to have primary
>> responsibility
>>   for management of the projects within the scope of
>>   responsibility of the Apache PredictionIO Project; and be it
> further
>> 
>>   RESOLVED, that the persons listed immediately below be and
>>   hereby are appointed to serve as the initial members of the
>>   Apache PredictionIO Project:
>> 
>> * Alex Merritt <emergentor...@apache.org>
>> * Andrew Kyle Purtell <apurt...@apache.org>
>> * Chan Lee <chan...@apache.org>
>> * Donald Szeto <don...@apache.org>
>> * Felipe Oliveira <fel...@apache.org>
>> * James Taylor <jtay...@apache.org>
>> * Justin Yip <yipjus...@apache.org>
>> * Kenneth Chan <kenn...@apache.org>
>> * Lars Hofhansl <la...@apache.org>
>> * Lee Moon Soo <m...@apache.org>
>> * Luciano Resende <lrese...@apache.org>
>> * Marcin 

Re: Eventserver API in an Engine?

2017-09-23 Thread Pat Ferrel
 replies:

You will have to spread the pio “workflow” out over a permanent 
deploy+eventserver machine. I usually call this a combo PredictionServer and 
EventServer. These are 2 JVM processes that take events and respond to queries 
and so must be available all the time. You will run `pio eventserver` and `pio 
deploy` on this machine.

This is exactly what I'm talking about. Two processes on a single machine to 
run a complete deployment. Doesn't it make sense to allow these APIs to coexist 
in a single JVM?

Sure, in some cases you may want to scale out and tune two different JVMs for 
these two different use-cases, but for most of us, making it so the main 
runtime only requires a single process/JVM would make PredictionIO much more 
friendly to operate.

A few more comments inline below…


On Wed, Jul 12, 2017 at 7:43 PM, Kenneth Chan <kenn...@apache.org 
<mailto:kenn...@apache.org>> wrote:
Mars, I totally understand and agree we should make developers successful, but I 
would like to understand your problem more before jumping to a conclusion.

First, a complete PIO setup has the following:
1. PIO framework layer
2. PIO administration (e.g. PIO app)
3. PIO event server 
4. one or more PIO engines

The storage and setup config apply to 1 globally, and the rest (2, 3, 4) run on 
top of 1.

My understanding is that the buildpack would take engine code and then build, 
release, and deploy it, which can then serve queries.

When a Heroku user uses the buildpack:
- Where is the event server in the picture?

The eventserver is considered optional. If a Heroku user wants to use events 
API, then they must provision a second Heroku app for the eventserver:
  
https://github.com/heroku/predictionio-buildpack/blob/master/CUSTOM.md#user-content-eventserver
 
<https://github.com/heroku/predictionio-buildpack/blob/master/CUSTOM.md#user-content-eventserver>
 
- How does the user set up the storage config for 1?

With the Heroku buildpack, PostgreSQL is the default for all storage sources, 
and it is automatically configured.
 
- If I use the buildpack to deploy another engine, does it share 1 and 2 above?

No. Every engine is another Heroku app. Every eventserver is another Heroku 
app. These can be configured to intentionally share databases/storage, such as 
for a specific engine+eventserver pair.

 
On Wed, Jul 12, 2017 at 3:21 PM, Mars Hall <m...@heroku.com 
<mailto:m...@heroku.com>> wrote:
The key motivation behind this idea/request is to:

Simplify baseline PredictionIO deployment, both conceptually & technically.

My vision with this thread is to:

Enable single-process, single network-listener PredictionIO app deployment
(i.e. Queries & Events APIs in the same process.)


Attempting to address some previous questions & statements…


From Pat Ferrel on Tue, 11 Jul 2017 10:53:48 -0700 (PDT):
> how much of your problem is workflow vs installation vs bundling of APIs? Can 
> you explain it more?

I am focused on deploying PredictionIO on Heroku via this buildpack:
  https://github.com/heroku/predictionio-buildpack 
<https://github.com/heroku/predictionio-buildpack>

Heroku is an app-centric platform, where each app gets a single routable 
network port. By default apps get a URL like:
  https://tdx-classi.herokuapp.com <https://tdx-classi.herokuapp.com/> (an 
example PIO Classification engine)

Deploying a separate Eventserver app that must be configured to share storage 
config & backends leads to all kinds of complexity, especially when 
unsuspectingly a developer might want to deploy a new engine with a different 
storage config but not realize that Eventserver is not simply shareable. 
Despite a lot of docs & discussion suggesting its share-ability, there is 
precious little documentation that presents how the multi-backend Storage 
really works in PIO. (I didn't understand it until I read a bunch of Storage 
source code.)


From Kenneth Chan on Tue, 11 Jul 2017 12:49:58 -0700 (PDT):
> For example, one can modify the classification to train a classifier on the 
> same set of data used by recommendation.
…and later on Wed, 12 Jul 2017 13:44:01 -0700:
> My concern of embedding event server in engine is
> - what problem are we solving by providing an illusion that events are only 
> limited for one engine?

This is a great ideal target, but the reality is that it takes some significant 
design & engineering to reach that level of data share-ability. I'm not 
suggesting that we do anything to undercut the possibilities of such a 
distributed architecture. I suggest that we streamline PIO for everyone that is 
not at that level of distributed architecture. Make PIO not *require* it.

The best example I have is that you can run Spark in local mode, without 
worrying about any aspect of its ideal distributed purpose. (In fact 
PredictionIO is built on this feature of Spark!) I don't know the history 
there, but would imagine Spark was not alwa

Re: How to training and deploy on different machine?

2017-09-21 Thread Pat Ferrel
We do deployments and customize things for users. When we deploy PredictionIO 
we typically have one machine that is for only PIO permanent servers. It runs 
the PredictionServer (started with `pio deploy`) and the EventServer (started 
with `pio eventserver`). These services communicate with Elasticsearch and 
HBase. We usually have the DB (Hbase) and Elasticsearch on separate machines. 
They are under heavy load in production and during training so having them 
separate allows you to scale as needed. 

Spark is the oddball because it can be temporary. Here the minimum is 2 
machines, one for the Spark driver (launched with `pio train`) and at least one 
Spark executor machine with Spark installed but nothing else.

This means PIO is installed on the EventServer + PredictionServer machine and 
the Spark driver machine, so in 2 places. The other services can be put 
wherever you want.

The temporary machines are the Spark driver and Spark Executor(s). Since PIO is 
installed on the driver machine you will want to save its config by “stopping” 
instead of “deleting” the instance.



On Sep 20, 2017, at 8:30 PM, Brian Chiu <br...@snaptee.co> wrote:

Dear Pat,

Thanks for the detailed guide.  It is nice to know it is possible.
But I am not sure if I understand it correctly, so could you please
point out any misunderstanding in the following?  (If there is any)


Let's say I have 3 machines.

There is a machine [EventServer and data store] for ES, HBase+HDFS (or
Postgres, but not recommended)
The other 2 machines will both connect to this machine.
It is permanent.

machine [TrainingServer] will run `pio build` and `pio train`
This step pull training data from [EventServer] and then store model
and metadata back,
It is not permanent.

machine [PredictionServer] gets a copy of the template from machine
[TrainingServer] (only need to do this once)
Then run `pio deploy`
It is not a Spark driver or executor for training
Write a cron job of `pio deploy`
It is permanent.


Thanks

Brian

On Wed, Sep 20, 2017 at 11:16 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Yes, this is the recommended config (Postgres is not, but later). Spark is
> only needed during training but the `pio train` process creates drivers and
> executors in Spark. The driver will be the `pio train` machine so you must
> install pio on it. You should have 2 Spark machines at least because the
> driver and executor need roughly the same memory, more executors will train
> faster.
> 
> You will have to spread the pio “workflow” out over a permanent
> deploy+eventserver machine. I usually call this a combo PredictionServer and
> EventServe. These are 2 JVM processes the take events and respond to queries
> and so must be available all the time. You will run `pio eventserver` and
> `pio deploy` on this machine. the Spark driver machine will run `pio train`.
> Since no state is stored in PIO this will work because the machines get
> state from the DBs (HBase is recommended, and Elasticsearch). Install pio
> and the UR in the same location on all machines because the path to the UR
> is used by PIO to give an id to the engine (not ideal, but oh well).
> 
> Once setup:
> 
> Run `pio eventserver` on the permanent PS/ES machine and input your data
> into the EventServer.
> Run `pio build` on the “driver” machine and `pio train` on the same machine.
> This builds the UR, puts metadata about the instance in PIO and creates the
> Spark driver, which can use a separate machine or 3 as Spark executors.
> Then copy the UR directory to the PS/ES machine and do `pio deploy` from the
> copied directory.
> Shut down the driver machine and Spark executors. For AWS “stopping" them
> means config is saved so you only pay for EBS storage. You will start them
> before the next train.
> 
> 
> From then on there is no need to copy the UR directory, just spin up the
> driver and any other Spark machine, do `pio train` and you are done. The
> model is automatically hot-swapped with the old one with no downtime and no
> need to re-deploy.
> 
> This will only work in this order if you want to take advantage of a
> temporary Spark. PIO is installed on the PS/ES machine and the “driver”
> machine in exactly the same way connecting to the same stores.
> 
> Hmm, I should write a How to for this...
> 
> 
> 
> On Sep 20, 2017, at 3:23 AM, Brian Chiu <br...@snaptee.co> wrote:
> 
> Hi,
> 
> I would like to be able to train and run model on different machines.
> The reason is, on my dataset, training takes around 16GB of memory and
> deploying only needs 8GB.  In order to save money, it would be better
> if only a 8GB memory machine is used in production, and only start a
> 16GB one perhaps weekly for training.  Is it possible with
> predictionIO + universal recommender?
> 
> I have 

Re: How to training and deploy on different machine?

2017-09-20 Thread Pat Ferrel
Yes, this is the recommended config (Postgres is not, but later). Spark is only 
needed during training but the `pio train` process creates drivers and executors 
in Spark. The driver will be the `pio train` machine so you must install pio on 
it. You should have 2 Spark machines at least because the driver and executor 
need roughly the same memory, more executors will train faster.

You will have to spread the pio “workflow” out over a permanent 
deploy+eventserver machine. I usually call this a combo PredictionServer and 
EventServer. These are 2 JVM processes that take events and respond to queries 
and so must be available all the time. You will run `pio eventserver` and `pio 
deploy` on this machine. the Spark driver machine will run `pio train`. Since 
no state is stored in PIO this will work because the machines get state from 
the DBs (HBase is recommended, and Elasticsearch). Install pio and the UR in 
the same location on all machines because the path to the UR is used by PIO to 
give an id to the engine (not ideal, but oh well). 

Once setup:
Run `pio eventserver` on the permanent PS/ES machine and input your data into 
the EventServer.
Run `pio build` on the “driver” machine and `pio train` on the same machine. 
This builds the UR, puts metadata about the instance in PIO and creates the 
Spark driver, which can use a separate machine or 3 as Spark executors.
Then copy the UR directory to the PS/ES machine and do `pio deploy` from the 
copied directory.
Shut down the driver machine and Spark executors. For AWS “stopping" them means 
config is saved so you only pay for EBS storage. You will start them before the 
next train.

From then on there is no need to copy the UR directory, just spin up the driver 
and any other Spark machine, do `pio train` and you are done. The model is 
automatically hot-swapped with the old one with no downtime and no need to 
re-deploy.

This will only work in this order if you want to take advantage of a temporary 
Spark. PIO is installed on the PS/ES machine and the “driver” machine in 
exactly the same way connecting to the same stores.

Hmm, I should write a How to for this...
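
For reference, the whole cycle is only a handful of commands. The host name, 
memory sizes, and paths below are examples, not anything PIO requires, so adjust 
them to your cluster:

# on the permanent PredictionServer/EventServer machine
nohup pio eventserver &

# on the temporary "driver" machine, run from the UR engine directory
pio build
pio train -- --master spark://your-spark-master:7077 --driver-memory 16g --executor-memory 16g

# first time only: copy the UR directory to the same path on the PS/ES machine, then on that machine
pio deploy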



On Sep 20, 2017, at 3:23 AM, Brian Chiu  wrote:

Hi,

I would like to be able to train and run model on different machines.
The reason is, on my dataset, training takes around 16GB of memory and
deploying only needs 8GB.  In order to save money, it would be better
if only a 8GB memory machine is used in production, and only start a
16GB one perhaps weekly for training.  Is it possible with
predictionIO + universal recommender?

I have done some search and found a related guide here:
https://github.com/actionml/docs.actionml.com/blob/master/pio_load_balancing.md
It copies the whole template directory and then runs pio deploy.  But
in their case HBase and elasticsearch cluster are used.  In my case
only a single machine is used with elasticsearch and postgresql.  Will
this work?  (I am flexible about using postgresql or localfs or hbase,
but I cannot afford a cluster)

Perhaps another solution to make the 16GB machine as a spark slave,
start it before training start, and the 8GB machine will connect to
it. Then call pio train; pio deploy on the 8GB machine.  Finally
shutdown the 16GB machine.  But I have no idea if it can work.  And if
yes, is there any documentation I can look into?

Any other method is welcome!  Zero downtime is preferred but not necessary.

Thanks in advance.


Best Regards,
Brian



Re: Unable to connect to all storage backends successfully

2017-09-20 Thread Pat Ferrel
Meaning: is “firstcluster” the cluster name in your Elasticsearch configuration?
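
For example, with Elasticsearch 1.x the transport settings in pio-env.sh have to 
agree with the ES server itself. A minimal pair that matches the values shown in 
the dump below (host, port, and cluster name are taken from that dump) would look 
roughly like:

# pio-env.sh
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=firstcluster
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300

# elasticsearch.yml (in the ES config directory)
cluster.name: firstcluster

If elasticsearch.yml still has the default cluster.name (which is "elasticsearch"), 
the connection will fail exactly like this.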


On Sep 19, 2017, at 8:54 PM, Vaghawan Ojha  wrote:

I think the problem is with Elasticsearch; are you sure the cluster exists in 
the Elasticsearch configuration? 

On Wed, Sep 20, 2017 at 8:17 AM, Jim Miller wrote:
Hi,

I’m using PredictionIO 0.12.0-incubating with ElasticSearch and Hbase:
PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4
PredictionIO-0.12.0-incubating/vendors/hbase-1.0.0
PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6

All starts with no errors but with pio status I get:

[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.0-incubating is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at 
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/spark-1.5.1-bin-hadoop2.6
[INFO] [Management$] Apache Spark 1.5.1 detected (meets minimum requirement of 
1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[ERROR] [Management$] Unable to connect to all storage backends successfully.
The following shows the error message from the storage backend.

Connection closed 
(org.apache.predictionio.shaded.org.apache.http.ConnectionClosedException)

Dumping configuration of initialized storage backend sources.
Please make sure they are correct.

Source Name: ELASTICSEARCH; Type: elasticsearch; Configuration: HOME -> 
/home/vagrant/pio/PredictionIO-0.12.0-incubating/vendors/elasticsearch-1.4.4, 
HOSTS -> localhost, PORTS -> 9300, CLUSTERNAME -> firstcluster, TYPE -> 
elasticsearch

Can anyone give me an idea of what I need to fix this issue?  Here is my configuration:

# PredictionIO Main Configuration
#
# This section controls core behavior of PredictionIO. It is very likely that
# you need to change these to fit your site.

# SPARK_HOME: Apache Spark is a hard dependency and must be configured.
# SPARK_HOME=$PIO_HOME/vendors/spark-2.0.2-bin-hadoop2.7
SPARK_HOME=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar
MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

# ES_CONF_DIR: You must configure this if you have advanced configuration for
#  your Elasticsearch setup.
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch-1.4.4/conf

# HADOOP_CONF_DIR: You must configure this if you intend to run PredictionIO
#  with Hadoop 2.
HADOOP_CONF_DIR=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6/conf

# HBASE_CONF_DIR: You must configure this if you intend to run PredictionIO
# with HBase on a remote cluster.
HBASE_CONF_DIR=$PIO_HOME/vendors/hbase-1.0.0/conf

# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
#
# This section controls programs that make use of PredictionIO's built-in
# storage facilities. Default values are shown below.
#
# For more information on storage configuration please refer to
# http://predictionio.incubator.apache.org/system/anotherdatastore/ 


# Storage Repositories

# Default is to use PostgreSQL
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

# Storage Data Sources

# PostgreSQL Default Settings
# Please change "pio" to your database name in PIO_STORAGE_SOURCES_PGSQL_URL
# Please change PIO_STORAGE_SOURCES_PGSQL_USERNAME and
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD accordingly
# PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio
# PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

# MySQL Example
# PIO_STORAGE_SOURCES_MYSQL_TYPE=jdbc
# PIO_STORAGE_SOURCES_MYSQL_URL=jdbc:mysql://localhost/pio
# PIO_STORAGE_SOURCES_MYSQL_USERNAME=pio
# PIO_STORAGE_SOURCES_MYSQL_PASSWORD=pio

# Elasticsearch Example
# PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
# PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
# PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch-5.5.2
# Optional basic HTTP auth
# PIO_STORAGE_SOURCES_ELASTICSEARCH_USERNAME=my-name
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PASSWORD=my-secret
# Elasticsearch 1.x Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=firstcluster

[jira] [Closed] (PIO-32) create component upgrade releases

2017-09-19 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIO-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel closed PIO-32.
-
Resolution: Fixed

> create component upgrade releases
> -
>
> Key: PIO-32
> URL: https://issues.apache.org/jira/browse/PIO-32
> Project: PredictionIO
>  Issue Type: New Feature
>  Components: Core
>    Reporter: Pat Ferrel
>  Labels: gsoc2017
>
> Create a method for component upgrades that break binary compatibility like 
> Spark 2.x, Scala 2.11, and those that require source changes like 
> Elasticsearch 2.x
> If not 2 release branches then someone needs to propose an alternative. Maven 
> profiles would still require different versions of PIO source to be used for 
> ES 2.x--not sure about other upgrades. Profiles are fine for different 
> dependency libs. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [VOTE] Apache PredictionIO (incubating) 0.12.0 Release (RC2)

2017-09-14 Thread Pat Ferrel
The last release was hung up by the IPMC regarding content licensing issues and 
libraries used by the doc site, which we promised to address in this release. 
Have these been resolved, don’t recall the specifics? It would be great to fly 
through the IPMC vote without issue.


On Sep 14, 2017, at 2:06 PM, Chan Lee  wrote:

This is the vote for 0.12.0 of Apache PredictionIO (incubating).

The vote will run for at least 72 hours and will close on Sep 17th, 2017.

The release candidate artifacts can be downloaded here:
https://dist.apache.org/repos/dist/dev/incubator/predictionio/0.12.0-incubating-rc2

Test results of RC1 can be found here: https://travis-ci.org/apache/incubator-predictionio/builds/275634960

Maven artifacts are built from the release candidate artifacts above, and
are provided as convenience for testing with engine templates. The Maven
artifacts are provided at the Maven staging repo here:
https://repository.apache.org/content/repositories/orgapachepredictionio-1020

All JIRAs completed for this release are tagged with 'FixVersion =
0.12.0-incubating'. You can view them here: https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12340591=12320420

The artifacts have been signed with Key: ytX8GpWv

Please vote accordingly:

[ ] +1, accept RC as the official 0.12.0 release
[ ] -1, do not accept RC as the official 0.12.0 release because...



Re: Universal Recommender - search by subtext/Unicode

2017-09-13 Thread Pat Ferrel
1) There is nothing to specify; all input is expected to be encoded in UTF-8. 
What encoding are you using? UTF-8 is by far the standard that most tools 
expect.

2) Who is making the spelling mistake? The event sent to the Universal 
Recommender does not get spell-checked, and in many cases you will send SKU 
numbers instead of real product names. You don’t “look for items”; you query 
with a user-id to get user-based recs and an item-id for item-based recs. The 
spelling must match what was sent with input events, but that can be any string.
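
To make 1) concrete: a UTF-8 item-based query is nothing special. In this sketch 
the host, port 8000 (the usual `pio deploy` default), and the Korean item id are 
all made up; the only requirement is that the id is the exact string sent in the 
input events:

curl -H "Content-Type: application/json" -d '
{
  "item": "아이폰-7"
}' http://localhost:8000/queries.json

There is no substring or fuzzy matching; “ifone” and “iphone” are simply two 
different ids.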


On Sep 13, 2017, at 1:53 PM, Saarthak Chandra <chandra.saart...@gmail.com> 
wrote:

Hi,

1) Yes, of course. Use UTF-8 encoding. - where do I specify this??

2) I don’t understand this question. The UR is not a search engine, what kind 
of recommendation are you looking for? The best recommendation from a list of 
items? The best recommendation that contains some text in the “subtext”? Not 
sure what you mean by “subtext” - I mean if I look for an item like "ifone" 
instead of "iphone", i.e. make an error in the spelling, would it still work?

On Wed, Sep 13, 2017 at 1:37 PM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
1) Yes, of course. Use UTF-8 encoding.

2) I don’t understand this question. The UR is not a search engine, what kind 
of recommendation are you looking for? The best recommendation from a list of 
items? The best recommendation that contains some text in the “subtext”? Not 
sure what you mean by “subtext”

Your query below would not work but if you wish to supply a list of items and 
get similar items then use an Item-set query. To get complementary items like 
for shopping carts, you need to input and train differently but this is also a 
fully supported feature. 


On Sep 13, 2017, at 1:09 PM, Saarthak Chandra <chandra.saart...@gmail.com 
<mailto:chandra.saart...@gmail.com>> wrote:

Hi,

1> Can we use the Universal Recommender with Unicode characters? (e.g. user product names 
are in the Korean language)

2> Can we search the UR for products based on subtext, e.g.: 
curl -H "Content-Type: application/json" -d '
{
"item": ["**itemI",”***item2”]
}'

Thanks!
-- 
Saarthak Chandra,
Masters in Computer Science,
Cornell University.





-- 
Saarthak Chandra ,
Masters in Computer Science,
Cornell University.



Re: Universal Recommender : seasonality of product

2017-09-13 Thread Pat Ferrel
This is done with blacklisting. The default config blacklists all items in the 
training data that the user has taken the primary event on. So if your primary 
event is “buy” then once a user has bought a particular table they will not be 
recommended that table again until the “buy” event ages out of the data. If you 
maintain 1 year of data, then a year after the user bought the particular table 
they might get a recommendation for it. In other words the blacklist expires 
with the last event in the input data.
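
For reference, the knobs for this live in engine.json. A minimal sketch, assuming 
the UR's eventNames/blacklistEvents parameter names and a "buy" primary event 
(adjust to your own event names); this is just the relevant part of the 
algorithms section:

"algorithms": [{
  "name": "ur",
  "params": {
    "eventNames": ["buy", "view"],
    "blacklistEvents": ["buy"]
  }
}]

If I recall the UR config correctly, "blacklistEvents": [] turns the filtering 
off entirely; there is no built-in time window, so the effective expiry is 
however much event history you keep.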


On Sep 13, 2017, at 1:36 PM, Saarthak Chandra  
wrote:

Hi, 

Is there a way to add seasonality of products:
eg: If I buy a table now, I would not want to recommend tables again for the 
next year. 

So how could I include this seasonality into making recommendations?

Thanks.
-- 
Saarthak Chandra,
Masters in Computer Science,
Cornell University.




Re: Universal Recommender - search by subtext/Unicode

2017-09-13 Thread Pat Ferrel
1) Yes, of course. Use UTF-8 encoding.

2) I don’t understand this question. The UR is not a search engine, what kind 
of recommendation are you looking for? The best recommendation from a list of 
items? The best recommendation that contains some text in the “subtext”? Not 
sure what you mean by “subtext”

Your query below would not work but if you wish to supply a list of items and 
get similar items then use an Item-set query. To get complementary items like 
for shopping carts, you need to input and train differently but this is also a 
fully supported feature. 


On Sep 13, 2017, at 1:09 PM, Saarthak Chandra  
wrote:

Hi,

1> Can we use the Universal Recommender with Unicode characters? (e.g. user product names 
are in the Korean language)

2> Can we search the UR for products based on subtext, e.g.: 
curl -H "Content-Type: application/json" -d '
{
"item": ["**itemI",”***item2”]
}'

Thanks!
-- 
Saarthak Chandra,
Masters in Computer Science,
Cornell University.




Re: Graduation to TLP

2017-09-07 Thread Pat Ferrel
This has been an informal poll and it looks like people are ready. I suggest we 
push for graduation after the next release, which will be done by someone other 
than Donald (I think we have 2 volunteers?). I think this will be a requirement 
since it’s been mentioned by several IPMC members. 

I’d like to think several people could be our candidate VP but since most of 
them are too busy and since we have another great candidate in Donald, I’d like 
to nominate him for TLP Chair/VP.

I’d suggest we poll the committers and PMC members to see if any want out of 
the TLP, and otherwise go with the current list. We should try to add any 
committers that are ready before the graduation push; the more the better in the 
eyes of the IPMC.

We should put this in a proposal and get mentors feedback before applying since 
mentors are also IPMC members.

Andy has mentioned several choices for convention that we should discuss, like 
our choice of git flow for commit process. He mentioned rotating Chair, which 
seems better suited to a larger project IMO but please chime in if you like the 
idea.

If that is all clear we have to release, have a podling vote, then have the 
IPMC vote. If there is anything else regarding how we are run speak up now.


On Sep 7, 2017, at 5:01 AM, takako shimamoto <chiboch...@gmail.com> wrote:

I'd like to remain as committer and contribute my humble efforts to
the prosperity of the project.

> I propose we stay with the current PMC and committer list unless someone 
> wants to remove themselves.

It may be good. In fact, most committers carry out tasks with
limited time. Anyway, I hope the project will progress in a good
direction.



2017-09-06 3:43 GMT+09:00 Pat Ferrel <p...@occamsmachete.com>:
> I personally don’t see much benefit in removing people unless they prove the 
> exception. AFAIK this generally does not happen in ASF. I’m certainly not 
> aware of the process except that it is easier in moving from podling to TLP.  
> You prove some worthiness and once that’s done, it’s done. A poll might just 
> ask project members if they want to be removed. I have seen people ask to be 
> removed from PMC and also “go emeritus” and those are cases of the 
> individuals making the choice.
> 
> So to settle the role call issue I propose we stay with the current PMC and 
> committer list unless someone wants to remove themselves.
> 
> As to maturity I agree with Donald that the checklist is heavy in our favor.
> 
> 
> On Sep 5, 2017, at 11:16 AM, Simon Chan <si...@salesforce.com> wrote:
> 
> +1 for graduation
> 
> On Tue, Sep 5, 2017 at 10:32 AM, Donald Szeto <don...@apache.org> wrote:
> 
>> Thanks for the clarification, Pat! It always helps to have Apache veterans
>> provide historical context to these processes.
>> 
>> As for me, I'd like to remain as PMC and committer.
>> 
>> I like the idea of polling the current committers and PMC, but like you
>> said, most of them got pretty busy and may not be reading mailing list in a
>> while. Maybe let me try a shout out here and see if anyone would
>> acknowledge it, so that we know whether a poll will be effective.
>> 
>> *>> If you're a PMC or committer who see this line but hasn't been replying
>> this thread, please acknowledge. <<*
>> 
>> Regarding the maturity model, this is my perception right now:
>> - CD10, CD20, CD30, CD40 (and we start to have CD50 as well)
>> - LC10, LC20, LC30, LC40, LC50
>> - RE10, RE20, RE30, RE50 (I think we hope to also do RE40 with 0.12)
>> - QU10, QU30, QU40, QU50 (we should put a bit of focus to QU20)
>> - CO10, CO20, CO30, CO40, CO60, CO70 (for CO50, I think we've been
>> operating under the assumption that PMC and contributors are pretty
>> standard definitions by ASF. We can call those out explicitly.)
>> - CS10, CS50 (We are also assuming implicitly CS20, CS30, and CS40 from
>> main ASF doc)
>> - IN10, IN20
>> 
>> Let me know what you think.
>> 
>> On Fri, Sep 1, 2017 at 10:32 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> 
>>> The Chair, PMC, and Committers may be different after graduation.
>>> PMC/committers are sometimes not active committers but can have a
>> valuable
>>> role as mentors, in non-technical roles, as support people on the mailing
>>> list, or as sometimes committers who don’t seem very active but come in
>>> every so often to make a key contribution. So I hope this doesn’t become
>> a
>>> time to prune too deeply. I’d suggest we only do that if one of the
>>> committers has done something to lessen our project maturity or wants to
>> be
>>> left out for their own reasons. An example of bad behavior is someone
>>> trying to 

Re: Validate the built model

2017-09-06 Thread Pat Ferrel
We do cross-validation tests to see how well the model predicts actual 
behavior. As to the best data mix, cross-validation works with any engine 
tuning or data input. Typically this requires re-training between test runs, so 
make sure you use exactly the same training/test split. If you want to examine 
the usefulness of different events you can compare event type 1 to event type 1 
+ event type 2, etc. This is made easier by inputting all events, then using a 
test trick in the UR to mask out any combination of events for the 
cross-validation, using the single existing model so there is no need to re-train 
for this type of analysis. We have an unsupported script that does this, but I warn 
you that you are on your own using it. 

https://github.com/actionml/analysis-tools 
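
If you want to roll your own split instead, the shape of it is just export, 
split, import, train. The app ids, paths, and scoring metric here are examples 
only:

# dump everything the EventServer has for the app
pio export --appid 1 --output /tmp/all-events
# split the exported JSON by timestamp into train and test sets with your own script,
# create a fresh app, and load only the training portion into it
pio import --appid 2 --input /tmp/train-events
pio build && pio train
# then query the deployed engine for each test user and score the results
# against the held-out test events (MAP@k or precision@k are the usual choices)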



On Sep 6, 2017, at 6:15 AM, Saarthak Chandra  wrote:

Hi,

With the Universal Recommender,

1. How can we validate the model after we train and deploy it?

2. How can we find an appropriate method of data mixing ??

Thanks
-- 
Saarthak Chandra,
Masters in Computer Science,
Cornell University.




Re: Train a model without stopping

2017-09-06 Thread Pat Ferrel
The UR does this automatically. Once deployed you never have to deploy a second 
time. When a new `pio train` happens the new model is hot-swapped to replace 
the old, which is then erased, so there is no re-deploy and no downtime.

Yes, it uses Elasticsearch aliases but most other Templates do not use 
Elasticsearch for their model storage. However I believe that some could employ 
the same hot-swap method to re-deploy; they just weren’t written that way. 
You’d have to say which Template you are using.
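
For the UR that means periodic retraining can be as small as a cron entry on the 
training machine; the paths here are made up:

# retrain nightly; the new model is hot-swapped by the UR, no re-deploy needed
0 3 * * * cd /opt/universal-recommender && /opt/pio/bin/pio train >> /var/log/pio-train.log 2>&1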


On Sep 6, 2017, at 12:19 AM, Paul-Armand Verhaegen 
 wrote:

I believe there are 2 main methods:

1. Stop serving for a couple of seconds while deploying the newly trained model; 
this is supported by pio as is.
2. Make a more flexible solution that can route traffic differently or cache 
results. We have a reverse proxy (openresty / nginx + lua) in front, so that we 
can do both if business requires it. 

When working with UR: Another solution would be to utilise ES aliases

I'm pretty sure other people have thought of other solutions, but it mostly 
depends on the exact use case.
I hope that helps.

Paul

> On 6 Sep 2017, at 03:34, Saarthak Chandra  wrote:
> 
> Hi,
> 
> Is there a way we can train a model without having to stop serving.
> I mean, if I have an app deployed, can I add/post new data to the event 
> server and train the same app without stopping it?
> 
> 
> Thanks!
> -- 
> Saarthak Chandra,
> Masters in Computer Science,
> Cornell University.




Re: Graduation to TLP

2017-09-05 Thread Pat Ferrel
I personally don’t see much benefit in removing people unless they prove the 
exception. AFAIK this generally does not happen in ASF. I’m certainly not aware 
of the process except that it is easier in moving from podling to TLP.  You 
prove some worthiness and once that’s done, it’s done. A poll might just ask 
project members if they want to be removed. I have seen people ask to be 
removed from PMC and also “go emeritus” and those are cases of the individuals 
making the choice. 

So to settle the role call issue I propose we stay with the current PMC and 
committer list unless someone wants to remove themselves. 

As to maturity I agree with Donald that the checklist is heavy in our favor.


On Sep 5, 2017, at 11:16 AM, Simon Chan <si...@salesforce.com> wrote:

+1 for graduation

On Tue, Sep 5, 2017 at 10:32 AM, Donald Szeto <don...@apache.org> wrote:

> Thanks for the clarification, Pat! It always helps to have Apache veterans
> provide historical context to these processes.
> 
> As for me, I'd like to remain as PMC and committer.
> 
> I like the idea of polling the current committers and PMC, but like you
> said, most of them got pretty busy and may not be reading mailing list in a
> while. Maybe let me try a shout out here and see if anyone would
> acknowledge it, so that we know whether a poll will be effective.
> 
> *>> If you're a PMC or committer who see this line but hasn't been replying
> this thread, please acknowledge. <<*
> 
> Regarding the maturity model, this is my perception right now:
> - CD10, CD20, CD30, CD40 (and we start to have CD50 as well)
> - LC10, LC20, LC30, LC40, LC50
> - RE10, RE20, RE30, RE50 (I think we hope to also do RE40 with 0.12)
> - QU10, QU30, QU40, QU50 (we should put a bit of focus to QU20)
> - CO10, CO20, CO30, CO40, CO60, CO70 (for CO50, I think we've been
> operating under the assumption that PMC and contributors are pretty
> standard definitions by ASF. We can call those out explicitly.)
> - CS10, CS50 (We are also assuming implicitly CS20, CS30, and CS40 from
> main ASF doc)
> - IN10, IN20
> 
> Let me know what you think.
> 
> On Fri, Sep 1, 2017 at 10:32 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
>> The Chair, PMC, and Committers may be different after graduation.
>> PMC/committers are sometimes not active committers but can have a
> valuable
>> role as mentors, in non-technical roles, as support people on the mailing
>> list, or as sometimes committers who don’t seem very active but come in
>> every so often to make a key contribution. So I hope this doesn’t become
> a
>> time to prune too deeply. I’d suggest we only do that if one of the
>> committers has done something to lessen our project maturity or wants to
> be
>> left out for their own reasons. An example of bad behavior is someone
>> trying to exert corporate dominance (which is severely frowned on by the
>> ASF). Another would be someone who is disruptive to the point of
> destroying
>> team effectiveness. I personally haven’t seen any of this but purposely
>> don’t read everything so chime in here.
>> 
>> It would be good to have people declare their interest-level. As for me,
>> I’d like to remain on the PMC as a committer but have no interest in
> Chair.
>> Since people can become busy periodically and not read @dev (me?) we
> could,
>> maybe should, poll the current committers and PMC to get the lists ready
>> for the graduation proposal.
>> 
>> 
>> Don’t forget that we are not just asking for dev community opinion about
>> graduation. We are also asking that people check things like the Maturity
>> Checklist to see it we are ready. http://community.apache.org/
>> apache-way/apache-project-maturity-model.html <
>> http://community.apache.org/apache-way/apache-project-
> maturity-model.html>
>> People seem fairly enthusiastic about applying for graduation, but are
>> there things we need to do before hand? The goal is to show that we do
> not
>> require the second level check for decisions that the IPMC provides. The
>> last release required no changes but had a proviso about content
> licenses.
>> This next release should fly through without provisos IMHO. Are there
> other
>> things we should do?
>> 
>> 
>> On Sep 1, 2017, at 6:16 AM, takako shimamoto <chiboch...@gmail.com>
> wrote:
>> 
>> I entirely agree with everyone else.
>> I hope the PIO community will become more active after graduation.
>> 
>>> 2. If we are to graduate, who should we include in the list of the
>> initial
>>> PMC?
>> 
>> Don't all present IPMC members are included in the list of 

Re: Recommender for social media

2017-09-05 Thread Pat Ferrel
Actually IMO it is not more complex; it is just far better documented and more 
flexible. If you don’t need the extra features it is just as simple as the Apache 
PIO Templates. I could argue the UR is simpler since you don’t need to $set every 
item and user; they are determined automatically from the data.

But in any case a recommender is a big-data application. 16GB on one machine 
will not get you very far, maybe a POC with limited data.

The next question is what you need. If you need to use all of those pieces 
of data to recommend one thing, then the ALS algorithm of the Apache PIO 
Templates will not work; they can only take one “conversion” event. This is ok 
for some applications but it would mean using “like” alone to recommend other 
items. Not sure a “create” will work at all, since the user may be the only one to 
interact with the created item, unless there is some type of metadata associated 
with the created item. With the Apache Templates, “follows” can only recommend 
users to follow.

The UR can use both likes and follows to recommend either items or users. It’s 
also likely that you can use other data you have. This may be what you mean by 
complex but then you don’t have to use the feature...
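
Concretely, wiring both event types into the UR is one line in engine.json. A 
sketch, assuming you name the events "like" and "follow" (in the UR the first 
name in eventNames is the primary event):

"eventNames": ["like", "follow"]

Input events then just use those names; no $set of users or items is needed first.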


On Sep 5, 2017, at 2:10 AM, Brian Chiu  wrote:

Hi everyone.

I am trying to use PredictionIO to build a recommender for
social-media-like platform, but as I am new to recommender I would
like to get some suggestion from the community here.

The case is something like Twitter:
- A user can create an item
- A user can like an item
- A user can follow another user

I have spent some time trying the official templates, but it seems that
they cannot take advantage of "follow another user" relationship.  I
notice that the "Universal Recommender" from actionML is more powerful
than the official template, but also more complex, and I don't know if
it is suitable for my use case.

Is "Universal Recommender" right choice?  Or is there a simpler
solution?  My machine has 16GB memory and around 50,000 users.

Thanks in advance!

Best,
Brian



Re: Securing Event Server on Heroku?

2017-09-01 Thread Pat Ferrel
TLS/SSL is required along with authentication of the HTTPS requests. I’m not 
familiar with Heroku but the Proxy must authenticate the incoming connections. 
Nginx has basic auth and is a fast proxy, for instance.

A cheap and dirty option, not recommended unless it is your only option, is to set 
your security restrictions to allow connections only from a known IP address or 
range where your app servers run (the servers using the PIO SDK). This would be 
a setting in Heroku I assume. In AWS it is done with VPC Security Groups.
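
A minimal nginx sketch of the basic-auth idea, assuming the EventServer is on its 
default port 7070 on the same box and TLS is already terminated by this server 
block:

location /events.json {
    auth_basic           "PIO EventServer";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass           http://127.0.0.1:7070;
}

An equivalent block in front of the deployed engine's queries.json works the same 
way.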


On Sep 1, 2017, at 12:16 PM, Mars Hall  wrote:

Shane,

A whole different perspective to this, still involving private networks, is to 
deploy all the apps that need to access PIO directly onto the same network. No 
auth required!

Or, peer the PredictionIO private network with other cloud resources, such as 
Salesforce org IP restrictions.

On Fri, Sep 1, 2017 at 12:10 PM, Mars Hall wrote:
Hi Shane,

As you've found, PredictionIO itself does not include a complete authorization 
solution. A general solution is to isolate PredictionIO from the internet on a 
private network, and then implement a gateway/proxy to authorize and route 
traffic to PredictionIO eventserver and engine query API.

With Heroku Enterprise, this architectural pattern may be implemented by 
provisioning two Private Spaces; 
recommended naming pattern: example-public (frontend) & example (backend).

Configure the backend space to only trust incoming traffic from the public 
space and itself. In the Heroku Dashboard:
With two side-by-side browser windows, open the frontend & the backend spaces' 
Network settings.
Copy each of the frontend Space Outbound IPs to the backend Trusted IP Ranges.
CIDR notation for each individual IP is X.X.X.X/32.
Copy each of the backend Space Outbound IPs to its own Trusted IP Ranges.
CIDR notation for each individual IP is X.X.X.X/32.
Then, deploy PredictionIO apps to the backend space. In the frontend space, 
deploy a public proxy/gateway. We've used Node to make simple proxies, or try 
something like Kong API gateway on Heroku, and configure APIs with simple key 
authorization.

Keep in mind, all public-facing traffic and inter-space traffic should be 
encrypted. SSL/TLS is not available by default for Private Spaces apps. 
Therefore, a custom domain name and certificates must be procured and installed 
for every app.

I'd like to see a best-practices pattern emerge around securing PredictionIO. I 
would love to hear about your ongoing progress,

*Mars

On Thu, Aug 31, 2017 at 10:24 PM, Shane Johnson wrote:
Hi everyone. We are building an app exchange app that is leveraging the Heroku 
deployment of PIO. We need to secure the posts to the events.json 
endpoint as well as the queries.json endpoint on Heroku.

Do you have any suggestions on how to add security around adding events and 
querying predictions? Is there an add-on on Heroku, or would it be necessary to 
extend the Scala code to look for a secret key? I would prefer not to extend 
the Scala and have authentication happen at the Heroku level if possible.

Thank you in advance!

Shane Johnson | 801.360.3350 





-- 
*Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California



Re: Error: Could not find or load main class org.apache.predictionio.tools.console.Console

2017-08-31 Thread Pat Ferrel
Downloading and building PIO is not sufficient to run it. You must also install 
all of its dependencies, services like Spark, HDFS, … 

The last concrete issue I saw from you was HDFS not running. PIO requires that 
HDFS is running as well as the other services before you can start it.

Go back and check that each service required is running correctly before trying 
to do anything with PIO. And remember that pio-start-all only starts some 
services so I don’t rely on it. Start the services you need by hand, verify 
they are working, then start the pio EventServer. Continue from there.

At this point it might be better to wipe the machine and start over following 
only one set of instructions. Stick with that one since there are so many ways 
to configure PIO that switching between 2 install instructions will only 
confuse things.
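
Something like the following order usually flushes out which service is the 
problem; the vendors paths are just examples from a source-built install:

# if your config uses HDFS, start it first: $HADOOP_HOME/sbin/start-dfs.sh
$PIO_HOME/vendors/elasticsearch-1.4.4/bin/elasticsearch -d
$PIO_HOME/vendors/hbase-1.0.0/bin/start-hbase.sh
jps -l            # Elasticsearch and HMaster should both show up
pio status        # must pass with no storage errors before going further
pio eventserver &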


On Aug 31, 2017, at 2:14 AM, Paritosh Piplewar <parit...@greentoe.com> wrote:

I downloaded the source directly and ran ./make_distribution.sh. This is 
a contrived example which will help you to replicate the issue: 
https://gist.github.com/passion8/2769147c5352df4dad610100226f3b66 
<https://gist.github.com/passion8/2769147c5352df4dad610100226f3b66> 

system : Ubuntu 16.04.3 x64



-- 
Paritosh Piplewar
Sent with Airmail

On 30 August 2017 at 5:22:43 PM, Pat Ferrel (p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>) wrote:

> Can you explain how you installed and what your problem is? The link below 
> doesn’t contain much information. 
> 
> 
> On Aug 29, 2017, at 9:02 PM, Paritosh Piplewar <parit...@greentoe.com 
> <mailto:parit...@greentoe.com>> wrote:
> 
> 
> Yes I'm running pio from inside bin directory. 
> Sent from my iPhone
> 
> On 30-Aug-2017, at 3:41 AM, Mars Hall <mars.h...@salesforce.com 
> <mailto:mars.h...@salesforce.com>> wrote:
> 
>> I've seen this error occur when an old `pio` command is used with a newer 
>> install of PredictionIO.
>> 
>> Are you running the `pio` command from the `bin/` directory inside your 
>> fresh PredictionIO?
>> 
>> I've seen folks add "PredictionIO-dist/bin/" to their $PATH, so running 
>> `pio` is a simple command. This technique can go stale and cause these 
>> errors.
>> 
>> On Tue, Aug 29, 2017 at 2:26 PM, Paritosh Piplewar <parit...@greentoe.com 
>> <mailto:parit...@greentoe.com>> wrote:
>> How to resolve this error. i did fresh installation and i am 100% sure i did 
>> everything according to what is mentioned in the site 
>> http://predictionio.incubator.apache.org/install/install-sourcecode/ 
>> <http://predictionio.incubator.apache.org/install/install-sourcecode/>
>> 
>> this is what assembly folder look like now http://15as.com/2g3G0M1a1v1w 
>> <http://15as.com/2g3G0M1a1v1w> 
>> 
>> -- 
>> Paritosh Piplewar
>> Sent with Airmail
>> 
>> 
>> 
>> --
>> *Mars Hall
>> 415-818-7039
>> Customer Facing Architect
>> Salesforce Platform / Heroku
>> San Francisco, California



Re: Graduation to TLP

2017-08-30 Thread Pat Ferrel
IMO the VP/Chair responsibilities are mostly administrative and tedious. In the 
best case there is leadership but that will naturally come from thought leaders 
with no regard to title.


On Aug 30, 2017, at 4:41 PM, Chan Lee <chanlee...@gmail.com> wrote:

Agree with everyone else. +1 for graduation.

I was away for a while due to family issues, but I'd be happy to volunteer
for release management.


On Wed, Aug 30, 2017 at 4:30 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Along with the link Donald gave, check this out:
> http://community.apache.org/apache-way/apache-project-maturity-model.html
> <http://community.apache.org/apache-way/apache-project-maturity-model.html
>> 
> 
> We will need at least 2 release managers before graduation, any volunteers?
> 
> 
> On Aug 30, 2017, at 3:51 PM, Mars Hall <mars.h...@salesforce.com> wrote:
> 
> Thank you Donald for leading the charge here,
> 
> From my perspective PredictionIO is already Apache in process & title.
> Graduation seems quite natural to reach top-level recognition.
> 
> I'm interested in helping with PMC duties. Would be great to understand
> what the VP vs Member responsibilities look like.
> 
> Let's graduate. +1
> 
> *Mars
> 
> 
> On Wed, Aug 30, 2017 at 15:21 Pat Ferrel <p...@occamsmachete.com> wrote:
> 
>> I have had several people tell me they want to wait until PIO is not
>> incubating before using it. This even after explaining that “incubating”
>> has more to do with getting into the Apache Way of doing things and has
> no
>> direct link to quality or community. I can only conclude from this that
>> “incubating” is holding back adoption.
>> 
>> And yet we have absorbed the Apache Way and will have at least 3 releases
>> (including 0.12) as incubating. We have brought in a fair number of new
>> committers and seem to have a healthy community of users.
>> 
>> +1 for a push to graduate.
>> 
>> 
>> On Aug 28, 2017, at 10:20 PM, Donald Szeto <don...@apache.org> wrote:
>> 
>> Hi all,
>> 
>> Since the ASF Board meeting in May (
>> 
>> http://apache.org/foundation/records/minutes/2017/board_
> minutes_2017_05_17.txt
>> ),
>> PredictionIO has been considered nearing graduation and I think we are
>> almost there. I am kickstarting this thread so that we can discuss on
> these
>> 3 things:
>> 
>> 1. Does the development community feel ready to graduate?
>> 2. If we are to graduate, who should we include in the list of the
> initial
>> PMC?
>> 3. If we are to graduate, who should be the VP of the initial PMC?
>> 
>> These points are relevant for graduation. Please take a look at the
>> official graduation guide:
>> http://incubator.apache.org/guides/graduation.html.
>> 
>> In addition, Sara and I have been working to transfer the PredictionIO
>> trademark to the ASF. We will keep you updated with our progress.
>> 
>> I would also like to propose to cut a 0.12.0 release by merging JIRAs
> that
>> have a target version set to 0.12.0-incubating for graduation. 0.12.0
> will
>> contain cleanups for minor license and copyright issues that were pointed
>> out in previous releases by IPMC.
>> 
>> Let me know what you think.
>> 
>> Regards,
>> Donald
>> 
>> --
> *Mars Hall
> 415-818-7039
> Customer Facing Architect
> Salesforce Platform / Heroku
> San Francisco, California
> 
> 



Re: Graduation to TLP

2017-08-30 Thread Pat Ferrel
Along with the link Donald gave, check this out: 
http://community.apache.org/apache-way/apache-project-maturity-model.html 
<http://community.apache.org/apache-way/apache-project-maturity-model.html>

We will need at least 2 release managers before graduation, any volunteers?


On Aug 30, 2017, at 3:51 PM, Mars Hall <mars.h...@salesforce.com> wrote:

Thank you Donald for leading the charge here,

From my perspective PredictionIO is already Apache in process & title.
Graduation seems quite natural to reach top-level recognition.

I'm interested in helping with PMC duties. Would be great to understand
what the VP vs Member responsibilities look like.

Let's graduate. +1

*Mars


On Wed, Aug 30, 2017 at 15:21 Pat Ferrel <p...@occamsmachete.com> wrote:

> I have had several people tell me they want to wait until PIO is not
> incubating before using it. This even after explaining that “incubating”
> has more to do with getting into the Apache Way of doing things and has no
> direct link to quality or community. I can only conclude from this that
> “incubating” is holding back adoption.
> 
> And yet we have absorbed the Apache Way and will have at least 3 releases
> (including 0.12) as incubating. We have brought in a fair number of new committers
> committers and seem to have a healthy community of users.
> 
> +1 for a push to graduate.
> 
> 
> On Aug 28, 2017, at 10:20 PM, Donald Szeto <don...@apache.org> wrote:
> 
> Hi all,
> 
> Since the ASF Board meeting in May (
> 
> http://apache.org/foundation/records/minutes/2017/board_minutes_2017_05_17.txt
> ),
> PredictionIO has been considered nearing graduation and I think we are
> almost there. I am kickstarting this thread so that we can discuss on these
> 3 things:
> 
> 1. Does the development community feel ready to graduate?
> 2. If we are to graduate, who should we include in the list of the initial
> PMC?
> 3. If we are to graduate, who should be the VP of the initial PMC?
> 
> These points are relevant for graduation. Please take a look at the
> official graduation guide:
> http://incubator.apache.org/guides/graduation.html.
> 
> In addition, Sara and I have been working to transfer the PredictionIO
> trademark to the ASF. We will keep you updated with our progress.
> 
> I would also like to propose to cut a 0.12.0 release by merging JIRAs that
> have a target version set to 0.12.0-incubating for graduation. 0.12.0 will
> contain cleanups for minor license and copyright issues that were pointed
> out in previous releases by IPMC.
> 
> Let me know what you think.
> 
> Regards,
> Donald
> 
> --
*Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California



Re: Graduation to TLP

2017-08-30 Thread Pat Ferrel
I have had several people tell me they want to wait until PIO is not incubating 
before using it. This even after explaining that “incubating” has more to do 
with getting into the Apache Way of doing things and has no direct link to 
quality or community. I can only conclude from this that “incubating” is 
holding back adoption.

And yet we have absorbed the Apache Way and will have at least 3 releases 
(including 0.12) as incubating. We have brought in a fair number of new committers 
and seem to have a healthy community of users.

+1 for a push to graduate.  


On Aug 28, 2017, at 10:20 PM, Donald Szeto  wrote:

Hi all,

Since the ASF Board meeting in May (
http://apache.org/foundation/records/minutes/2017/board_minutes_2017_05_17.txt),
PredictionIO has been considered nearing graduation and I think we are
almost there. I am kickstarting this thread so that we can discuss on these
3 things:

1. Does the development community feel ready to graduate?
2. If we are to graduate, who should we include in the list of the initial
PMC?
3. If we are to graduate, who should be the VP of the initial PMC?

These points are relevant for graduation. Please take a look at the
official graduation guide:
http://incubator.apache.org/guides/graduation.html.

In addition, Sara and I have been working to transfer the PredictionIO
trademark to the ASF. We will keep you updated with our progress.

I would also like to propose to cut a 0.12.0 release by merging JIRAs that
have a target version set to 0.12.0-incubating for graduation. 0.12.0 will
contain cleanups for minor license and copyright issues that were pointed
out in previous releases by IPMC.

Let me know what you think.

Regards,
Donald



Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-30 Thread Pat Ferrel
Matt, I’m interested in following up on this. If you can’t do a PR, can you 
describe what you did a bit more? 


On Aug 21, 2017, at 12:05 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Matt

I’ll create a feature branch of Mahout in my git repo for simplicity (we are in 
code freeze for Mahout right now). Then if you could peel off your changes and 
make a PR against it. Everyone can have a look before any change is made to the 
ASF repos.

Do a PR against this https://github.com/pferrel/mahout/tree/sparse-speedup 
<https://github.com/pferrel/mahout/tree/sparse-speedup>, even if it’s not 
working we can take a look. The branch right now is just a snapshot of the 
current master in code freeze.

Mahout has always had methods to work with different levels of sparsity and you 
may have found a missing point to optimize. Let’s hope so.


On Aug 21, 2017, at 11:47 AM, Andrew Palumbo <ap@outlook.com> wrote:

I should mention that the density is currently set quite high, and we've been 
discussing a user-defined setting for this.  Something that we have not worked 
in yet.


From: Andrew Palumbo <ap@outlook.com>
Sent: Monday, August 21, 2017 2:44:35 PM
To: user@mahout.apache.org
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues 
(SimilarityAnalysis.cooccurrencesIDSs)


We do currently have optimizations based on density analysis in use e.g.: in 
AtB.


https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala#L431



+1 to PR. thanks for pointing this out.


--andy

____________
From: Pat Ferrel <p...@occamsmachete.com>
Sent: Monday, August 21, 2017 2:26:58 PM
To: user@mahout.apache.org
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues 
(SimilarityAnalysis.cooccurrencesIDSs)

That looks like ancient code from the old mapreduce days. If it passes unit 
tests, create a PR.

Just a guess here, but there are times when this might not speed things up but 
slow them down. However, for very sparse matrices that you might see in CF this 
could work quite well. Some of the GPU optimization will eventually be keyed 
off the density of a matrix, or selectable from knowing its characteristics.

I use this code all the time and would be very interested in a version that 
works with CF style very sparse matrices.

Long story short, create a PR so the optimizer guys can think through the 
implications. If I can also test it I have some large real-world data where I 
can test real-world speedup.


On Aug 21, 2017, at 10:53 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Interesting indeed. What is “massive”? Does the change pass all unit tests?


On Aug 17, 2017, at 1:04 PM, Scruggs, Matt <matt.scru...@bronto.com> wrote:

Thanks for the remarks guys!

I profiled the code running locally on my machine and discovered this loop is 
where these setQuick() and getQuick() calls originate (during matrix Kryo 
deserialization), and as you can see the complexity of this 2D loop can be very 
high:

https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math/src/main/java/org/apache/mahout/math/AbstractMatrix.java#L240


Recall that this algorithm uses SparseRowMatrix whose rows are 
SequentialAccessSparseVector, so all this looping seems unnecessary. I created 
a new subclass of SparseRowMatrix that overrides that assign(matrix, function) 
method, and instead of looping through all the columns of each row, it calls 
SequentialAccessSparseVector.iterateNonZero() so it only has to touch the cells 
with values. I also had to customize MahoutKryoRegistrator a bit with a new 
default serializer for this new matrix class. This yielded a massive 
performance boost and I verified that the results match exactly for several 
test cases and datasets. I realize this could have side-effects in some cases, 
but I'm not using any other part of Mahout, only 
SimilarityAnalysis.cooccurrencesIDSs().

Any thoughts / comments?


Matt



On 8/16/17, 8:29 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:

> It is common with large numerical codes that things run faster in memory on
> just a few cores if the communication required outweighs the parallel
> speedup.
> 
> The issue is that memory bandwidth is slower than the arithmetic speed by a
> very good amount. If you just have to move stuff into the CPU and munch on
> it a bit it is one thing, but if you have to move the data to CPU and back
> to memory to distribute it around possibly multiple times, you may wind up
> with something much slower than you would have had if you were to attack
> the problem directly.
> 
> 
> 
> On Wed, Aug 16, 2017 at 4:47 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
>> This uses the Mahout blas optimizing solver, which I just use and do

Re: Error: Could not find or load main class org.apache.predictionio.tools.console.Console

2017-08-30 Thread Pat Ferrel
Can you explain how you installed and what your problem is? The link below 
doesn’t contain much information. 


On Aug 29, 2017, at 9:02 PM, Paritosh Piplewar  wrote:


Yes I'm running pio from inside bin directory. 
Sent from my iPhone

On 30-Aug-2017, at 3:41 AM, Mars Hall wrote:

> I've seen this error occur when an old `pio` command is used with a newer 
> install of PredictionIO.
> 
> Are you running the `pio` command from the `bin/` directory inside your fresh 
> PredictionIO?
> 
> I've seen folks add "PredictionIO-dist/bin/" to their $PATH, so running `pio` 
> is a simple command. This technique can go stale and cause these errors.
> 
On Tue, Aug 29, 2017 at 2:26 PM, Paritosh Piplewar wrote:
> How to resolve this error. i did fresh installation and i am 100% sure i did 
> everything according to what is mentioned in the site 
> http://predictionio.incubator.apache.org/install/install-sourcecode/ 
> 
> 
> this is what assembly folder look like now http://15as.com/2g3G0M1a1v1w 
>  
> 
> -- 
> Paritosh Piplewar
> Sent with Airmail
> 
> 
> 
> -- 
> *Mars Hall
> 415-818-7039
> Customer Facing Architect
> Salesforce Platform / Heroku
> San Francisco, California



Re: sbt.ResolveException: unresolved dependency: org.apache.predictionio#pio-build;0.11.0-incubrating

2017-08-22 Thread Pat Ferrel
Your template is linking to 
org.apache.predictionio#pio-build;0.10.0-incubrating, what do you have 
installed? org.apache.predictionio#pio-build;0.11.0-incubrating?

Looks like you have to change your template's build.sbt to link to the artifact 
you have built.
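
For example, with a source build of 0.11.0-incubating the template's 
project/pio-build.sbt has to name exactly the artifact you installed; the usual 
line (match the version string to whatever your `pio build` environment actually 
has) is:

addSbtPlugin("org.apache.predictionio" % "pio-build" % "0.11.0-incubating")

The org and version used for the PIO dependencies in build.sbt must agree with 
that same version.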


On Aug 22, 2017, at 3:52 AM, Abhimanyu Nagrath  
wrote:

Hi, I am new to predictionio (v 0.11.0-incubrationg with spark-2.1.0, 
elasticsearch-5.2.1, hbase-1.2.6) and I am using the template 
https://github.com/alexice/template-scala-parallel-svd-item-similarity . This 
template is compatible with a minimum predictionio version of 0.9.2 and requires 
an apache pio conversion. So I replaced io.prediction with org.apache.predictionio 
and changed the version to 0.11.0-incubrating in the files 
(build.sbt, template.json, project/assembly.sbt, project/pio-build.sbt). 
After that, in my template folder I just ran pio build --verbose and got the 
following error:

[INFO] [Engine$] [warn] ::
[INFO] [Engine$] [warn] ::  UNRESOLVED DEPENDENCIES ::
[INFO] [Engine$] [warn] ::
[INFO] [Engine$] [warn] :: 
org.apache.predictionio#pio-build;0.10.0-incubrating: not found
[INFO] [Engine$] [warn] ::
[INFO] [Engine$] [warn] 
[INFO] [Engine$] [warn] Note: Some unresolved dependencies have extra 
attributes.  Check that these dependencies exist with the requested attributes.
[INFO] [Engine$] [warn] 
org.apache.predictionio:pio-build:0.11.0-incubrating (scalaVersion=2.10, 
sbtVersion=0.13)
[INFO] [Engine$] [warn] 
[INFO] [Engine$] [warn] Note: Unresolved dependencies path:
[INFO] [Engine$] [warn] 
org.apache.predictionio:pio-build:0.11.0-incubrating (scalaVersion=2.10, 
sbtVersion=0.13) 
(/predictionio/apache-predictionio-0.11.0-incubating/template-scala-parallel-svd-item-similarity/project/pio-build.sbt#L1-2)
[INFO] [Engine$] [warn]   +- 
default:template-scala-parallel-svd-item-similarity-build:0.1-SNAPSHOT 
(scalaVersion=2.10, sbtVersion=0.13)
[INFO] [Engine$] sbt.ResolveException: unresolved dependency: 
org.apache.predictionio#pio-build;0.11.0-incubrating: not found
[INFO] [Engine$]at 
sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:313)
[INFO] [Engine$]at 
sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
[INFO] [Engine$]at 
sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
[INFO] [Engine$]at 
sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
[INFO] [Engine$]at 
sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
[INFO] [Engine$]at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
[INFO] [Engine$]at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
[INFO] [Engine$]at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
[INFO] [Engine$]at 
xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
[INFO] [Engine$]at 
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
[INFO] [Engine$]at 
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
[INFO] [Engine$]at xsbt.boot.Using$.withResource(Using.scala:10)
[INFO] [Engine$]at xsbt.boot.Using$.apply(Using.scala:9)
[INFO] [Engine$]at 
xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
[INFO] [Engine$]at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
[INFO] [Engine$]at xsbt.boot.Locks$.apply0(Locks.scala:31)
[INFO] [Engine$]at xsbt.boot.Locks$.apply(Locks.scala:28)
[INFO] [Engine$]at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
[INFO] [Engine$]at sbt.IvySbt.withIvy(Ivy.scala:128)
[INFO] [Engine$]at sbt.IvySbt.withIvy(Ivy.scala:125)
[INFO] [Engine$]at sbt.IvySbt$Module.withModule(Ivy.scala:156)
[INFO] [Engine$]at sbt.IvyActions$.updateEither(IvyActions.scala:168)
[INFO] [Engine$]at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1481)
[INFO] [Engine$]at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1477)
[INFO] [Engine$]at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$121.apply(Defaults.scala:1512)
[INFO] [Engine$]at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$121.apply(Defaults.scala:1510)
[INFO] [Engine$]at 
sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
[INFO] [Engine$]at 
sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1515)
[INFO] [Engine$]at 
sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1509)
[INFO] [Engine$]at 

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
Matt

I’ll create a feature branch of Mahout in my git repo for simplicity (we are in 
code freeze for Mahout right now). Then if you could peel off your changes and 
make a PR against it. Everyone can have a look before any change is made to the 
ASF repos.

Do a PR against this branch, https://github.com/pferrel/mahout/tree/sparse-speedup
<https://github.com/pferrel/mahout/tree/sparse-speedup>; even if it’s not
working, we can take a look. The branch right now is just a snapshot of the
current master in code freeze.

Mahout has always had methods to work with different levels of sparsity and you 
may have found a missing point to optimize. Let’s hope so.


On Aug 21, 2017, at 11:47 AM, Andrew Palumbo <ap@outlook.com> wrote:

I should mention that the density is currently set quite high, and we've been
discussing a user-defined setting for this, something we have not worked
in yet.


From: Andrew Palumbo <ap@outlook.com>
Sent: Monday, August 21, 2017 2:44:35 PM
To: user@mahout.apache.org
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues 
(SimilarityAnalysis.cooccurrencesIDSs)


We do currently have optimizations based on density analysis in use, e.g. in
AtB:


https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/package.scala#L431
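To make the density-keyed dispatch concrete, here is a rough sketch of the idea,
not Mahout's actual analysis: estimate the non-zero fraction from a few sampled
rows and choose a dense or sparse kernel from that. The threshold, the sample
size, and the denseAtB/sparseAtB names are illustrative assumptions.

import org.apache.mahout.math.Matrix

object DensitySketch {
  // Estimate density from a sample of rows; threshold and sample size are
  // illustrative, not the values Mahout uses.
  def looksDense(m: Matrix, threshold: Double = 0.25, sampleRows: Int = 20): Boolean = {
    val rows = math.min(sampleRows, m.rowSize())
    var nnz = 0L
    for (r <- 0 until rows) nnz += m.viewRow(r).getNumNondefaultElements
    val sampled = rows.toLong * m.columnSize()
    sampled > 0 && nnz.toDouble / sampled > threshold
  }
  // e.g. val c = if (looksDense(a)) denseAtB(a, b) else sparseAtB(a, b)
}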



+1 to PR. thanks for pointing this out.


--andy

________
From: Pat Ferrel <p...@occamsmachete.com>
Sent: Monday, August 21, 2017 2:26:58 PM
To: user@mahout.apache.org
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues 
(SimilarityAnalysis.cooccurrencesIDSs)

That looks like ancient code from the old MapReduce days. If it passes unit
tests, create a PR.

Just a guess here, but there are times when this might not speed things up but
slow them down. However, for the very sparse matrices you might see in CF this
could work quite well. Some of the GPU optimizations will eventually be keyed
off the density of a matrix, or be selectable from knowing its characteristics.

I use this code all the time and would be very interested in a version that
works with CF-style very sparse matrices.

Long story short, create a PR so the optimizer guys can think through the
implications. If I can also test it, I have some large real-world data where I
can measure real-world speedup.


On Aug 21, 2017, at 10:53 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Interesting indeed. What is “massive”? Does the change pass all unit tests?


On Aug 17, 2017, at 1:04 PM, Scruggs, Matt <matt.scru...@bronto.com> wrote:

Thanks for the remarks guys!

I profiled the code running locally on my machine and discovered this loop is 
where these setQuick() and getQuick() calls originate (during matrix Kryo 
deserialization), and as you can see the complexity of this 2D loop can be very 
high:

https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e53/math/src/main/java/org/apache/mahout/math/AbstractMatrix.java#L240


Recall that this algorithm uses SparseRowMatrix whose rows are 
SequentialAccessSparseVector, so all this looping seems unnecessary. I created 
a new subclass of SparseRowMatrix that overrides that assign(matrix, function) 
method, and instead of looping through all the columns of each row, it calls 
SequentialAccessSparseVector.iterateNonZero() so it only has to touch the cells 
with values. I also had to customize MahoutKryoRegistrator a bit with a new 
default serializer for this new matrix class. This yielded a massive 
performance boost and I verified that the results match exactly for several 
test cases and datasets. I realize this could have side-effects in some cases, 
but I'm not using any other part of Mahout, only 
SimilarityAnalysis.cooccurrencesIDSs().
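
For the curious, here is a minimal sketch of the approach described above. It is
not the actual patch: the class name is invented, and the shortcut is only
correct for combining functions f with f(x, 0) == x (e.g. Functions.PLUS), which
is why the zero cells can be skipped. It uses the Iterable-based
Vector.nonZeroes() accessor, which plays the same role as iterateNonZero().

import org.apache.mahout.math.{Matrix, SparseRowMatrix}
import org.apache.mahout.math.function.DoubleDoubleFunction

// Hypothetical subclass: assign(other, f) visits only the non-zero cells of
// `other` instead of every (row, column) pair, so cost scales with nnz rather
// than rows * columns. Correct only when f(x, 0) == x, so skipped cells keep
// their current value.
class NonZeroAssignSparseRowMatrix(rows: Int, cols: Int)
    extends SparseRowMatrix(rows, cols) {

  override def assign(other: Matrix, f: DoubleDoubleFunction): Matrix = {
    var r = 0
    while (r < rowSize()) {
      val it = other.viewRow(r).nonZeroes().iterator()
      while (it.hasNext) {
        val e = it.next()  // sparse element carries its column index and value
        setQuick(r, e.index(), f.apply(getQuick(r, e.index()), e.get()))
      }
      r += 1
    }
    this
  }
}

As noted above, the new class would also need its own Kryo serializer registered
(the MahoutKryoRegistrator change) so deserialization produces this type, and
the shortcut must not be used with functions where f(x, 0) != x.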

Any thoughts / comments?


Matt



On 8/16/17, 8:29 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:

> It is common with large numerical codes that things run faster in memory on
> just a few cores if the communication required outweighs the parallel
> speedup.
> 
> The issue is that memory bandwidth is slower than the arithmetic speed by a
> wide margin. If you just have to move data into the CPU and munch on it a
> bit, that is one thing, but if you have to move the data to the CPU and back
> to memory to distribute it around possibly multiple times, you may wind up
> with something much slower than you would have had if you were to attack
> the problem directly.
> 
> 
> 
> On Wed, Aug 16, 2017 at 4:47 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
>> This uses the Mahout blas optimizing solver, which I just use and do not
>> know well. Mahout virtualizes some things having to do with partitioning
>> and I’ve never quite understood how they work. There is a .par() on one of
>> the matrix clas
