Re: ALS update without re-computing everything

2016-03-11 Thread Nick Pentreath
There is a general movement to allowing initial models to be specified for Spark ML algorithms, so I'll add a JIRA to that task set. I should be able to work on this as well as other ALS improvements. Oh, another reason fold-in is typically not done in Spark is that for models of any reasonable

Re: ALS update without re-computing everything

2016-03-11 Thread Sean Owen
On Fri, Mar 11, 2016 at 12:18 PM, Nick Pentreath wrote: > In general, for serving situations MF models are stored in some other > serving system, so that system may be better suited to do the actual > fold-in. Sean's Oryx project does that, though I'm not sure offhand if

Re: ALS update without re-computing everything

2016-03-11 Thread Nick Pentreath
Currently this is not supported. If you want to do incremental fold-in of new data you would need to do it outside of Spark (e.g. see this discussion: https://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/browser, which also mentions a streaming on-line MF implementation with SGD). In

unsubscribe

2016-03-11 Thread ????/??????

ALS update without re-computing everything

2016-03-11 Thread Roberto Pagliari
In the current implementation of ALS with implicit feedback, when new data comes in, it is not possible to update the user/product matrices without re-computing everything. Is this feature planned, or is there a known workaround? Thank you,

Re: Spark configuration with 5 nodes

2016-03-11 Thread Steve Loughran
On 10 Mar 2016, at 22:15, Ashok Kumar wrote: Hi, We intend to use 5 servers which will be utilized for building a Bigdata Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Nick Pentreath
Would you mind letting us know the # training examples in the datasets? Also, what do your features look like? Are they text, categorical etc? You mention that most rows only have a few features, and all rows together have a few 10,000s features, yet your max feature value is 20 million. How are

Strange behavior of collectNeighbors API in GraphX

2016-03-11 Thread Zhaokang Wang
Hi all, These days I have met a problem of GraphX’s strange behavior on collectNeighbors API. It seems that this API has side-effects on the Pregel API. It makes Pregel API not work as expected. The following is a small code demo to reproduce this

kill Spark Streaming job gracefully

2016-03-11 Thread Shams ul Haque
Hi, I want to kill a Spark Streaming job gracefully, so that whatever Spark has picked up from Kafka has been processed. My Spark version is 1.6.0. When I tried killing a Spark Streaming job from the Spark UI, it doesn't stop the app completely. In the Spark UI the job is moved to the COMPLETED section, but in log it

Re: Running ALS on comparitively large RDD

2016-03-11 Thread Deepak Gopalakrishnan
Executor memory : 45g X 4 executors , 1 Driver with 45g memory Data Source is from S3 and I've logs that tells me the Rating objects are loaded fine. On Fri, Mar 11, 2016 at 2:13 PM, Nick Pentreath wrote: > Hmmm, something else is going on there. What data source are

Re: Can we use spark inside a web service?

2016-03-11 Thread Hemant Bhanawat
Spark-jobserver is an elegant product that builds concurrency on top of Spark. But the current design of DAGScheduler prevents Spark from becoming a truly concurrent solution for low latency queries. DAGScheduler will turn out to be a bottleneck for low latency queries. Sparrow project was an effort

How to efficiently query a large table with multiple dimensional table?

2016-03-11 Thread ashokkumar rajendran
Hi All, I have a large table with a few billion rows and a very small table with 4 dimensional values. I would like to get rows that match any of these dimensions. For example, Select field1, field2 from A, B where A.dimension1 = B.dimension1 OR A.dimension2 = B.dimension2 OR A.dimension3

can checkpoint and write ahead log save the data in queued batch?

2016-03-11 Thread Yu Xie
Hi Spark users, I am running a Spark Streaming app that uses a receiver from a pubsub system, and the pubsub system does NOT support ack. I don't want the data to be lost if there is a driver failure and, by accident, the batches queue up at that time. I tested by generating some queued

Re: Running ALS on comparitively large RDD

2016-03-11 Thread Nick Pentreath
Hmmm, something else is going on there. What data source are you reading from? How much driver and executor memory have you provided to Spark? On Fri, 11 Mar 2016 at 09:21 Deepak Gopalakrishnan wrote: > 1. I'm using about 1 million users against few thousand products. I >

Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
Regarding my previous message, I forgot to mention to run netstat as root (sudo netstat -plunt). Sorry for the noise. On Fri, Mar 11, 2016 at 12:29 AM, Jakob Odersky wrote: > Some more diagnostics/suggestions: > > 1) are other services listening to ports in the 4000 range (run >

Re: Zeppelin Integration

2016-03-11 Thread Mich Talebzadeh
BTW, when the daemon is stopped on the host, the notebook just hangs if it was running, without any errors. The only way is to tail the last log in $ZEPPELIN_HOME/logs. So I would say a cron type job is required to scan the log for errors.

Re: Installing Spark on Mac

2016-03-11 Thread Jakob Odersky
Some more diagnostics/suggestions: 1) are other services listening to ports in the 4000 range (run "netstat -plunt")? Maybe there is an issue with the error message itself. 2) are you sure the correct java version is used? java -version 3) can you revert all installation attempts you have done

Strange behavior of collectNeighbors API in GraphX

2016-03-11 Thread Zhaokang Wang
Hi all, These days I have met a problem of GraphX’s strange behavior on |collectNeighbors| API. It seems that this API has side-effects on the Pregel API. It makes Pregel API not work as expected. The following is a small code demo to reproduce this strange behavior. You can get the whole
