Fwd: Writing streaming data to Cassandra creates duplicates
Yes... union would be one solution. I am not doing any aggregation, so reduceByKey would not be useful. If I use groupByKey, messages with the same key would be obtained in a partition, but groupByKey is a very expensive operation as it involves a shuffle. My ultimate goal is to write the messages to Cassandra. If messages with the same key are handled by different streams, there would be concurrency issues. To resolve this I can union the dstreams and apply a hash partitioner so that all messages with the same key land in a single partition, or do a groupByKey, which achieves the same thing. As groupByKey is expensive, is there any workaround for this?

On Thu, Jul 30, 2015 at 2:33 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote:

Hi,

Just my two cents. I understand your problem is that you have messages with the same key in two different dstreams. What I would do is make a union of all the dstreams with StreamingContext.union or several calls to DStream.union, then create a pair dstream with the primary key as key, and then use groupByKey or reduceByKey (or combineByKey etc.) to combine the messages with the same primary key. Hope that helps.

Greetings,

Juan

2015-07-30 10:50 GMT+02:00 Priya Ch learnings.chitt...@gmail.com:

Hi All,

Can someone throw insights on this?

On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com wrote:

Hi TD,

Thanks for the info. I have a scenario like this: I am reading data from a Kafka topic. Let's say Kafka has 3 partitions for the topic. In my streaming application, I would configure 3 receivers with 1 thread each so that they receive 3 dstreams (from the 3 partitions of the Kafka topic), and I also implement a partitioner. Now there is a possibility of receiving a message with the same primary key twice or more: once when the message is created, and again whenever any field of that message is updated.
If two messages M1 and M2 with the same primary key are read by 2 receivers, then even with the partitioner Spark would still end up processing them in parallel, as they are in different dstreams altogether. How do we address this situation?

Thanks, Padma Ch

On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das t...@databricks.com wrote:

You have to partition the data in Spark Streaming by the primary key, and then make sure inserts into Cassandra are atomic per key, or per set of keys in the partition. You can use the combination of the batch time and partition id of the RDD inside foreachRDD as the unique id for the data you are inserting. This will guard against multiple attempts to run the task that inserts into Cassandra. See http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations

TD

On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch learnings.chitt...@gmail.com wrote:

Hi All,

I have a problem when writing streaming data to Cassandra. Our existing product is on an Oracle DB, in which locks are maintained while writing data so that duplicates in the DB are avoided. But as Spark has a parallel processing architecture, if more than 1 thread tries to write the same data, i.e. with the same primary key, is there any scope to create duplicates? If yes, how do we address this problem, either from the Spark or the Cassandra side?

Thanks, Padma Ch
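[Editor's note: a minimal sketch of the approach discussed above — union the receiver dstreams, co-locate keys with a HashPartitioner, and use (batch time, partition id) inside foreachRDD as a deterministic id for idempotent writes. The `Message` case class, the partition count, and the commented-out Cassandra insert are illustrative assumptions, not code from this thread.]

```scala
import org.apache.spark.{HashPartitioner, TaskContext}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical message type keyed by its primary key.
case class Message(id: String, payload: String)

def writeDeduped(streams: Seq[DStream[(String, Message)]]): Unit = {
  // Union all receiver dstreams so every message flows through one pipeline.
  val unioned = streams.reduce(_ union _)

  unioned.foreachRDD { (rdd, batchTime) =>
    // Hash-partition by primary key: all messages for a given key are now
    // handled by a single task, so concurrent writers never race on one row.
    val byKey = rdd.partitionBy(new HashPartitioner(rdd.sparkContext.defaultParallelism))

    byKey.foreachPartition { iter =>
      // (batchTime, partitionId) uniquely identifies this slice of data, as
      // TD suggests; a re-run of a failed task reproduces the same id, so the
      // write can be made idempotent (e.g. a Cassandra upsert per key).
      val sliceId = (batchTime.milliseconds, TaskContext.getPartitionId())
      iter.foreach { case (key, msg) =>
        // session.execute("INSERT INTO ks.messages (id, payload) VALUES (?, ?)", ...)
      }
    }
  }
}
```

Since Cassandra inserts are upserts by primary key, replaying the same (batch time, partition id) slice overwrites rather than duplicates rows, which is what guards against re-run tasks.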
Re: How to help for 1.5 release?
I think you can start from here: https://issues.apache.org/jira/browse/SPARK/fixforversion/12332078/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel

Thanks
Best Regards

On Tue, Aug 4, 2015 at 12:02 PM, Meihua Wu rotationsymmetr...@gmail.com wrote:

I think the team is preparing for the 1.5 release. Anything to help with the QA, testing etc? Thanks, MW
How to help for 1.5 release?
I think the team is preparing for the 1.5 release. Anything to help with the QA, testing etc? Thanks, MW
Re: Have Friedman's glmnet algo running in Spark
I have a follow-up on this: I see on JIRA that the idea of having a GLMNET implementation was more or less abandoned, since an OWLQN implementation was chosen to construct a model using L1/L2 regularization.

However, GLMNET has the property of returning a multitude of models (corresponding to different values of the penalty parameter [for the regularization]). I think this is not the case in the OWLQN implementation. This would be really helpful for comparing the accuracy of models with different regParam values. As far as I understood, it would avoid having a costly cross-validation step over a possibly large set of regParam values.

Joseph Bradley wrote:

Some of this discussion seems valuable enough to preserve on the JIRA; can we move it there (and copy any relevant discussions from previous emails as needed)?

On Wed, Feb 25, 2015 at 10:35 AM, mike@ wrote:

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Have-Friedman-s-glmnet-algo-running-in-Spark-tp10692p13587.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: Have Friedman's glmnet algo running in Spark
My friends and I are continuing work on the algorithm. You are right that there are two elements to Friedman's glmnet algorithm. One is the use of coordinate descent for minimizing penalized regression with an absolute-value penalty, and the other is managing the regularization parameters. Friedman's algorithm does return the entire regularization path.

We have had to get fairly deep into the mechanics of linear algebra. The tricky part has been arranging the matrix and vector multiplications to minimize the compute times (e.g. big time differences between multiplying by a submatrix versus multiplying by the columns in the submatrix, etc.).

All of the versions we've produced generate a multitude of solutions (default = 100) for a range of different values of the regularization parameter. The solutions always cover the most heavily penalized end of the curve. The number of solutions generated depends on how fine the steps are and how close the solutions get to the fully saturated (un-penalized) solution. Default values for these work about 80% of the time.

Personally, I've always found it useful to have the entire regularization path. One way or another, that's always required to get a final solution. It's just a question of whether the points on the path are generated by hunting and pecking or done all in one shot systematically.

mike

-----Original Message-----
From: Patrick [mailto:petz2...@gmail.com]
Sent: Tuesday, August 4, 2015 12:50 AM
To: d...@sparapache.org
Subject: Re: Have Friedman's glmnet algo running in Spark

I have a follow-up on this: I see on JIRA that the idea of having a GLMNET implementation was more or less abandoned, since an OWLQN implementation was chosen to construct a model using L1/L2 regularization. However, GLMNET has the property of returning a multitude of models (corresponding to different values of the penalty parameter [for the regularization]). I think this is not the case in the OWLQN implementation.
However, this would be really helpful for comparing the accuracy of models with different regParam values. As far as I understood, this would avoid having a costly cross-validation step over a possibly large set of regParam values.

Joseph Bradley wrote:

Some of this discussion seems valuable enough to preserve on the JIRA; can we move it there (and copy any relevant discussions from previous emails as needed)?

On Wed, Feb 25, 2015 at 10:35 AM, mike@ wrote:
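[Editor's note: an illustrative sketch of the two elements mike describes — cyclic coordinate descent with a soft-thresholding update, warm-started along a decreasing grid of penalty values to produce the whole regularization path. This is not the code discussed in the thread, and it assumes standardized (zero-mean, unit-variance) feature columns; a production version would use the residual-update tricks the thread alludes to rather than this naive inner loop.]

```scala
object LassoPathSketch {

  // Soft-thresholding operator: the closed-form coordinate update
  // for an absolute-value (L1) penalty.
  def softThreshold(z: Double, gamma: Double): Double =
    math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

  // One full cycle of coordinate descent over the weights, in place.
  // Assumes each column of x is standardized to unit variance.
  def cdSweep(x: Array[Array[Double]], y: Array[Double],
              w: Array[Double], lambda: Double): Unit = {
    val n = y.length
    for (j <- w.indices) {
      // Correlation of feature j with the partial residual excluding j.
      var rho = 0.0
      for (i <- 0 until n) {
        var pred = 0.0
        for (k <- w.indices if k != j) pred += x(i)(k) * w(k)
        rho += x(i)(j) * (y(i) - pred)
      }
      w(j) = softThreshold(rho / n, lambda)
    }
  }

  // The "path" element: solve for a decreasing sequence of lambdas,
  // warm-starting each solve from the previous solution. This is what
  // makes producing ~100 models barely more expensive than one.
  def path(x: Array[Array[Double]], y: Array[Double],
           lambdas: Seq[Double], sweeps: Int = 50): Seq[Array[Double]] = {
    val w = Array.fill(x.head.length)(0.0)
    lambdas.sortBy(-_).map { lambda =>
      (1 to sweeps).foreach(_ => cdSweep(x, y, w, lambda))
      w.clone() // snapshot one point on the regularization path
    }
  }
}
```

Starting from the most heavily penalized end (largest lambda, all-zero weights) and warm-starting downward is why the path always covers that end of the curve first, as the message above notes.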
Re: How to help for 1.5 release?
Hey Meihua,

If you are a user of Spark, one thing that is really helpful is to run Spark 1.5 on your workload and report any issues, performance regressions, etc.

- Patrick

On Mon, Aug 3, 2015 at 11:49 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

I think you can start from here: https://issues.apache.org/jira/browse/SPARK/fixforversion/12332078/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel

Thanks
Best Regards

On Tue, Aug 4, 2015 at 12:02 PM, Meihua Wu rotationsymmetr...@gmail.com wrote:

I think the team is preparing for the 1.5 release. Anything to help with the QA, testing etc? Thanks, MW
shane will be OOO 8-5-15 through 8-18-15
so i done gone and got myself hitched, and will be disappearing in to the rainy island of kol chang in thailand for the next ~2 weeks. :)

this means i will be completely out of contact, and have to leave jenkins in the gentle hands of jon kuroda (a sysadmin here at the lab) and matt massie (my boss). they've been CCed on this email, and briefed on the basic operations, so should be able to maintain things while i'm gone.

i will ask that during these next couple of weeks that we hold off on any major system changes and package installations, unless it's a blocker and needed for any releases. this is mostly my fault, as i've not finished porting all of the bash setup scripts (of doom) to ansible and i'd like to minimize feature drift.

anyways, i'll be back in a couple of weeks, so don't break anything. :)

shane