Fwd: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread Priya Ch
Yes... union would be one solution. I am not doing any aggregation, hence
reduceByKey would not be useful. If I use groupByKey, messages with the same
key would end up in the same partition. But groupByKey is a very expensive
operation, as it involves a shuffle. My ultimate goal is to write the messages
to Cassandra. If messages with the same key are handled by different streams,
there would be concurrency issues. To resolve this I can union the dstreams and
apply a hash partitioner so that all the same keys are brought to a single
partition, or do a groupByKey, which does the same.

As groupByKey is expensive, is there any workaround for this?
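
For reference, this is roughly what I mean by the union + hash partitioner
approach (just a sketch: the three socket streams, the host/ports, and the
comma-separated record layout are placeholders, not my actual job):

    import org.apache.spark.{HashPartitioner, SparkConf}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder setup: three text streams standing in for the three receivers.
    val ssc = new StreamingContext(new SparkConf().setAppName("dedup-sketch"), Seconds(5))
    val streams = (1 to 3).map(i => ssc.socketTextStream("host", 9000 + i))

    // Key each record by its primary key (here assumed to be the first CSV field).
    val keyed = ssc.union(streams).map(line => (line.split(",")(0), line))

    // partitionBy only routes records to their target partition; unlike groupByKey
    // it does not build a per-key collection, though it still pays for one shuffle.
    val coPartitioned = keyed.transform(rdd => rdd.partitionBy(new HashPartitioner(3)))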

On Thu, Jul 30, 2015 at 2:33 PM, Juan Rodríguez Hortalá 
juan.rodriguez.hort...@gmail.com wrote:

 Hi,

 Just my two cents. I understand your problem is that you have messages with
 the same key in two different dstreams. What I would do is make a union of all
 the dstreams with StreamingContext.union or several calls to DStream.union,
 then create a pair dstream with the primary key as the key, and then use
 groupByKey or reduceByKey (or combineByKey etc.) to combine the messages with
 the same primary key.
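
 Something along these lines, just as a sketch (the three socket streams and
 the "key,timestamp,payload" record layout are made-up placeholders, and
 "newest timestamp wins" is only one possible combine rule):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder input: three streams carrying "key,timestamp,payload" records.
    val ssc = new StreamingContext(new SparkConf().setAppName("union-sketch"), Seconds(5))
    val streams = (1 to 3).map(i => ssc.socketTextStream("host", 9000 + i))

    // Union everything, key by the primary key, and keep only the newest update per key.
    val latestPerKey = ssc.union(streams)
      .map { line =>
        val Array(pk, ts, payload) = line.split(",", 3)
        (pk, (ts.toLong, payload))
      }
      .reduceByKey((a, b) => if (a._1 >= b._1) a else b)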

 Hope that helps.

 Greetings,

 Juan


 2015-07-30 10:50 GMT+02:00 Priya Ch learnings.chitt...@gmail.com:

 Hi All,

 Can someone share some insights on this?

 On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com
 wrote:



 Hi TD,

 Thanks for the info. I have a scenario like this.

 I am reading data from a Kafka topic. Let's say Kafka has 3 partitions for
 the topic. In my streaming application, I would configure 3 receivers with 1
 thread each, so that they produce 3 dstreams (one per Kafka partition), and I
 also implement a partitioner. Now there is a possibility of receiving a
 message with the same primary key twice or more: once when the message is
 created, and again whenever any field of that message is updated.

 If two messages M1 and M2 with the same primary key are read by 2 different
 receivers, then even with the partitioner in Spark they would still end up
 being processed in parallel, as they are in different dstreams altogether. How
 do we address this situation?

 Thanks,
 Padma Ch

 On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das t...@databricks.com
 wrote:

 You have to partition the data in Spark Streaming by the primary key, and then
 make sure you insert the data into Cassandra atomically per key, or per set of
 keys in the partition. You can use the combination of the (batch time,
 partition id) of the RDD inside foreachRDD as the unique id for the data you
 are inserting. This will guard against multiple attempts to run the task that
 inserts into Cassandra.

 See
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
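
 A minimal sketch of that pattern (the socket source, the CSV keying, and the
 writeBatch call are placeholders, not the Cassandra connector API):

    import org.apache.spark.{SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("idempotent-writes"), Seconds(5))
    val keyed = ssc.socketTextStream("host", 9999).map(line => (line.split(",")(0), line))

    keyed.foreachRDD { (rdd, batchTime) =>
      rdd.foreachPartition { records =>
        // Unique per (batch, partition): a re-run of this task reuses the same id,
        // so the write on the Cassandra side can be made idempotent.
        val writeId = (batchTime.milliseconds, TaskContext.get.partitionId)
        // writeBatch(writeId, records)   // placeholder for the actual atomic insert
      }
    }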

 TD

 On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch 
 learnings.chitt...@gmail.com wrote:

 Hi All,

 I have a problem when writing streaming data to Cassandra. Our existing
 product is on an Oracle DB, in which, while writing data, locks are maintained
 so that duplicates in the DB are avoided.

 But as Spark has a parallel processing architecture, if more than one thread
 is trying to write the same data, i.e. data with the same primary key, is
 there any scope for creating duplicates? If yes, how do we address this
 problem, either from the Spark side or from the Cassandra side?

 Thanks,
 Padma Ch



Re: How to help for 1.5 release?

2015-08-04 Thread Akhil Das
I think you can start from here
https://issues.apache.org/jira/browse/SPARK/fixforversion/12332078/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel

Thanks
Best Regards

On Tue, Aug 4, 2015 at 12:02 PM, Meihua Wu rotationsymmetr...@gmail.com
wrote:

 I think the team is preparing for the 1.5 release. Anything to help with
 the QA, testing etc?

 Thanks,

 MW




How to help for 1.5 release?

2015-08-04 Thread Meihua Wu
I think the team is preparing for the 1.5 release. Anything to help with
the QA, testing etc?

Thanks,

MW


Re: Have Friedman's glmnet algo running in Spark

2015-08-04 Thread Patrick
I have a follow-up on this:
I see on JIRA that the idea of having a GLMNET implementation was more or
less abandoned, since an OWLQN implementation was chosen to construct a model
using L1/L2 regularization.

However, GLMNET has the property of returning a multitude of models
(corresponding to different values of the penalty parameter for the
regularization). I think this is not the case in the OWLQN implementation.
However, this would be really helpful for comparing the accuracy of models with
different regParam values.
As far as I understood, this would avoid a costly cross-validation step over a
possibly large set of regParam values.
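
For concreteness, this is roughly what "one model per regParam" looks like with
the current spark.ml estimators (just an illustration, not a proposal; the
"training" DataFrame with label/features columns is an assumed placeholder):

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.DataFrame

    // Each point on the path is a separate fit of the estimator, which is exactly
    // the per-lambda cost a glmnet-style path algorithm avoids.
    def fitPath(training: DataFrame, regParams: Seq[Double]) =
      regParams.map { lambda =>
        new LinearRegression()
          .setElasticNetParam(1.0)   // pure L1 penalty, the lasso end of glmnet
          .setRegParam(lambda)
          .fit(training)
      }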


Joseph Bradley wrote:
 Some of this discussion seems valuable enough to preserve on the JIRA; can
 we move it there (and copy any relevant discussions from previous emails as
 needed)?

 On Wed, Feb 25, 2015 at 10:35 AM, <mike@> wrote:








Re: Have Friedman's glmnet algo running in Spark

2015-08-04 Thread mike
My friends and I are continuing work on the algorithm. You are right that
there are two elements to Friedman's glmnet algorithm. One is the use of
coordinate descent for minimizing penalized regression with an absolute value
penalty, and the other is managing the regularization parameters. Friedman's
algorithm does return the entire regularization path. We have had to get
fairly deep into the mechanics of linear algebra. The tricky part has been
arranging the matrix and vector multiplications to minimize the compute times
(e.g. big time differences between multiplying by a submatrix versus
multiplying by the columns in the submatrix, etc.).
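
For readers following along, here is a toy, purely local illustration of the
coordinate-descent / soft-thresholding step being discussed (not the code we
are working on; it assumes the columns of x are centered and scaled to unit
variance, and handles a single fixed lambda):

    // Soft-threshold operator used by the lasso coordinate update.
    def softThreshold(z: Double, gamma: Double): Double =
      math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

    // Cyclic coordinate descent for the lasso on a small dense problem.
    def lassoCD(x: Array[Array[Double]], y: Array[Double],
                lambda: Double, iters: Int = 100): Array[Double] = {
      val n = x.length
      val p = x(0).length
      val beta = Array.fill(p)(0.0)
      for (_ <- 0 until iters; j <- 0 until p) {
        // Correlation of column j with the partial residual that excludes column j.
        var rho = 0.0
        for (i <- 0 until n) {
          val predExclJ = (0 until p).map(k => if (k == j) 0.0 else x(i)(k) * beta(k)).sum
          rho += x(i)(j) * (y(i) - predExclJ)
        }
        beta(j) = softThreshold(rho / n, lambda)   // denominator is 1 for unit-variance columns
      }
      beta
    }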

All of the versions we've produced generate a multitude of solutions (default = 
100) for a range of different values of the regularization parameter. The 
solutions always cover the most heavily penalized end of the curve. The number 
of solutions generated depends on how fine the steps are and how close the 
solutions get to the fully saturated (un-penalized) solution. Default values 
for these work about 80% of the time.
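
As a rough illustration of what "how fine the steps are" means, here is the
usual glmnet-style convention for building the lambda grid (an assumption about
the general technique, not our exact defaults): start at the smallest lambda
that zeroes every coefficient and step down geometrically toward a small
fraction of it.

    // Lasso-style lambda grid: lambdaMax is the smallest penalty at which all
    // coefficients are zero (for centered data); step down geometrically to
    // eps * lambdaMax over nLambda points.
    def lambdaGrid(x: Array[Array[Double]], y: Array[Double],
                   nLambda: Int = 100, eps: Double = 1e-3): Seq[Double] = {
      val n = x.length
      val p = x(0).length
      val lambdaMax = (0 until p).map { j =>
        math.abs((0 until n).map(i => x(i)(j) * y(i)).sum) / n
      }.max
      val step = math.log(eps) / (nLambda - 1)   // log-spaced down to eps * lambdaMax
      (0 until nLambda).map(k => lambdaMax * math.exp(step * k))
    }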

Personally, I've always found it useful to have the entire regularization path.
One way or another, that's always required to get a final solution. It's just a
question of whether the points on the path are generated by hunting and pecking
or done all in one shot systematically.
mike






-Original Message-
From: Patrick [mailto:petz2...@gmail.com]
Sent: Tuesday, August 4, 2015 12:50 AM
To: d...@spark.apache.org
Subject: Re: Have Friedman's glmnet algo running in Spark

I have a follow-up on this: I see on JIRA that the idea of having a GLMNET
implementation was more or less abandoned, since an OWLQN implementation was
chosen to construct a model using L1/L2 regularization. However, GLMNET has the
property of returning a multitude of models (corresponding to different values
of the penalty parameter for the regularization). I think this is not the case
in the OWLQN implementation. However, this would be really helpful for
comparing the accuracy of models with different regParam values. As far as I
understood, this would avoid a costly cross-validation step over a possibly
large set of regParam values.

Joseph Bradley wrote:
 Some of this discussion seems valuable enough to preserve on the JIRA; can we
 move it there (and copy any relevant discussions from previous emails as
 needed)?

 On Wed, Feb 25, 2015 at 10:35 AM, <mike@> wrote:


Re: How to help for 1.5 release?

2015-08-04 Thread Patrick Wendell
Hey Meihua,

If you are a user of Spark, one thing that is really helpful is to run
Spark 1.5 on your workload and report any issues, performance
regressions, etc.

- Patrick

On Mon, Aug 3, 2015 at 11:49 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 I think you can start from here
 https://issues.apache.org/jira/browse/SPARK/fixforversion/12332078/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel

 Thanks
 Best Regards

 On Tue, Aug 4, 2015 at 12:02 PM, Meihua Wu rotationsymmetr...@gmail.com
 wrote:

 I think the team is preparing for the 1.5 release. Anything to help with
 the QA, testing etc?

 Thanks,

 MW






shane will be OOO 8-5-15 through 8-18-15

2015-08-04 Thread shane knapp
so i done gone and got myself hitched, and will be disappearing in to
the rainy island of kol chang in thailand for the next ~2 weeks.  :)

this means i will be completely out of contact, and have to leave
jenkins in the gentle hands of jon kuroda (a sysadmin here at the lab)
and matt massie (my boss).  they've been CCed on this email, and
briefed on the basic operations, so should be able to maintain things
while i'm gone.

i will ask that during these next couple of weeks that we hold off on
any major system changes and package installations, unless it's a
blocker and needed for any releases.  this is mostly my fault, as i've
not finished porting all of the bash setup scripts (of doom) to
ansible and i'd like to minimize feature drift.

anyways, i'll be back in a couple of weeks, so don't break anything.  :)

shane
