Re: Hadoop LR comparison

2014-03-31 Thread Tsai Li Ming
Thanks. What would be the equivalent Hadoop code behind the 110s/0.9s comparison that Spark published? On 1 Apr, 2014, at 2:44 pm, DB Tsai wrote: > Hi Li-Ming, > > This binary logistic regression using SGD is in > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mll

Re: Hadoop LR comparison

2014-03-31 Thread DB Tsai
Hi Li-Ming, This binary logistic regression using SGD is in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala We're working on multinomial logistic regression using Newton and L-BFGS optimizer now. Will be released soon

Hadoop LR comparison

2014-03-31 Thread Tsai Li Ming
Hi, Is the code available for Hadoop to calculate the Logistic Regression hyperplane? I’m looking at the Examples: http://spark.apache.org/examples.html, where there is the 110s vs 0.9s in Hadoop vs Spark comparison. Thanks!

advanced training or implementation assistance

2014-03-31 Thread Livni, Dana
I have some experience with writing applications using Spark. I have also started to use YourKit to try to profile my app and detect performance issues. But I'm not sure if I'm implementing the best practices or how to perform advanced profiling of the code. I'm looking for a ref

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
Another problem I noticed is that the current 1.0.0 git tree still gives me the ClassNotFoundException. I see that the SPARK-1052 is already fixed there. I then modified the pom.xml for mesos and protobuf and that still gave the ClassNotFoundException. I also tried modifying pom.xml only for mes

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Sonal Goyal
Hi Andy, I would be interested in setting up a meetup in Delhi/NCR, India. Can you please let me know how to go about organizing it? Best Regards, Sonal Nube Technologies On Tue, Apr 1, 2014 at 10:04 AM, giive chen wrote: > Hi

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread giive chen
Hi Andy We are from Taiwan. We are already planning to have a Spark meetup. We already have some resources like place and food budget. But we do need some other resource. Please contact me offline. Thanks Wisely Chen On Tue, Apr 1, 2014 at 1:28 AM, Andy Konwinski wrote: > Hi folks, > > We hav

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
I was talking about the protobuf version issue as not fixed. I could not find any reference to the problem or the fix. Reg. SPARK-1052, I could pull in the fix into my 0.9.0 tree (from the tar ball on the website) and I see the fix in the latest git. Thanks On 01-Apr-2014, at 3:28 am, deric w

Re: network wordcount example

2014-03-31 Thread Chris Fregly
@eric- i saw this exact issue recently while working on the KinesisWordCount. are you passing "local[2]" to your example as the MASTER arg versus just "local" or "local[1]"? you need at least 2. it's documented as "n>1" in the scala source docs - which is easy to mistake for n>=1. i just ran t
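The "local[2]" point above can be sketched as a small check. This is plain Python, not Spark code, and the helper names `local_threads` and `enough_for_receiver` are hypothetical. It assumes the usual explanation for the rule: a socket receiver occupies one thread, so at least one more is needed to process the data.

```python
import re


def local_threads(master):
    """Return the thread count implied by a "local" or "local[n]" master string."""
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\d+)\]", master)
    if match is None:
        raise ValueError("not a local[n] master string: %r" % master)
    return int(match.group(1))


def enough_for_receiver(master):
    """True when one thread can run the receiver and at least one is left for processing."""
    return local_threads(master) >= 2


if __name__ == "__main__":
    for master in ("local", "local[1]", "local[2]"):
        print(master, enough_for_receiver(master))
```

Only plain "local" and "local[n]" strings are handled here; any other master string raises an error rather than guessing.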

Calling Spark enthusiasts in Austin, TX

2014-03-31 Thread Ognen Duzlevski
In the spirit of everything being bigger and better in TX ;) => if anyone is in Austin and interested in meeting up over Spark - contact me! There seems to be a Spark meetup group in Austin that has never met and my initial email to organize the first gathering was never acknowledged. Ognen On

Re: batching the output

2014-03-31 Thread Patrick Wendell
Ya this is a good way to do it. On Sun, Mar 30, 2014 at 10:11 PM, Vipul Pandey wrote: > Hi, > > I need to batch the values in my final RDD before writing out to hdfs. The > idea is to batch multiple "rows" in a protobuf and write those batches out > - mostly to save some space as a lot of metad
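The batching idea in the quoted question (grouping several rows before writing them out) can be sketched in plain Python. The approach Vipul actually described is cut off in this archive, so this is only an illustration under assumptions, not his code; `batch_rows` is a hypothetical helper. In Spark, a function like this would typically be applied to each partition's iterator, for example through mapPartitions.

```python
from itertools import islice


def batch_rows(rows, batch_size):
    """Yield lists of up to batch_size consecutive rows from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # Take the next batch_size items; an empty list means the input is exhausted.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in batch_rows(range(5), 2):
        print(batch)
```

Each yielded list could then be packed into one output record, which is where the saving on per-record metadata would come from.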

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-31 Thread Patrick Wendell
Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be getting pulled in unless you are directly using akka yourself. Are you? Does your project have other dependencies that might be indirectly pulling in protobuf 2.4.1? It would be helpful if you could list all of your depende

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread deric
Which repository do you use? The issue should be fixed in 0.9.1 and 1.0.0 https://spark-project.atlassian.net/browse/SPARK-1052 There's an old repository https://github.com/apache/incubator-spark and as Spark became one of top level pr

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
Your suggestion took me past the ClassNotFoundException. I then hit an akka.actor.ActorNotFound exception. I patched PR 568 into my 0.9.0 spark codebase and everything worked. So thanks a lot, Tim. Is there a JIRA/PR for the protobuf issue? Why is it not fixed in the latest git tree? Thanks.

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Denny Lee
If you have any questions on helping to get a Spark Meetup off the ground, please do not hesitate to ping me (denny.g@gmail.com).  I helped jump start the one here in Seattle (and tangentially have been helping the Vancouver and Denver ones as well).  HTH! On March 31, 2014 at 12:35:38 PM,

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Patrick Grinaway
Also in NYC, definitely interested in a spark meetup! Sent from my iPhone > On Mar 31, 2014, at 3:07 PM, Jeremy Freeman wrote: > > Happy to help with an NYC meet up (just emailed Andy). I recently moved to > VA, but am back in NYC quite often, and have been turning several > computational peo

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Jeremy Freeman
Happy to help with an NYC meet up (just emailed Andy). I recently moved to VA, but am back in NYC quite often, and have been turning several computational people at Columbia / NYU / Simons Foundation onto Spark; there'd definitely be interest in those communities. -- Jeremy ---

Re: Calling Spahk enthusiasts in Boston

2014-03-31 Thread Nick Pentreath
I would offer to host one in Cape Town but we're almost certainly the only Spark users in the country apart from perhaps one in Johannesburg :) On Mon, Mar 31, 2014 at 8:53 PM, Nicholas Chammas wrote: > My fellow Bostonians and New Englanders, > We cannot allow New

Calling Spahk enthusiasts in Boston

2014-03-31 Thread Nicholas Chammas
My fellow Bostonians and New Englanders, We cannot allow New York to beat us to having a banging Spark meetup. Respond to me (and I guess also Andy?) if you are interested. Yana, I'm not sure either what is involved in organizing, but we can figure it out. I didn't know about the meetup that ne

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Yana Kadiyska
Nicholas, I'm in Boston and would be interested in a Spark group. Not sure if you know this -- there was a meetup that never got off the ground. Anyway, I'd be +1 for attending. Not sure what is involved in organizing. Seems a shame that a city like Boston doesn't have one. On Mon, Mar 31, 2014 at

Re: how spark dstream handles congestion?

2014-03-31 Thread Dong Mo
Thanks -Mo 2014-03-31 13:16 GMT-05:00 Evgeny Shishkin : > > On 31 Mar 2014, at 21:05, Dong Mo wrote: > > > Dear list, > > > > I was wondering how Spark handles congestion when the upstream is > generating dstreams faster than downstream workers can handle? > > It will eventually OOM. > >

Re: how spark dstream handles congestion?

2014-03-31 Thread Evgeny Shishkin
On 31 Mar 2014, at 21:05, Dong Mo wrote: > Dear list, > > I was wondering how Spark handles congestion when the upstream is generating > dstreams faster than downstream workers can handle? It will eventually OOM.

how spark dstream handles congestion?

2014-03-31 Thread Dong Mo
Dear list, I was wondering how Spark handles congestion when the upstream is generating dstreams faster than downstream workers can handle? Thanks -Mo

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread François Le Lay
Hi Andy, NYC speaking! Pretty sure we can come up with something here. Let's discuss offline! François On Mon, Mar 31, 2014 at 1:52 PM, Chris Gore wrote: > We'd love to see a Spark user group in Los Angeles and connect with others > working with it here. > > Ping me if you're in the LA area

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Nicholas Chammas
As in, I am interested in helping organize a Spark meetup in the Boston area. On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Well, since this thread has played out as it has, lemme throw in a > shout-out for Boston. > > > On Mon, Mar 31, 2014 at 1:52 PM,

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Nicholas Chammas
Well, since this thread has played out as it has, lemme throw in a shout-out for Boston. On Mon, Mar 31, 2014 at 1:52 PM, Chris Gore wrote: > We'd love to see a Spark user group in Los Angeles and connect with others > working with it here. > > Ping me if you're in the LA area and use Spark at

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Tim St Clair
It sounds like the protobuf issue. So FWIW, You might want to try updating the 0.9.0 w/pom mods for mesos & protobuf. mesos 0.17.0 & protobuf 2.5 Cheers, Tim - Original Message - > From: "Bharath Bhushan" > To: user@spark.apache.org > Sent: Monday, March 31, 2014 9:46:32 AM > Sub

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Chris Gore
We'd love to see a Spark user group in Los Angeles and connect with others working with it here. Ping me if you're in the LA area and use Spark at your company ( ch...@retentionscience.com ). Chris Retention Science On Mar 31,

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Andy Konwinski
Responses about London, Montreal/Toronto, DC, Chicago. Great coverage so far, and keep 'em coming! (still looking for an NYC connection) I'll reply to each of you off-list to coordinate next-steps for setting up a Spark meetup in your home area. Thanks again, this is super exciting. Andy On Mo

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Anurag Dodeja
How about Chicago? On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu wrote: > Montreal or Toronto? > > > On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson wrote: > >> How about London? >> >> >> -- >> Martin Goodson | VP Data Science >> (0)20 3397 1240 >> >> >> On Mon, Mar 31,

Re: network wordcount example

2014-03-31 Thread Diana Carroll
Not sure what data you are sending in. You could try calling "lines.print()" instead which should just output everything that comes in on the stream. Just to test that your socket is receiving what you think you are sending. On Mon, Mar 31, 2014 at 12:18 PM, eric perler wrote: > Hello > > i ju

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Nan Zhu
Montreal or Toronto? On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson wrote: > How about London? > > > -- > Martin Goodson | VP Data Science > (0)20 3397 1240 > > > On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski > wrote: > >> Hi folks, >> >> We have seen a lot of com

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Martin Goodson
How about London? -- Martin Goodson | VP Data Science (0)20 3397 1240 On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski wrote: > Hi folks, > > We have seen a lot of community growth outside of the Bay Area and we are > looking to help spur even more! > > For starters,

Calling Spark enthusiasts in NYC

2014-03-31 Thread Andy Konwinski
Hi folks, We have seen a lot of community growth outside of the Bay Area and we are looking to help spur even more! For starters, the organizers of the Spark meetups here in the Bay Area want to help anybody that is interested in setting up a meetup in a new city. Some amazing Spark champions ha

Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
OK sweet. Thanks for walking me through that. I wish this were StackOverflow so I could bestow some nice rep on all you helpful people. On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson wrote: > Note that you may have minSplits set to more than the number of cores in > the cluster, and Spark wil

Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Aaron Davidson
Note that you may have minSplits set to more than the number of cores in the cluster, and Spark will just run as many as possible at a time. This is better if certain nodes may be slow, for instance. In general, it is not necessarily the case that doubling the number of cores doing IO will double

Re: Error in SparkSQL Example

2014-03-31 Thread Michael Armbrust
> Thanks for the clarification. My question is about the error above "error: > class $iwC needs to be abstract" > This is a fairly confusing scala REPL (interpreter) error. Under the covers, to run the line you entered into the interpreter, scala is creating an object called $iwC with your code i

Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
So setting minSplits will set the parallelism on the read in SparkContext.textFile(), assuming I have the cores in the cluster to deliver that level of parallelism. And if I don't explicitly

Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Aaron Davidson
Spark will only use each core for one task at a time, so doing sc.textFile(<path>, <num reducers>) where you set "num reducers" to at least as many as the total number of cores in your cluster, is about as fast as you can get out of the box. Same goes for saveAsTextFile. On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Cham
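The "one task per core at a time" rule above implies some simple arithmetic, sketched here in plain Python. The helper name `task_waves` is hypothetical, and the model is a simplification that assumes every task takes about the same time; real reads from S3 will not behave this evenly.

```python
import math


def task_waves(num_splits, total_cores):
    """Estimate how many rounds of tasks are needed when each core runs one task at a time."""
    if num_splits < 0 or total_cores < 1:
        raise ValueError("need num_splits >= 0 and total_cores >= 1")
    return math.ceil(num_splits / total_cores)


if __name__ == "__main__":
    for splits in (4, 8, 9, 16):
        print(splits, "splits on 8 cores:", task_waves(splits, 8), "wave(s)")
```

With fewer splits than cores, some cores sit idle during the single wave, which is why the advice is to set the split count to at least the number of cores.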

Re: groupBy RDD does not have grouping column ?

2014-03-31 Thread Manoj Samel
Thanks, that works. It wasn't clear if the second part is just the aggregate specification or any expression. On Mon, Mar 31, 2014 at 9:03 AM, Michael Armbrust wrote: > This is similar to how SQL works, items in the GROUP BY clause are not > included in the output by default. You will need to

Re: Error in SparkSQL Example

2014-03-31 Thread Manoj Samel
Hi Michael, Thanks for the clarification. My question is about the error above "error: class $iwC needs to be abstract" and what the RDD brings, since I can do the DSL without the "people: people: org.apache.spark.rdd.RDD[Person]" Thanks, On Mon, Mar 31, 2014 at 9:13 AM, Michael Armbrust w

network wordcount example

2014-03-31 Thread eric perler
Hello i just started working with spark today... and i am trying to run the wordcount network example i created a socket server and client.. and i am sending data to the server in an infinite loop when i run the spark class.. i see this output in the console... ---

Re: Error in SparkSQL Example

2014-03-31 Thread Michael Armbrust
"val people: RDD[Person] // An RDD of case class objects, from the first example." is just a placeholder to avoid cluttering up each example with the same code for creating an RDD. The ": RDD[Person]" is just there to let you know the expected type of the variable 'people'. Perhaps there is a cle

Re: SparkSQL "where" with BigDecimal type gives stacktrace

2014-03-31 Thread Michael Armbrust
This was not intentional, here is a JIRA https://issues.apache.org/jira/browse/SPARK-1364 Note that you can create big decimals by using the Decimal type in a HiveContext. Date is not yet a supported data type. On Sun, Mar 30, 2014 at 5:35 PM, Manoj Samel wrote: > Hi, > > Would the same issue

Re: groupBy RDD does not have grouping column ?

2014-03-31 Thread Michael Armbrust
This is similar to how SQL works, items in the GROUP BY clause are not included in the output by default. You will need to include 'a in the second parameter list (which is similar to the SELECT clause) as well if you want it included in the output. On Sun, Mar 30, 2014 at 9:52 PM, Manoj Samel w
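The GROUP BY behaviour described above can be illustrated with a small stand-in in plain Python. This is not the SparkSQL DSL; `group_sum` and its `include_key` flag are hypothetical names used only to show that the grouping column appears in the output only when it is explicitly selected.

```python
from collections import defaultdict


def group_sum(rows, key, value, include_key=False):
    """Group dict rows by `key` and sum `value`; emit the key column only if asked to."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    if include_key:
        # Comparable to also listing the grouping column in the SELECT clause.
        return [{key: group, "sum": total} for group, total in totals.items()]
    # Comparable to selecting only the aggregate: the grouping column is absent.
    return [{"sum": total} for total in totals.values()]


if __name__ == "__main__":
    rows = [{"a": 1, "b": 10}, {"a": 1, "b": 5}, {"a": 2, "b": 7}]
    print(group_sum(rows, "a", "b"))
    print(group_sum(rows, "a", "b", include_key=True))
```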

Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-31 Thread Michael Armbrust
> > * unionAll preserve duplicate v/s union that does not > This is true, if you want to eliminate duplicate items you should follow the union with a distinct() > * SQL union and unionAll result in same output format i.e. another SQL v/s > different RDD types here. > * Understand the existing un
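The difference between keeping and eliminating duplicates can be sketched with ordinary Python lists. This is not the SchemaRDD API; `union_all` and `distinct` are hypothetical stand-ins that mirror the semantics described above. Unlike this sketch, a distributed distinct() should not be assumed to preserve any particular order.

```python
def union_all(left, right):
    """Concatenate two collections, keeping every duplicate."""
    return list(left) + list(right)


def distinct(rows):
    """Drop repeated rows, keeping the first occurrence of each (rows must be hashable)."""
    return list(dict.fromkeys(rows))


if __name__ == "__main__":
    combined = union_all([1, 2], [2, 3])
    print(combined)            # duplicates kept
    print(distinct(combined))  # duplicates removed
```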

Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
Howdy-doody, I have a single, very large file sitting in S3 that I want to read in with sc.textFile(). What are the best practices for reading in this file as quickly as possible? How do I parallelize the read as much as possible? Similarly, say I have a single, very large RDD sitting in memory t

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
I tried 0.9.0 and the latest git tree of spark. For mesos, I tried 0.17.0 and the latest git tree. Thanks On 31-Mar-2014, at 7:24 pm, Tim St Clair wrote: > What versions are you running? > > There is a known protobuf 2.5 mismatch, depending on your versions. > > Cheers, > Tim > > -

yarn.application.classpath in yarn-site.xml

2014-03-31 Thread Dan
Hi, I've just tested spark in yarn mode, but something made me confused. When I *delete* the "yarn.application.classpath" configuration in yarn-site.xml, the following command works well. *bin/spark-class org.apache.spark.deploy.yarn.Client --jar examples/target/scala-2.10/spark-examples_2.10-ass

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Tim St Clair
What versions are you running? There is a known protobuf 2.5 mismatch, depending on your versions. Cheers, Tim - Original Message - > From: "Bharath Bhushan" > To: user@spark.apache.org > Sent: Monday, March 31, 2014 8:16:19 AM > Subject: java.lang.ClassNotFoundException - spark on m

java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
I am facing different kinds of java.lang.ClassNotFoundException when trying to run spark on mesos. One error has to do with org.apache.spark.executor.MesosExecutorBackend. Another has to do with org.apache.spark.serializer.JavaSerializer. I see other people complaining about similar issues. I

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-31 Thread pradeeps8
Hi Sonal, There are no custom objects in saveRDD, it is of type RDD[(String, String)]. Thanks, Pradeep -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3508.html Sent from the Apache Spa

Re: Task not serializable?

2014-03-31 Thread Daniel Liu
Hi I am new to Spark and I encountered this error when I tried to map RDD[A] => RDD[Array[Double]] and then collect the results. A is a custom class that extends Serializable. (Actually it's just a wrapper class which wraps a few variables that are all serializable). I also tried KryoSerializer according t