Re: Dedup

2014-10-08 Thread Sonal Goyal
What is your data like? Are you looking at exact matching or are you interested in nearly same records? Do you need to merge similar records to get a canonical value? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Oct 9, 2014 at 2:31

Running GraphX through Java

2014-08-13 Thread Sonal Goyal
Hi, I am trying to run and test some graph APIs using Java. I started with connected components, here is my code. JavaRDD<Edge<Long>> vertices; // code to populate vertices .. .. ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class); ClassTag<Float> floatTag =

Re: Running GraphX through Java

2014-08-13 Thread Sonal Goyal
Hi All, Sorry for reposting this, in the hope of getting some clues. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Aug 13, 2014 at 3:53 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I am trying to run and test some graph apis

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Sonal Goyal
You can take a look at https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java and model your junits based on it. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Jul 29, 2014 at 10:10 PM,

Re: Can Spark stack scale to petabyte scale without performance degradation?

2014-07-15 Thread Sonal Goyal
Hi Rohit, I think the 3rd question on the FAQ may help you. https://spark.apache.org/faq.html Some other links that talk about building bigger clusters and processing more data: http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf

Re: Powered by Spark addition

2014-06-21 Thread Sonal Goyal
Thanks a lot Matei. Sent from my iPad On Jun 22, 2014, at 5:20 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Alright, added you — sorry for the delay. Matei On Jun 12, 2014, at 10:29 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, Can we get added too? Here are the details

Re: How do you run your spark app?

2014-06-19 Thread Sonal Goyal
We use maven for building our code and then invoke spark-submit through the exec plugin, passing in our parameters. Works well for us. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jun 20, 2014 at 3:26 AM, Michael Cutler

Re: How to use spark-submit

2014-05-12 Thread Sonal Goyal
ADD_JARS, --driver-class-path and combinations of extraClassPath. I have deferred that ad-hoc approach to finding a systematic one. 2014-05-08 5:26 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com: I am creating a jar with only my dependencies and run spark-submit through my project mvn build. I

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Sonal Goyal
Hi Andy, I would be interested in setting up a meetup in Delhi/NCR, India. Can you please let me know how to go about organizing it? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Apr 1, 2014 at 10:04 AM, giive chen

Re: Do all classes involving RDD operation need to be registered?

2014-03-29 Thread Sonal Goyal
From my limited knowledge, all classes involved with the RDD operations should extend Serializable if you want Java serialization (the default). However, if you want Kryo serialization, you can use conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); If you also want to perform
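
The Java serialization requirement mentioned in this reply can be checked locally before a job is submitted. The following is only a minimal sketch using the standard library, not Spark itself: the `Person` class and the `roundTrip` helper are hypothetical names introduced for illustration, and the round trip only approximates what happens when objects are shipped between nodes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableCheck {

    // Hypothetical record type standing in for a custom class used in RDD operations.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Writes the value with plain Java serialization and reads it back.
    // A class that does not implement Serializable fails here with
    // java.io.NotSerializableException.
    static <T extends Serializable> T roundTrip(T value)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        Person copy = roundTrip(new Person("Ada", 36));
        System.out.println(copy.name + " " + copy.age);
    }
}
```

A check like this fits naturally into a JUnit test for each custom type, so that a missing `Serializable` on a nested field shows up before the code runs on a cluster.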

Re: Zip or map elements to create new RDD

2014-03-29 Thread Sonal Goyal
zipWithIndex works on the git clone, not sure if it's part of a released version. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Mar 29,

Re: Replicating RDD elements

2014-03-28 Thread Sonal Goyal
Hi David, I am sorry but your question is not clear to me. Are you talking about taking some value and sharing it across your cluster so that it is present on all the nodes? You can look at Spark's broadcasting in that case. On the other hand, if you want to take one item and create an RDD of 100

Re: Not getting it

2014-03-28 Thread Sonal Goyal
Have you tried setting the partitioning? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Mar 27, 2014 at 10:04 AM, lannyripple lanny.rip...@gmail.com wrote: Hi all, I've got something which I think should be straightforward but

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-28 Thread Sonal Goyal
What does your saveRDD contain? If you are using custom objects, they should be serializable. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Mar 29, 2014 at 12:02 AM, pradeeps8 srinivasa.prad...@gmail.com wrote: Hi Aureliano, I
