Re: Replicating RDD elements

2014-03-28 Thread Sonal Goyal
Hi David, I am sorry but your question is not clear to me. Are you talking about taking some value and sharing it across your cluster so that it is present on all the nodes? You can look at Spark's broadcasting in that case. On the other hand, if you want to take one item and create an RDD of 100
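A minimal sketch of the broadcast option mentioned above, assuming an existing JavaSparkContext `sc`; the values and names are illustrative.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Ship one value to every node once; each task then reads it locally.
Broadcast<Integer> shared = sc.broadcast(42);
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> scaled = numbers.map(n -> n * shared.value());
```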

Re: Not getting it

2014-03-28 Thread Sonal Goyal
Have you tried setting the partitioning? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Mar 27, 2014 at 10:04 AM, lannyripple lanny.rip...@gmail.com wrote: Hi all, I've got something which I think should be straightforward but

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-28 Thread Sonal Goyal
What does your saveRDD contain? If you are using custom objects, they should be serializable. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Mar 29, 2014 at 12:02 AM, pradeeps8 srinivasa.prad...@gmail.com wrote: Hi Aureliano, I

Re: Do all classes involving RDD operation need to be registered?

2014-03-29 Thread Sonal Goyal
From my limited knowledge, all classes involved with the RDD operations should extend Serializable if you want Java serialization (the default). However, if you want Kryo serialization, you can use conf.set(spark.serializer,org.apache.spark.serializer.KryoSerializer); If you also want to perform
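A minimal sketch of the Kryo setup described above. Note that registerKryoClasses was added to SparkConf in later releases (1.2+), and MyRecord is a hypothetical user class standing in for whatever you serialize.

```java
import org.apache.spark.SparkConf;

// Switch the serializer from the Java default to Kryo.
SparkConf conf = new SparkConf()
    .setAppName("kryo-demo")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// Optional: registering classes up front keeps the serialized records compact.
conf.registerKryoClasses(new Class<?>[]{ MyRecord.class });
```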

Re: Zip or map elements to create new RDD

2014-03-29 Thread Sonal Goyal
zipWithIndex works on the git clone, not sure if it's part of a released version. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Mar 29,
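For reference, a small sketch of zipWithIndex through the Java API (it shipped in later released versions), assuming an existing JavaSparkContext `sc`. When the RDD has more than one partition, the index assignment triggers a Spark job to compute partition sizes.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));
// Pairs each element with its ordinal position across all partitions.
JavaPairRDD<String, Long> indexed = lines.zipWithIndex();
```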

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Sonal Goyal
Hi Andy, I would be interested in setting up a meetup in Delhi/NCR, India. Can you please let me know how to go about organizing it? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Apr 1, 2014 at 10:04 AM, giive chen

Re: How to use spark-submit

2014-05-12 Thread Sonal Goyal
ADD_JARS, --driver-class-path and combinations of extraClassPath. I have deferred that ad-hoc approach to finding a systematic one. 2014-05-08 5:26 GMT-07:00 Sonal Goyal sonalgoy...@gmail.com: I am creating a jar with only my dependencies and run spark-submit through my project mvn build. I

Re: How do you run your spark app?

2014-06-19 Thread Sonal Goyal
We use maven for building our code and then invoke spark-submit through the exec plugin, passing in our parameters. Works well for us. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jun 20, 2014 at 3:26 AM, Michael Cutler

Re: Powered by Spark addition

2014-06-21 Thread Sonal Goyal
Thanks a lot Matei. Sent from my iPad On Jun 22, 2014, at 5:20 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Alright, added you — sorry for the delay. Matei On Jun 12, 2014, at 10:29 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, Can we get added too? Here are the details

Re: Can Spark stack scale to petabyte scale without performance degradation?

2014-07-15 Thread Sonal Goyal
Hi Rohit, I think the 3rd question on the FAQ may help you. https://spark.apache.org/faq.html Some other links that talk about building bigger clusters and processing more data: http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Sonal Goyal
You can take a look at https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java and model your junits based on it. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Tue, Jul 29, 2014 at 10:10 PM,

Running GraphX through Java

2014-08-13 Thread Sonal Goyal
Hi, I am trying to run and test some graph APIs using Java. I started with connected components, here is my code. JavaRDD<Edge<Long>> vertices; ///code to populate vertices .. .. ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class); ClassTag<Float> floatTag =

Re: Running GraphX through Java

2014-08-13 Thread Sonal Goyal
Hi All, Sorry for reposting this, in the hope of getting some clues. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Aug 13, 2014 at 3:53 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I am trying to run and test some graph APIs

Re: Dedup

2014-10-08 Thread Sonal Goyal
What is your data like? Are you looking at exact matching or are you interested in nearly same records? Do you need to merge similar records to get a canonical value? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Oct 9, 2014 at 2:31

Re: Join with large data set

2014-10-17 Thread Sonal Goyal
Hi Ankur, If your rdds have common keys, you can look at partitioning both your datasets using a custom partitioner based on keys so that you can avoid shuffling and optimize join performance. HTH Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal
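A sketch of the co-partitioning idea above, assuming two hypothetical pair RDDs (leftRaw, rightRaw) keyed the same way. Once both sides share the same partitioner, matching keys are already co-located and the join avoids a full shuffle of each dataset.

```java
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Partition both sides with the same partitioner so matching keys co-locate.
HashPartitioner byKey = new HashPartitioner(64);
JavaPairRDD<String, Integer> left = leftRaw.partitionBy(byKey).cache();
JavaPairRDD<String, String> right = rightRaw.partitionBy(byKey).cache();
// Joining two co-partitioned RDDs does not re-shuffle either input.
JavaPairRDD<String, Tuple2<Integer, String>> joined = left.join(right);
```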

Re: key class requirement for PairedRDD ?

2014-10-17 Thread Sonal Goyal
We use our custom classes which are Serializable and have well-defined hashCode and equals methods through the Java API. What's the issue you are getting? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 17, 2014 at 12:28 PM, Jaonary

Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Sonal Goyal
Cartesian joins of large datasets are usually going to be slow. If there is a way you can reduce the problem space to make sure you only join subsets with each other, that may be helpful. Maybe if you explain your problem in more detail, people on the list can come up with more suggestions. Best

Re: Rdd of Rdds

2014-10-22 Thread Sonal Goyal
Another approach could be to create artificial keys for each RDD and convert to PairRDDs. So your first RDD becomes JavaPairRDD<Int,String> rdd1 with values (1,1); (1,2) and so on, and the second RDD rdd2 becomes (2,a); (2,b); (2,c). You can union the two RDDs, groupByKey, countByKey etc. and maybe achieve what
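A short sketch of that keying trick, assuming rdd1 and rdd2 are the JavaRDD<String> inputs described above; the constant keys 1 and 2 just tag which source each element came from.

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Tag each source RDD with a constant key, then combine them as one pair RDD.
JavaPairRDD<Integer, String> tagged1 = rdd1.mapToPair(v -> new Tuple2<>(1, v));
JavaPairRDD<Integer, String> tagged2 = rdd2.mapToPair(v -> new Tuple2<>(2, v));
JavaPairRDD<Integer, Iterable<String>> grouped =
    tagged1.union(tagged2).groupByKey();
```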

Java api overhead?

2014-10-27 Thread Sonal Goyal
Hi, I wanted to understand what kind of memory overheads are expected if at all while using the Java API. My application seems to have a lot of live Tuple2 instances and I am hitting a lot of gc so I am wondering if I am doing something fundamentally wrong. Here is what the top of my heap looks

Re: Java api overhead?

2014-10-29 Thread Sonal Goyal
(and by default tries to work with all data in memory) and it's written in Scala, so seeing lots of scala Tuple2 is not unexpected. How do these numbers relate to your data size? On Oct 27, 2014 2:26 PM, Sonal Goyal sonalgoy...@gmail.com wrote

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread Sonal Goyal
Hey Sameer, Wouldn't local[x] run count in parallel in each of the x threads? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Oct 30, 2014 at 11:42 PM, Sameer Farooqui same...@databricks.com wrote: Hi Shahab, Are you running Spark

Re: Submiting Spark application through code

2014-10-31 Thread Sonal Goyal
What do your worker logs say? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at 11:44 AM, sivarani whitefeathers...@gmail.com wrote: I tried running it but it didn't work public static final SparkConf batchConf= new

Re: Using a Database to persist and load data from

2014-10-31 Thread Sonal Goyal
I think you can try to use the Hadoop DBOutputFormat Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at 1:00 PM, Kamal Banga ka...@sigmoidanalytics.com wrote: You can also use PairRDDFunctions' saveAsNewAPIHadoopFile

Re: A Spark Design Problem

2014-10-31 Thread Sonal Goyal
Does the following help? JavaPairRDD<bin,key> join with JavaPairRDD<bin,lock> If you partition both RDDs by the bin id, I think you should be able to get what you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at

Re: LinearRegression and model prediction threshold

2014-10-31 Thread Sonal Goyal
You can serialize the model to a local/hdfs file system and use it later when you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Nov 1, 2014 at 12:02 AM, Sean Owen so...@cloudera.com wrote: It sounds like you are asking about

Re: Does spark works on multicore systems?

2014-11-09 Thread Sonal Goyal
Also, the level of parallelism would be affected by how big your input is. Could this be a problem in your case? On Sunday, November 9, 2014, Aaron Davidson ilike...@gmail.com wrote: oops, meant to cc userlist too On Sat, Nov 8, 2014 at 3:13 PM, Aaron Davidson ilike...@gmail.com

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Sonal Goyal
I believe the Spark Job Server by Ooyala can help you share data across multiple jobs, take a look at http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems to fit closely to what you need. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co

Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Sonal Goyal
Check cogroup. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Nov 13, 2014 at 5:11 PM, Blind Faith person.of.b...@gmail.com wrote: Let us say I have the following two RDDs, with the following key-pair values. rdd1 =

Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Sonal Goyal
Hi Darin, In our case, we were getting the error due to long GC pauses in our app. Fixing the underlying code helped us remove this error. This is also mentioned as point 1 in the link below:

Re: MLLib in Production

2014-12-10 Thread Sonal Goyal
You can also serialize the model and use it in other places. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Dec 10, 2014 at 5:32 PM, Yanbo Liang yanboha...@gmail.com wrote: Hi Klaus, There is no ideal method but some
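One way to reuse the trained model, as a sketch: it assumes Spark 1.3+, where MLlib models implement Saveable (for the 1.2-era releases current when this thread ran, plain Java serialization of the model object is the alternative). `sc` is a JavaSparkContext and the path is illustrative.

```java
import org.apache.spark.mllib.classification.LogisticRegressionModel;

// Persist the trained model, then load it back in another application.
model.save(sc.sc(), "hdfs:///models/lr-v1");
LogisticRegressionModel restored =
    LogisticRegressionModel.load(sc.sc(), "hdfs:///models/lr-v1");
```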

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-02-02 Thread Sonal Goyal
On Sat, Jan 31, 2015 at 4:21 AM, Yifan LI iamyifa...@gmail.com wrote: Yes, I think so, esp. for a pregel application… have any suggestion? Best, Yifan LI On 30 Jan 2015, at 22:25, Sonal Goyal sonalgoy...@gmail.com wrote: Is your code hitting frequent garbage collection

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-01-30 Thread Sonal Goyal
Is your code hitting frequent garbage collection? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jan 30, 2015 at 7:52 PM, Yifan LI iamyifa...@gmail.com wrote: Hi, I am running my graphx application on Spark 1.2.0(11

Re: dependencies on java-netlib and jblas

2015-05-08 Thread Sonal Goyal
Hi John, I have been using MLlib without installing the jblas native dependency. Functionally I have not gotten stuck. I still need to explore if there are any performance hits. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, May 8,

Re: Use with Data justifying Spark

2015-04-01 Thread Sonal Goyal
Maybe check the examples? http://spark.apache.org/examples.html Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Apr 1, 2015 at 8:31 PM, Vila, Didier didier.v...@teradata.com wrote: Good Morning All, I would like to use

Re: How to process data in chronological order

2015-05-21 Thread Sonal Goyal
Would partitioning your data based on the key and then running mapPartitions help? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, May 21, 2015 at 4:33 AM, roy rp...@njit.edu wrote: I have a key-value RDD, key is a timestamp

Re: ReduceByKey with a byte array as the key

2015-06-11 Thread Sonal Goyal
I think if you wrap the byte[] into an object and implement equals and hashCode methods, you may be able to do this. There will be the overhead of an extra object, but conceptually it should work unless I am missing something. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co
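A minimal sketch of such a wrapper; the class name is illustrative. Value-based equals/hashCode (via java.util.Arrays) is what makes the byte[] usable as a reduceByKey key.

```java
import java.io.Serializable;
import java.util.Arrays;

// Wraps a byte[] so it can serve as a key with value semantics.
public class BytesKey implements Serializable {
    private final byte[] bytes;

    public BytesKey(byte[] bytes) { this.bytes = bytes; }

    @Override
    public boolean equals(Object o) {
        return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
    }

    @Override
    public int hashCode() { return Arrays.hashCode(bytes); }
}
// usage: rdd.mapToPair(b -> new Tuple2<>(new BytesKey(b), 1)).reduceByKey(Integer::sum)
```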

Re: grpah x issue spark 1.3

2015-08-17 Thread Sonal Goyal
I have been using graphx in production on 1.3 and 1.4 with no issues. What's the exception you see and what are you trying to do? On Aug 17, 2015 10:49 AM, dizzy5112 dave.zee...@gmail.com wrote: Hi using spark 1.3 and trying some sample code: when i run: all works well but with it falls

Re: spark cluster setup

2015-08-03 Thread Sonal Goyal
org.apache.spark.SecurityManager: Changing view acls to: root Thanks. On Mon, Aug 3, 2015 at 11:52 AM, Sonal Goyal sonalgoy...@gmail.com wrote: What do the master logs show? Best Regards, Sonal Founder, Nube Technologies

Re: spark cluster setup

2015-08-02 Thread Sonal Goyal
What do the master logs show? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Sonal Goyal
There seems to be a version mismatch somewhere. You can try and find out the cause with debug serialization information. I think the jvm flag -Dsun.io.serialization.extendedDebugInfo=true should help. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at

Re: many-to-many join

2015-07-22 Thread Sonal Goyal
If I understand this correctly, you could join area_code_user and area_code_state and then flat map to get user, areacode, state. Then groupby/reduce by user. You can also try some join optimizations like partitioning on area code or broadcasting smaller table depending on size of

Re: does spark ML have some thing like createDataPartition() in R caret package ?

2015-11-13 Thread Sonal Goyal
The RDD has a takeSample method where you can supply the flag for replacement or not as well as the number of elements to sample. On Nov 14, 2015 2:51 AM, "Andy Davidson" wrote: > In R, it's easy to split a data set into training, crossValidation, and > test set. Is there
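A sketch of both options, assuming a hypothetical JavaRDD<LabeledPoint> named data. takeSample collects its result to the driver; randomSplit is the closer analogue of caret's createDataPartition.

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;

// Draw 1000 elements without replacement; the result comes back to the driver.
List<LabeledPoint> sampled = data.takeSample(false, 1000, 42L);

// A caret-like 60/20/20 split into training, cross-validation and test sets.
JavaRDD<LabeledPoint>[] parts = data.randomSplit(new double[]{0.6, 0.2, 0.2}, 42L);
JavaRDD<LabeledPoint> training = parts[0], crossValidation = parts[1], test = parts[2];
```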

Re: spark-submit stuck and no output in console

2015-11-17 Thread Sonal Goyal
I would suggest a couple of things to try: A. Try running the example program with master as local[*]. See if Spark can run locally or not. B. Check Spark master and worker logs. C. Check if normal Hadoop jobs can be run properly on the cluster. D. Check the Spark master web UI and see health of

Re: In-memory computing and cache() in Spark

2015-10-18 Thread Sonal Goyal
Hi Jia, RDDs are cached on the executor, not on the driver. I am assuming you are running locally and haven't changed spark.executor.memory? Sonal On Oct 19, 2015 1:58 AM, "Jia Zhan" wrote: Anyone has any clue what's going on.? Why would caching with 2g memory much faster

Re: java.io.InvalidClassException using spark1.4.1 for Terasort

2015-10-14 Thread Sonal Goyal
This is probably a versioning issue, are you sure your code is compiling and running against the same versions? On Oct 14, 2015 2:19 PM, "Shreeharsha G Neelakantachar" < shreeharsh...@in.ibm.com> wrote: > Hi, > I have Terasort being executed on spark1.4.1 with hadoop 2.7 for a > datasize of

Re: Spark: How to find similar text title

2015-10-20 Thread Sonal Goyal
Do you want to compare within the rdd or do you have some external list or data coming in? For matching, you could look at string edit distances or cosine similarity if you are only comparing title strings. On Oct 20, 2015 9:09 PM, "Ascot Moss" wrote: > Hi, > > I have my

Re: how can evenly distribute my records in all partition

2015-11-17 Thread Sonal Goyal
Think about how you want to distribute your data and how your keys are spread currently. Do you want to compute something per day, per week etc. Based on that, return a partition number. You could use mod 30 or some such function to get the partitions. On Nov 18, 2015 5:17 AM, "prateek arora"
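A sketch of the mod-based routing suggested above; it assumes the key is (or maps to) a boxed Long day number, and the class name is illustrative.

```java
import org.apache.spark.Partitioner;

// Route each record by day number so data spreads evenly over 30 partitions.
public class DayPartitioner extends Partitioner {
    @Override
    public int numPartitions() { return 30; }

    @Override
    public int getPartition(Object key) {
        // Assumes a non-negative Long key, e.g. a day index derived upstream.
        return (int) (((Long) key) % 30);
    }
}
// usage: pairRdd.partitionBy(new DayPartitioner())
```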

Re: spark-submit stuck and no output in console

2015-11-17 Thread Sonal Goyal
o all the datanodes. That process is still running > without hassle and it's only using 1.3 GB of 1.7g heap space. > > Initially, I submitted 2 jobs to the YARN cluster which was running for 2 > days and suddenly stops. Nothing in the logs shows the root cause. > > > On Tue,

Re: spark-submit stuck and no output in console

2015-11-17 Thread Sonal Goyal
Could it be jdk related ? Which version are you on? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: Question on take function - Spark Java API

2015-08-26 Thread Sonal Goyal
You can try using wholeTextFiles which will give you a pair RDD of fileName, content. flatMap through this and manipulate the content. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at Spark Summit 2015

Re: Spark

2015-08-24 Thread Sonal Goyal
I think you could try sorting the endPointsCount and then doing a take. This should be a distributed process and only the result would get returned to the driver. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at Spark Summit 2015

Re: Spark

2015-08-25 Thread Sonal Goyal
August 2015 11:10 AM, Sonal Goyal sonalgoy...@gmail.com wrote: I think you could try sorting the endPointsCount and then doing a take. This should be a distributed process and only the result would get returned to the driver. Best Regards, Sonal Founder, Nube Technologies http

Re: Web UI is not showing up

2015-09-01 Thread Sonal Goyal
The web ui is at port 8080. 4040 will show up something when you have a running job or if you have configured history server. On Sep 1, 2015 8:57 PM, "Sunil Rathee" wrote: > > Hi, > > > localhost:4040 is not showing anything on the browser. Do we have to start > some

Re: Web UI is not showing up

2015-09-01 Thread Sonal Goyal
? > > On Tue, Sep 1, 2015 at 9:04 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote: > >> The web ui is at port 8080. 4040 will show up something when you have a >> running job or if you have configured history server. >> On Sep 1, 2015 8:57 PM, "Sunil Rathee"

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Sonal Goyal
From what I have understood, you probably need to convert your vector to breeze and do your operations there. Check stackoverflow.com/questions/28232829/addition-of-two-rddmllib-linalg-vectors On Aug 25, 2015 7:06 PM, Kristina Rogale Plazonic kpl...@gmail.com wrote: Hi all, I'm still not clear

Re: Any quick method to sample rdd based on one filed?

2015-08-28 Thread Sonal Goyal
Filter into true rdd and false rdd. Union true rdd and sample of false rdd. On Aug 28, 2015 2:57 AM, Gavin Yue yue.yuany...@gmail.com wrote: Hey, I have a RDD[(String,Boolean)]. I want to keep all Boolean: True rows and randomly keep some Boolean:false rows. And hope in the final result,
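A short sketch of that recipe, assuming rdd is the JavaPairRDD<String, Boolean> from the question and a 10% sample fraction for the false rows (the fraction is illustrative).

```java
import org.apache.spark.api.java.JavaPairRDD;

// Keep every true row and a random ~10% sample of the false rows.
JavaPairRDD<String, Boolean> trues = rdd.filter(t -> t._2());
JavaPairRDD<String, Boolean> falses = rdd.filter(t -> !t._2());
JavaPairRDD<String, Boolean> result = trues.union(falses.sample(false, 0.1));
```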

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Sonal Goyal
I think you should be able to join different rdds with same key. Have you tried that? On Dec 1, 2015 3:30 PM, "Praveen Chundi" wrote: > cogroup could be useful to you, since all three are PairRDD's. > > >

Re: Datastore for GrpahX

2015-11-22 Thread Sonal Goyal
For graphx, you should be able to read and write data from practically any datastore Spark supports - flat files, rdbms, hadoop etc. If you want to save your graph as it is, check something like Neo4j. http://neo4j.com/developer/apache-spark/ Best Regards, Sonal Founder, Nube Technologies

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Sonal Goyal
You can look at ways to group records from both RDDs together instead of doing a Cartesian. Say, generate a pair RDD from each with the first letter as the key, then partition and join. On May 25, 2016 8:04 PM, "Priya Ch" wrote: > Hi, > RDD A is of size 30MB and RDD B
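A sketch of that blocking idea, assuming two hypothetical JavaRDD<String> inputs rddA and rddB; the join then only pairs records that share a block instead of the full cross product.

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Key both sides by a cheap blocking function (first letter here).
JavaPairRDD<Character, String> blockedA =
    rddA.mapToPair(s -> new Tuple2<>(s.charAt(0), s));
JavaPairRDD<Character, String> blockedB =
    rddB.mapToPair(s -> new Tuple2<>(s.charAt(0), s));
// Only records within the same block are joined.
JavaPairRDD<Character, Tuple2<String, String>> candidates = blockedA.join(blockedB);
```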

Re: GraphX Java API

2016-05-31 Thread Sonal Goyal
It's very much possible to use GraphX through Java, though some boilerplate may be needed. Here is an example. Create a graph from edge and vertex RDDs (JavaRDD<Tuple2<Object, Long>> vertices, JavaRDD<Edge<Long>> edges) ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
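A fuller sketch of that boilerplate, assuming `vertices` and `edges` are already-populated JavaRDDs of the types shown above. The ClassTags stand in for the implicits the Scala compiler would normally supply; Graph.apply is called through the Scala companion's static forwarder.

```java
import org.apache.spark.graphx.Edge;
import org.apache.spark.graphx.Graph;
import org.apache.spark.storage.StorageLevel;
import scala.reflect.ClassTag;

// Manufacture the ClassTag Scala would infer implicitly.
ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
// Build the graph from the underlying Scala RDDs of the Java RDDs.
Graph<Long, Long> graph = Graph.apply(
    vertices.rdd(),               // RDD of (vertexId, attribute) tuples
    edges.rdd(),                  // RDD of Edge<Long>
    0L,                           // default vertex attribute
    StorageLevel.MEMORY_ONLY(),
    StorageLevel.MEMORY_ONLY(),
    longTag, longTag);
// GraphOps exposes the algorithms, e.g. connected components.
Graph<Object, Long> components = graph.ops().connectedComponents();
```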

Re: Error while saving plots

2016-05-26 Thread Sonal Goyal
Does the path /home/njoshi/dev/outputs/test_/plots/ exist on the driver ? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Sonal Goyal
What does your application do? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: Getting prediction values in spark mllib

2016-02-11 Thread Sonal Goyal
Looks like you are doing binary classification and you are getting the label out. If you clear the model threshold, you should be able to get the raw score. On Feb 11, 2016 1:32 PM, "Chandan Verma" wrote: > > > Following is the code Snippet > > > > > >
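As a sketch of the suggestion above, assuming a trained SVMModel named model and a feature Vector named features; the same clearThreshold call exists on LogisticRegressionModel.

```java
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.linalg.Vector;

// With the threshold cleared, predict() returns the raw score
// rather than the thresholded 0/1 label.
model.clearThreshold();
double rawScore = model.predict(features);
```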

Re: How to efficiently Scan (not filter nor lookup) part of Paird RDD or Ordered RDD

2016-01-24 Thread Sonal Goyal
One thing you can also look at is to save your data in a way that can be accessed through file patterns. Eg by hour, zone etc so that you only load what you need. On Jan 24, 2016 10:00 PM, "Ilya Ganelin" wrote: > The solution I normally use is to zipWithIndex() and then use

Re: Mllib Logistic Regression performance relative to Mahout

2016-03-01 Thread Sonal Goyal
You can also check if you are caching your input so that features are not being read/computed every iteration. Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: How to add a custom jar file to the Spark driver?

2016-03-09 Thread Sonal Goyal
Hi Gerhard, I just stumbled upon some documentation on EMR - link below. Seems there is a -u option to add jars in S3 to your classpath, have you tried that ? http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html Best Regards, Sonal Founder, Nube

Re: Understanding the Web_UI 4040

2016-03-07 Thread Sonal Goyal
Maybe check the worker logs to see what's going wrong with it? On Mar 7, 2016 9:10 AM, "Angel Angel" wrote: > Hello Sir/Madam, > > > I am running the spark-sql application on the cluster. > In my cluster there are 3 slaves and one Master. > > When i saw the progress of

Re: Spark Mllib kmeans execution

2016-03-02 Thread Sonal Goyal
It will run distributed. On Mar 2, 2016 3:00 PM, "Priya Ch" wrote: > Hi All, > > I am running k-means clustering algorithm. Now, when I am running the > algorithm as - > > val conf = new SparkConf > val sc = new SparkContext(conf) > . > . > val kmeans = new

Re: Spark for offline log processing/querying

2016-05-22 Thread Sonal Goyal
Hi Mat, I think you could also use spark SQL to query the logs. Hope the following link helps https://databricks.com/blog/2014/09/23/databricks-reference-applications.html On May 23, 2016 10:59 AM, "Mat Schaffer" wrote: > I'm curious about trying to use spark as a cheap/slow

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Sonal Goyal
Are you specifying your spark master in the scala program? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: Any reference of performance tuning on SparkSQL?

2016-07-28 Thread Sonal Goyal
I found some references at http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning http://apache-spark-user-list.1001560.n3.nabble.com/Performance-tuning-in-Spark-SQL-td21871.html HTH Best Regards, Sonal Founder, Nube Technologies Reifier at

Re: Extracting key word from a textual column

2016-08-02 Thread Sonal Goyal
Hi Mich, It seems like an entity resolution problem - looking at different representations of an entity - SAINSBURY in this case and matching them all together. How dirty is your data in the description - are there stop words like SACAT/SMKT etc you can strip off and get the base retailer entity

Re: Avoid Cartesian product in calculating a distance matrix?

2016-08-06 Thread Sonal Goyal
The general approach to the Cartesian problem is to first block or index your rows so that similar items fall in the same bucket, and then join within each bucket. Is that possible in your case? On Friday, August 5, 2016, Paschalis Veskos wrote: > Hello everyone, > > I am

Re: What are using Spark for

2016-08-02 Thread Sonal Goyal
Hi Rohit, You can check the powered by spark page for some real usage of Spark. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark On Tuesday, August 2, 2016, Rohit L wrote: > Hi Everyone, > > I want to know the real world uses cases for which

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-02 Thread Sonal Goyal
Hi Jestin, Which of your actions is the bottleneck? Is it group by, count or the join? Or all of them? It may help to tune the most time-consuming task first. On Monday, August 1, 2016, Nikolay Zhebet wrote: > Yes, Spark always trying to deliver snippet of code to the data

Re: Possible to broadcast a function?

2016-06-29 Thread Sonal Goyal
Have you looked at Alluxio? (earlier tachyon) Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Sonal Goyal
Hi Tony, Would hash on the bid work for you? hash(cols: Column*): Column
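A sketch of that suggestion in the DataFrame API, assuming Spark 2.0+ (where functions.hash was added) and a hypothetical DataFrame df with a bid column.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.hash;

// Derive an id column from bid without dropping to RDDs and joining back.
// Note: hash() is deterministic but not collision-free across distinct values.
Dataset<Row> withId = df.withColumn("bid_hash", hash(df.col("bid")));
```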

Re: [Spark submit] getting error when use properties file parameter in spark submit

2016-09-06 Thread Sonal Goyal
Looks like a classpath issue - Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.s3.AmazonS3 Are you using S3 somewhere? Are the required jars in place? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World

Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-08 Thread Sonal Goyal
Are you looking at the worker logs or the driver? On Thursday, September 8, 2016, Nisha Menon wrote: > I have an RDD created as follows: > > *JavaPairRDD inputDataFiles = > sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");* > >

Re: Open source Spark based projects

2016-09-22 Thread Sonal Goyal
https://spark-packages.org/ Thanks, Sonal Nube Technologies On Thu, Sep 22, 2016 at 3:48 PM, Sean Owen wrote: > https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects > and maybe related

Re: javac - No such file or directory

2016-11-09 Thread Sonal Goyal
It looks to be an issue with the Java compiler; is the JDK set up correctly? Please check your Java installation. Thanks, Sonal Nube Technologies On Wed, Nov 9, 2016 at 7:13 PM, Andrew Holway < andrew.hol...@otternetworks.de>

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-07 Thread Sonal Goyal
You can try updating metrics.properties for the sink of your choice. In our case, we add the following for getting application metrics in JSON format over HTTP: *.sink.reifier.class=org.apache.spark.metrics.sink.MetricsServlet Here, we have defined the sink with name reifier and its class is

Re: Adding worker dynamically in standalone mode

2017-05-15 Thread Sonal Goyal
If I remember correctly, just start the worker pointing it at the current master. On Monday, May 15, 2017, Seemanto Barua wrote: > Hi > > Is it possible to add a worker dynamically to the master in standalone > mode. If so can you please share the steps on how to ? > Thanks > --

Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread Sonal Goyal
Hi, Sorry it's not clear to me if you want help moving the data to the cluster or in defining the best structure of your files on the cluster for efficient processing. Are you on standalone or using hdfs? On Tuesday, May 23, 2017, docdwarf wrote: > tesmai4 wrote > > I am

Re: Efficient Spark-Submit planning

2017-09-12 Thread Sonal Goyal
Overall the defaults are sensible, but you definitely have to look at your application and optimise a few of them. I mostly refer to the following links when the job is slow or failing or we have more hardware which we see we are not utilizing. http://spark.apache.org/docs/latest/tuning.html

Re: Where can I get few GBs of sample data?

2017-09-28 Thread Sonal Goyal
Here are some links for public data sets https://aws.amazon.com/public-datasets/ https://www.springboard.com/blog/free-public-data-sets-data-science-project/ Thanks, Sonal Nube Technologies On Thu, Sep 28, 2017 at 9:34 PM,

Re: More instances = slower Spark job

2017-09-28 Thread Sonal Goyal
Also check whether the compression algorithm you use is splittable. Thanks, Sonal Nube Technologies On Thu, Sep 28, 2017 at 2:17 PM, Tejeshwar J1 < tejeshwar...@globallogic.com.invalid> wrote: > Hi Miller, > > > > Try using > >

Re: Process large JSON file without causing OOM

2017-11-13 Thread Sonal Goyal
If you are running Spark with local[*] as master, there will be a single process whose memory will be controlled by the --driver-memory command line option to spark-submit. Check http://spark.apache.org/docs/latest/configuration.html: spark.driver.memory (default 1g): Amount of memory to use for the driver

Re: How can I do the following simple scenario in spark

2018-06-19 Thread Sonal Goyal
Try flatMapToPair instead of flatMap Thanks, Sonal Nube Technologies On Tue, Jun 19, 2018 at 11:08 PM, Soheil Pourbafrani wrote: > Hi, I have a JSON file in the following structure: > ++---+
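A small sketch of the flatMapToPair switch, assuming a hypothetical JavaRDD<String> named lines; the comma-split is illustrative. In Spark 2.x the function returns an Iterator (earlier releases expected an Iterable).

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Emit zero or more key/value pairs per input record.
JavaPairRDD<String, Integer> pairs = lines.flatMapToPair(line -> {
    List<Tuple2<String, Integer>> out = new ArrayList<>();
    for (String token : line.split(",")) {
        out.add(new Tuple2<>(token, 1));
    }
    return out.iterator();
});
```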

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Sonal Goyal
Hi Patrick, Sorry, is there something here that helps you beyond repartition(number of partitions) or calling your udf on foreachPartition? If your data is on disk, Spark is already partitioning it for you by rows. How is adding the host info helping? Thanks, Sonal Nube Technologies

Re: How to deal with context dependent computing?

2018-08-23 Thread Sonal Goyal
Hi Junfeng, Can you please show by means of an example what you are trying to achieve? Thanks, Sonal Nube Technologies On Thu, Aug 23, 2018 at 8:22 AM, JF Chen wrote: > For example, I have some data with timstamp marked as

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-23 Thread Sonal Goyal
How are these small RDDs created? Could the blockage be in their compute creation instead of their caching? Thanks, Sonal Nube Technologies On Thu, Aug 23, 2018 at 6:38 PM, Guillermo Ortiz wrote: > I use spark with caching with

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-24 Thread Sonal Goyal
and I saw that all the >> microbatches last the same time, so it seems that it's relation with >> caching these RDD's. >> >> El jue., 23 ago. 2018 a las 15:29, Sonal Goyal () >> escribió: >> >>> How are these small RDDs created? Could the blockage be in thei

Re: Default Java Opts Standalone

2018-08-30 Thread Sonal Goyal
Hi Eevee, For the executor, have you tried a. Passing --conf "spark.executor.extraJavaOptions=-XX" as part of the spark-submit command line if you want it application specific OR b. Setting spark.executor.extraJavaOptions in conf/spark-default.conf for all jobs. Thanks, Sonal Nube Technologies

Re: [External Sender] How to debug Spark job

2018-09-08 Thread Sonal Goyal
You could also try to profile your program on the executor or driver by using jvisualvm or yourkit to see if there is any memory/cpu optimization you could do. Thanks, Sonal Nube Technologies On Fri, Sep 7, 2018 at 6:35 PM, James

Re: Error in show()

2018-09-08 Thread Sonal Goyal
It says serialization error - could there be a column value which is not getting parsed as int in one of the rows 31-60? The relevant Python code in serializers.py which is throwing the error is:
    def read_int(stream):
        length = stream.read(4)
        if not length:
            raise EOFError
        return

Re: Spark application complete it's job successfully on Yarn cluster but yarn register it as failed

2018-06-20 Thread Sonal Goyal
Have you checked the logs - they probably should have some more details. On Wed 20 Jun, 2018, 2:51 PM Soheil Pourbafrani, wrote: > Hi, > > I run a Spark application on Yarn cluster and it complete the process > successfully, but at the end Yarn print in the console: > > client token: N/A >

Re: Is it possible to rate limit an UDP?

2019-01-09 Thread Sonal Goyal
Have you tried controlling the number of partitions of the dataframe? Say you have 5 partitions, it means you are making 5 concurrent calls to the web service. The throughput of the web service would be your bottleneck and Spark workers would be waiting for tasks, but if you can't control the REST

Re: Is RDD thread safe?

2019-11-19 Thread Sonal Goyal
the RDD or the dataframe is distributed and partitioned by Spark so as to leverage all your workers (CPUs) effectively. So all the Dataframe operations are actually happening simultaneously on a section of the data. Why do you want to use threading here? Thanks, Sonal Nube Technologies

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-24 Thread Sonal Goyal
How does your tree_lookup_value function work? Thanks, Sonal Nube Technologies On Fri, Apr 24, 2020 at 8:47 PM Arjun Chundiran wrote: > Hi Team, > > I have asked this question in stack overflow >
