Re: How is union() implemented? Need to implement column bind

2022-04-21 Thread Sonal Goyal
Seems like an interesting problem to solve! If I have understood it correctly, you have 10114 files, each with the structure (rowid, colA): r1,a; r2,b; r3,c; ... 5 million rows. If you union them, you will have (rowid, colA, colB): r1,a,null; r2,b,null; r3,c,null; r1,null,d; r2,null,e; r3,
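A rough sketch of that approach, assuming just two frames df1(rowid, colA) and df2(rowid, colB); names are illustrative:

    import org.apache.spark.sql.functions.lit

    // align schemas by adding the missing column as null, then stack
    val left  = df1.withColumn("colB", lit(null).cast("string"))
    val right = df2.withColumn("colA", lit(null).cast("string"))
                   .select("rowid", "colA", "colB")   // union matches by position
    val stacked = left.union(right)                   // rowid, colA, colB with nulls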

Re: Profiling spark application

2022-01-19 Thread Sonal Goyal
Hi Prasad, Have you checked the SparkListener - https://mallikarjuna_g.gitbooks.io/spark/content/spark-SparkListener.html ? Cheers, Sonal https://github.com/zinggAI/zingg On Thu, Jan 20, 2022 at 10:49 AM Prasad Bhalerao < prasadbhalerao1...@gmail.com> wrote: > Hello, > > Is there any way we
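A minimal SparkListener sketch along these lines (the metrics printed are illustrative; register it on the active SparkContext):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    class ProfilingListener extends SparkListener {
      // log per-task run time and GC time as tasks finish
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) println(s"task: run=${m.executorRunTime}ms gc=${m.jvmGCTime}ms")
      }
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
        println(s"stage ${stage.stageInfo.stageId} completed")
    }

    sc.addSparkListener(new ProfilingListener)  // sc: the active SparkContext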

Re: about memory size for loading file

2022-01-13 Thread Sonal Goyal
No, it should not. The file would be partitioned and read across the nodes. On Fri, 14 Jan 2022 at 11:48 AM, frakass wrote: > Hello list > > Given the case I have a file whose size is 10GB. The ram of total > cluster is 24GB, three nodes. So the local node has only 8GB. > If I load this file

Re: Joining many tables Re: Pyspark debugging best practices

2022-01-03 Thread Sonal Goyal
Hi Andrew, Do you think the following would work? Build data frames by appending a column source to each (sampleName). Add extra columns as per scheme of quantSchema. Then union. So you get one data frame with many entries per name. You can then use windowing functions over them. On Tue, 4 Jan
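A sketch of that idea, assuming a Seq of (sampleName, DataFrame) pairs that all share quantSchema; the names here are hypothetical:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lit, row_number}

    // tag each frame with its sample name, then stack them all
    val tagged = samples.map { case (name, df) => df.withColumn("source", lit(name)) }
    val all    = tagged.reduce(_ union _)   // one frame, many entries per name

    // windowing over the combined frame, e.g. rank entries per name
    val w = Window.partitionBy("name").orderBy("source")
    val ranked = all.withColumn("rn", row_number().over(w))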

Re: Feature (?): Setting custom parameters for a Spark MLlib pipeline

2021-10-25 Thread Sonal Goyal
> models. > > Cheers, > > Martin > > > > On 2021-10-24 21:16, Sonal Goyal wrote: > > Does MLFlow help you? https://mlflow.org/ > > I don't know if ML flow can save arbitrary key-value pairs and associate > them with a model, but versioning and evaluat

Re: Feature (?): Setting custom parameters for a Spark MLlib pipeline

2021-10-24 Thread Sonal Goyal
Does MLFlow help you? https://mlflow.org/ I don't know if ML flow can save arbitrary key-value pairs and associate them with a model, but versioning and evaluation etc are supported. Cheers, Sonal https://github.com/zinggAI/zingg On Wed, Oct 20, 2021 at 12:59 PM wrote: > Hello, > > This is

Re: How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-14 Thread Sonal Goyal
I see some nice answers at https://stackoverflow.com/questions/46072411/can-i-change-the-nullability-of-a-column-in-my-spark-dataframe On Thu, 14 Oct 2021 at 5:21 PM, ashok34...@yahoo.com.INVALID wrote: > Gurus, > > I have an RDD in PySpark that I can convert to DF through > > df = rdd.toDF() >
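The gist of the approaches in that thread, as a sketch: rebuild the schema with the desired nullable flags and re-create the DataFrame over the same rows.

    import org.apache.spark.sql.types.{StructField, StructType}

    // flip every column to non-nullable (adjust per column as needed)
    val newSchema = StructType(df.schema.map {
      case StructField(name, dtype, _, meta) => StructField(name, dtype, nullable = false, meta)
    })
    val notNullDf = spark.createDataFrame(df.rdd, newSchema)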

Re: [EXTERNAL] [Marketing Mail] Re: [Spark] Optimize spark join on different keys for same data frame

2021-10-06 Thread Sonal Goyal
Have you tried partitioning df1, df2 on key1, join them Partition df3 and result above on key2 Join again That’s the strategy I use and it scales well for me. For reference check getBlocks in https://github.com/zinggAI/zingg/blob/main/core/src/main/java/zingg/Matcher.java On Tue, 5 Oct 2021
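As a sketch (frame and key names hypothetical):

    // pre-partition both sides on key1 before the first join
    val first = df1.repartition(df1("key1"))
      .join(df2.repartition(df2("key1")), "key1")

    // repartition the result and df3 on key2 for the second join
    val result = first.repartition(first("key2"))
      .join(df3.repartition(df3("key2")), "key2")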

[Announcement] Zingg fuzzy matching for entity resolution, deduplication and data mastering

2021-09-13 Thread Sonal Goyal
Hi All, Super stoked to announce open sourcing Zingg, a Spark based tool to build unified customer and supplier profiles and remove duplicates. More details at https://github.com/zinggAI/zingg I do hope some of you will find it useful. Cheers, Sonal https://github.com/zinggAI/zingg

Re: How to submit a job via REST API?

2020-11-24 Thread Sonal Goyal
You should be able to supply the --conf and its values as part of the appArgs argument Cheers, Sonal Nube Technologies Join me at Data Con LA Oct 23 | Big Data Conference Europe. Nov 24 | GIDS AI/ML Dec 3 On Tue, Nov 24, 2020 at 11:31 AM Dennis Suhari wrote: > Hi Yang,

Re: spark cassandra questiom

2020-11-23 Thread Sonal Goyal
Yes, it should be good to use Spark for this use case in my opinion. You can look into using the Cassandra Spark connector for persisting your updated data into Cassandra. Cheers, Sonal Nube Technologies Join me at Data Con LA Oct 23 | Big Data Conference Europe. Nov 24
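For the write side, a sketch with the DataFrame API of the DataStax connector (keyspace and table names are placeholders, and the spark-cassandra-connector package is assumed to be on the classpath):

    df.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
      .mode("append")
      .save()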

Re: mission statement : unified

2020-10-19 Thread Sonal Goyal
My thought is that Spark supports analytics for structured and unstructured data, batch as well as real time. This was pretty revolutionary when Spark first came out. That's where the unified term came from I think. Even after all these years, Spark remains the trusted framework for enterprise

Re: [pyspark 2.3+] Dedupe records

2020-05-29 Thread Sonal Goyal
Hi Rishi, 1. Dataframes are RDDs under the cover. If you have unstructured data, or if you know something about the data through which you can optimize the computation, you can go with RDDs. Else the Dataframes, which are optimized by Spark SQL, should be fine. 2. For incremental deduplication, I

Re: How to populate all possible combination values in columns using Spark SQL

2020-05-06 Thread Sonal Goyal
As mentioned in the comments on SO, can you provide a (masked) sample of the data? It will be easier to see what you are trying to do if you add the year column Thanks, Sonal Nube Technologies On Thu, May 7, 2020 at 10:26 AM

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-24 Thread Sonal Goyal
How does your tree_lookup_value function work? Thanks, Sonal Nube Technologies On Fri, Apr 24, 2020 at 8:47 PM Arjun Chundiran wrote: > Hi Team, > > I have asked this question in stack overflow >

Re: Is RDD thread safe?

2019-11-19 Thread Sonal Goyal
the RDD or the dataframe is distributed and partitioned by Spark so as to leverage all your workers (CPUs) effectively. So all the Dataframe operations are actually happening simultaneously on a section of the data. Why do you want to use threading here? Thanks, Sonal Nube Technologies

Re: Is it possible to rate limit an UDP?

2019-01-09 Thread Sonal Goyal
Have you tried controlling the number of partitions of the dataframe? Say you have 5 partitions, it means you are making 5 concurrent calls to the web service. The throughput of the web service would be your bottleneck and Spark workers would be waiting for tasks, but if you can't control the REST
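A sketch of capping concurrency this way (callRest is a hypothetical per-record REST call):

    // at most 5 tasks, hence at most ~5 concurrent calls to the service
    val throttled = rdd.coalesce(5)
    val enriched  = throttled.mapPartitions(_.map(callRest))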

Re: Error in show()

2018-09-08 Thread Sonal Goyal
It says serialization error - could there be a column value which is not getting parsed as int in one of the rows 31-60? The relevant Python code in serializers.py which is throwing the error is:

def read_int(stream):
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]

Re: [External Sender] How to debug Spark job

2018-09-08 Thread Sonal Goyal
You could also try to profile your program on the executor or driver by using jvisualvm or yourkit to see if there is any memory/cpu optimization you could do. Thanks, Sonal Nube Technologies On Fri, Sep 7, 2018 at 6:35 PM, James

Re: Default Java Opts Standalone

2018-08-30 Thread Sonal Goyal
Hi Eevee, For the executor, have you tried a. Passing --conf "spark.executor.extraJavaOptions=-XX" as part of the spark-submit command line if you want it application specific OR b. Setting spark.executor.extraJavaOptions in conf/spark-default.conf for all jobs. Thanks, Sonal Nube Technologies
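For example, option (a) on the command line might look like this (class, jar, and JVM flags are illustrative):

    spark-submit --class com.example.MyApp \
      --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails" \
      myapp.jar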

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Sonal Goyal
Hi Patrick, Sorry, is there something here that helps you beyond repartition(number of partitions) or calling your udf on foreachPartition? If your data is on disk, Spark is already partitioning it for you by rows. How is adding the host info helping? Thanks, Sonal Nube Technologies

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-24 Thread Sonal Goyal
and I saw that all the >> microbatches last the same time, so it seems that it's related to >> caching these RDD's. >> >> On Thu, Aug 23, 2018 at 15:29, Sonal Goyal () >> wrote: >> >>> How are these small RDDs created? Could the blockage be in thei

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-23 Thread Sonal Goyal
How are these small RDDs created? Could the blockage be in their compute creation instead of their caching? Thanks, Sonal Nube Technologies On Thu, Aug 23, 2018 at 6:38 PM, Guillermo Ortiz wrote: > I use spark with caching with

Re: How to deal with context dependent computing?

2018-08-23 Thread Sonal Goyal
Hi Junfeng, Can you please show by means of an example what you are trying to achieve? Thanks, Sonal Nube Technologies On Thu, Aug 23, 2018 at 8:22 AM, JF Chen wrote: > For example, I have some data with timstamp marked as

Re: Spark application complete it's job successfully on Yarn cluster but yarn register it as failed

2018-06-20 Thread Sonal Goyal
Have you checked the logs - they probably should have some more details. On Wed 20 Jun, 2018, 2:51 PM Soheil Pourbafrani, wrote: > Hi, > > I run a Spark application on Yarn cluster and it complete the process > successfully, but at the end Yarn print in the console: > > client token: N/A >

Re: How can I do the following simple scenario in spark

2018-06-19 Thread Sonal Goyal
Try flatMapToPair instead of flatMap Thanks, Sonal Nube Technologies On Tue, Jun 19, 2018 at 11:08 PM, Soheil Pourbafrani wrote: > Hi, I have a JSON file in the following structure: > ++---+

Re: Process large JSON file without causing OOM

2017-11-13 Thread Sonal Goyal
If you are running Spark with local[*] as master, there will be a single process whose memory will be controlled by --driver-memory command line option to spark submit. Check http://spark.apache.org/docs/latest/configuration.html : spark.driver.memory (default 1g) - Amount of memory to use for the driver

Re: Where can I get few GBs of sample data?

2017-09-28 Thread Sonal Goyal
Here are some links for public data sets https://aws.amazon.com/public-datasets/ https://www.springboard.com/blog/free-public-data-sets-data-science-project/ Thanks, Sonal Nube Technologies On Thu, Sep 28, 2017 at 9:34 PM,

Re: More instances = slower Spark job

2017-09-28 Thread Sonal Goyal
Also check if the compression algorithm you use is splittable? Thanks, Sonal Nube Technologies On Thu, Sep 28, 2017 at 2:17 PM, Tejeshwar J1 < tejeshwar...@globallogic.com.invalid> wrote: > Hi Miller, > > > > Try using > >

Re: Efficient Spark-Submit planning

2017-09-12 Thread Sonal Goyal
Overall the defaults are sensible, but you definitely have to look at your application and optimise a few of them. I mostly refer to the following links when the job is slow or failing, or when we have hardware that we are not fully utilizing. http://spark.apache.org/docs/latest/tuning.html

Re: Reading PDF/text/word file efficiently with Spark

2017-05-23 Thread Sonal Goyal
Hi, Sorry it's not clear to me if you want help moving the data to the cluster or in defining the best structure of your files on the cluster for efficient processing. Are you on standalone or using hdfs? On Tuesday, May 23, 2017, docdwarf wrote: > tesmai4 wrote > > I am

Re: Adding worker dynamically in standalone mode

2017-05-15 Thread Sonal Goyal
If I remember correctly, just start the new worker pointing at the current master. On Monday, May 15, 2017, Seemanto Barua wrote: > Hi > > Is it possible to add a worker dynamically to the master in standalone > mode. If so can you please share the steps on how to ? > Thanks > --

Re: Monitoring the User Metrics for a long running Spark Job

2016-12-07 Thread Sonal Goyal
You can try updating metrics.properties for the sink of your choice. In our case, we add the following for getting application metrics in JSON format over http: *.sink.reifier.class=org.apache.spark.metrics.sink.MetricsServlet Here, we have defined the sink with name reifier and its class is
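A sketch of the corresponding conf/metrics.properties entries; property names follow Spark's metrics.properties.template, and the sink name reifier is from the message above:

    # expose application metrics as JSON over http
    *.sink.reifier.class=org.apache.spark.metrics.sink.MetricsServlet
    *.sink.reifier.path=/metrics/reifier/json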

Re: javac - No such file or directory

2016-11-09 Thread Sonal Goyal
It looks to be an issue with the java compiler, is the jdk setup correctly? Please check your java installation. Thanks, Sonal Nube Technologies On Wed, Nov 9, 2016 at 7:13 PM, Andrew Holway < andrew.hol...@otternetworks.de>

Re: Open source Spark based projects

2016-09-22 Thread Sonal Goyal
https://spark-packages.org/ Thanks, Sonal Nube Technologies On Thu, Sep 22, 2016 at 3:48 PM, Sean Owen wrote: > https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects > and maybe related

Re: How does wholeTextFiles() work in Spark-Hadoop Cluster?

2016-09-08 Thread Sonal Goyal
Are you looking at the worker logs or the driver? On Thursday, September 8, 2016, Nisha Menon wrote: > I have an RDD created as follows: > > *JavaPairRDD inputDataFiles = > sparkContext.wholeTextFiles("hdfs://ip:8020/user/cdhuser/inputFolder/");* > >

Re: [Spark submit] getting error when use properties file parameter in spark submit

2016-09-06 Thread Sonal Goyal
Looks like a classpath issue - Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.s3.AmazonS3 Are you using S3 somewhere? Are the required jars in place? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World

Re: Avoid Cartesian product in calculating a distance matrix?

2016-08-06 Thread Sonal Goyal
The general approach to the Cartesian problem is to first block or index your rows so that similar items fall in the same bucket, and then join within each bucket. Is that possible in your case? On Friday, August 5, 2016, Paschalis Veskos wrote: > Hello everyone, > > I am
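A sketch of blocking before the join; block is a hypothetical keying function, and anything that puts similar rows in the same bucket works:

    def block(s: String): String = s.take(3).toLowerCase  // e.g. first 3 chars

    val a = rddA.map(x => (block(x), x))
    val b = rddB.map(y => (block(y), y))
    // only pairs that share a block are compared, instead of the full cartesian
    val candidates = a.join(b).values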

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Sonal Goyal
Hi Tony, Would hash on the bid work for you? hash(cols: Column *): Column
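Roughly (note hash is a 32-bit hash, so collisions are possible; monotonically_increasing_id is the collision-free alternative):

    import org.apache.spark.sql.functions.hash

    val withId = df.withColumn("id", hash(df("bid")))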

Re: Extracting key word from a textual column

2016-08-02 Thread Sonal Goyal
Hi Mich, It seems like an entity resolution problem - looking at different representations of an entity - SAINSBURY in this case and matching them all together. How dirty is your data in the description - are there stop words like SACAT/SMKT etc you can strip off and get the base retailer entity

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-02 Thread Sonal Goyal
Hi Jestin, Which of your actions is the bottleneck? Is it group by, count or the join? Or all of them? It may help to tune the most time consuming task first. On Monday, August 1, 2016, Nikolay Zhebet wrote: > Yes, Spark always trying to deliver snippet of code to the data

Re: What are using Spark for

2016-08-02 Thread Sonal Goyal
Hi Rohit, You can check the powered by spark page for some real usage of Spark. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark On Tuesday, August 2, 2016, Rohit L wrote: > Hi Everyone, > > I want to know the real world uses cases for which

Re: Any reference of performance tuning on SparkSQL?

2016-07-28 Thread Sonal Goyal
I found some references at http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning http://apache-spark-user-list.1001560.n3.nabble.com/Performance-tuning-in-Spark-SQL-td21871.html HTH Best Regards, Sonal Founder, Nube Technologies Reifier at

Re: Possible to broadcast a function?

2016-06-29 Thread Sonal Goyal
Have you looked at Alluxio? (earlier tachyon) Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Sonal Goyal
What does your application do? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: GraphX Java API

2016-05-31 Thread Sonal Goyal
It's very much possible to use GraphX through Java, though some boilerplate may be needed. Here is an example. Create a graph from edge and vertex RDDs (JavaRDD<Tuple2<Object, Long>> vertices, JavaRDD<Edge<Long>> edges): ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);

Re: Error while saving plots

2016-05-26 Thread Sonal Goyal
Does the path /home/njoshi/dev/outputs/test_/plots/ exist on the driver ? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Sonal Goyal
You can look at ways to group records from both rdds together instead of doing Cartesian. Say generate pair rdd from each with first letter as key. Then do a partition and a join. On May 25, 2016 8:04 PM, "Priya Ch" wrote: > Hi, > RDD A is of size 30MB and RDD B

Re: Spark for offline log processing/querying

2016-05-22 Thread Sonal Goyal
Hi Mat, I think you could also use spark SQL to query the logs. Hope the following link helps https://databricks.com/blog/2014/09/23/databricks-reference-applications.html On May 23, 2016 10:59 AM, "Mat Schaffer" wrote: > I'm curious about trying to use spark as a cheap/slow

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Sonal Goyal
Are you specifying your spark master in the scala program? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: How to add a custom jar file to the Spark driver?

2016-03-09 Thread Sonal Goyal
Hi Gerhard, I just stumbled upon some documentation on EMR - link below. Seems there is a -u option to add jars in S3 to your classpath, have you tried that ? http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html Best Regards, Sonal Founder, Nube

Re: Understanding the Web_UI 4040

2016-03-07 Thread Sonal Goyal
Maybe check the worker logs to see what's going wrong with it? On Mar 7, 2016 9:10 AM, "Angel Angel" wrote: > Hello Sir/Madam, > > > I am running the spark-sql application on the cluster. > In my cluster there are 3 slaves and one Master. > > When i saw the progress of

Re: Spark Mllib kmeans execution

2016-03-02 Thread Sonal Goyal
It will run distributed On Mar 2, 2016 3:00 PM, "Priya Ch" wrote: > Hi All, > > I am running k-means clustering algorithm. Now, when I am running the > algorithm as - > > val conf = new SparkConf > val sc = new SparkContext(conf) > . > . > val kmeans = new

Re: Mllib Logistic Regression performance relative to Mahout

2016-03-01 Thread Sonal Goyal
You can also check if you are caching your input so that features are not being read/computed every iteration. Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: Getting prediction values in spark mllib

2016-02-11 Thread Sonal Goyal
Looks like you are doing binary classification and you are getting the label out. If you clear the model threshold, you should be able to get the raw score. On Feb 11, 2016 1:32 PM, "Chandan Verma" wrote: > > > Following is the code Snippet > > > > > >
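A sketch with an MLlib LogisticRegressionModel (test-set handling illustrative):

    // with the threshold cleared, predict() returns the raw score, not 0/1
    model.clearThreshold()
    val scoresAndLabels = testData.map(p => (model.predict(p.features), p.label))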

Re: How to efficiently Scan (not filter nor lookup) part of Paird RDD or Ordered RDD

2016-01-24 Thread Sonal Goyal
One thing you can also look at is to save your data in a way that can be accessed through file patterns. Eg by hour, zone etc so that you only load what you need. On Jan 24, 2016 10:00 PM, "Ilya Ganelin" wrote: > The solution I normally use is to zipWithIndex() and then use

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Sonal Goyal
I think you should be able to join different rdds with same key. Have you tried that? On Dec 1, 2015 3:30 PM, "Praveen Chundi" wrote: > cogroup could be useful to you, since all three are PairRDD's. > > >

Re: Datastore for GrpahX

2015-11-22 Thread Sonal Goyal
For graphx, you should be able to read and write data from practically any datastore Spark supports - flat files, rdbms, hadoop etc. If you want to save your graph as it is, check something like Neo4j. http://neo4j.com/developer/apache-spark/ Best Regards, Sonal Founder, Nube Technologies

Re: spark-submit stuck and no output in console

2015-11-17 Thread Sonal Goyal
I would suggest a couple of things to try A. Try running the example program with master as local[*]. See if spark can run locally or not. B. Check spark master and worker logs. C. Check if normal hadoop jobs can be run properly on the cluster. D. Check spark master webui and see health of

Re: how can evenly distribute my records in all partition

2015-11-17 Thread Sonal Goyal
Think about how you want to distribute your data and how your keys are spread currently. Do you want to compute something per day, per week etc. Based on that, return a partition number. You could use mod 30 or some such function to get the partitions. On Nov 18, 2015 5:17 AM, "prateek arora"
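A sketch of such a partitioner; 30 partitions, e.g. one per day of the month, is just an example:

    import org.apache.spark.Partitioner

    class DayPartitioner extends Partitioner {
      override def numPartitions: Int = 30
      override def getPartition(key: Any): Int = {
        val mod = key.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod  // hashCode may be negative
      }
    }

    val spread = pairRdd.partitionBy(new DayPartitioner)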

Re: spark-submit stuck and no output in console

2015-11-17 Thread Sonal Goyal
o all the datanodes. That process is still running > without hassle and it's only using 1.3 GB of 1.7g heap space. > > Initially, I submitted 2 jobs to the YARN cluster which was running for 2 > days and suddenly stops. Nothing in the logs shows the root cause. > > > On Tue,

Re: spark-submit stuck and no output in console

2015-11-17 Thread Sonal Goyal
Could it be jdk related ? Which version are you on? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: does spark ML have some thing like createDataPartition() in R caret package ?

2015-11-13 Thread Sonal Goyal
The RDD has a takeSample method where you can supply the flag for sampling with or without replacement, as well as the number of elements to sample (sample takes a fraction instead). On Nov 14, 2015 2:51 AM, "Andy Davidson" wrote: > In R, it's easy to split a data set into training, crossValidation, and > test set. Is there
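For a three-way split like caret's createDataPartition, randomSplit is usually the closer fit; a sketch:

    val Array(train, cv, test) = data.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)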

Re: Spark: How to find similar text title

2015-10-20 Thread Sonal Goyal
Do you want to compare within the rdd or do you have some external list or data coming in ? For matching, you could look at string edit distances or cosine similarity if you are only comparing title strings. On Oct 20, 2015 9:09 PM, "Ascot Moss" wrote: > Hi, > > I have my

Re: In-memory computing and cache() in Spark

2015-10-18 Thread Sonal Goyal
Hi Jia, RDDs are cached on the executor, not on the driver. I am assuming you are running locally and haven't changed spark.executor.memory? Sonal On Oct 19, 2015 1:58 AM, "Jia Zhan" wrote: Anyone has any clue what's going on.? Why would caching with 2g memory much faster

Re: java.io.InvalidClassException using spark1.4.1 for Terasort

2015-10-14 Thread Sonal Goyal
This is probably a versioning issue, are you sure your code is compiling and running against the same versions? On Oct 14, 2015 2:19 PM, "Shreeharsha G Neelakantachar" < shreeharsh...@in.ibm.com> wrote: > Hi, > I have Terasort being executed on spark1.4.1 with hadoop 2.7 for a > datasize of

Re: Web UI is not showing up

2015-09-01 Thread Sonal Goyal
The web ui is at port 8080. 4040 will show up something when you have a running job or if you have configured history server. On Sep 1, 2015 8:57 PM, "Sunil Rathee" wrote: > > Hi, > > > localhost:4040 is not showing anything on the browser. Do we have to start > some

Re: Web UI is not showing up

2015-09-01 Thread Sonal Goyal
? > > On Tue, Sep 1, 2015 at 9:04 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote: > >> The web ui is at port 8080. 4040 will show up something when you have a >> running job or if you have configured history server. >> On Sep 1, 2015 8:57 PM, "Sunil Rathee"

Re: Any quick method to sample rdd based on one filed?

2015-08-28 Thread Sonal Goyal
Filter into true rdd and false rdd. Union true rdd and sample of false rdd. On Aug 28, 2015 2:57 AM, Gavin Yue yue.yuany...@gmail.com wrote: Hey, I have a RDD[(String,Boolean)]. I want to keep all Boolean: True rows and randomly keep some Boolean:false rows. And hope in the final result,
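A sketch, assuming rdd: RDD[(String, Boolean)] and keeping roughly 10% of the false rows:

    val trues  = rdd.filter(_._2)
    val falses = rdd.filter(!_._2).sample(withReplacement = false, fraction = 0.1)
    val result = trues.union(falses)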

Re: Question on take function - Spark Java API

2015-08-26 Thread Sonal Goyal
You can try using wholeTextFiles which will give you a pair rdd of fileName, content. flatMap through this and manipulate the content. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at Spark Summit 2015

Re: Spark

2015-08-25 Thread Sonal Goyal
August 2015 11:10 AM, Sonal Goyal sonalgoy...@gmail.com wrote: I think you could try sorting the endPointsCount and then doing a take. This should be a distributed process and only the result would get returned to the driver. Best Regards, Sonal Founder, Nube Technologies http

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Sonal Goyal
From what I have understood, you probably need to convert your vector to breeze and do your operations there. Check stackoverflow.com/questions/28232829/addition-of-two-rddmllib-linalg-vectors On Aug 25, 2015 7:06 PM, Kristina Rogale Plazonic kpl...@gmail.com wrote: Hi all, I'm still not clear
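The conversion from that thread, sketched as a helper:

    import breeze.linalg.{DenseVector => BDV}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // add two mllib vectors by round-tripping through breeze
    def add(a: Vector, b: Vector): Vector =
      Vectors.dense((BDV(a.toArray) + BDV(b.toArray)).toArray)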

Re: Spark

2015-08-24 Thread Sonal Goyal
I think you could try sorting the endPointsCount and then doing a take. This should be a distributed process and only the result would get returned to the driver. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at Spark Summit 2015

Re: grpah x issue spark 1.3

2015-08-17 Thread Sonal Goyal
I have been using graphx in production on 1.3 and 1.4 with no issues. What's the exception you see and what are you trying to do? On Aug 17, 2015 10:49 AM, dizzy5112 dave.zee...@gmail.com wrote: Hi using spark 1.3 and trying some sample code: when i run: all works well but with it falls

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Sonal Goyal
There seems to be a version mismatch somewhere. You can try and find out the cause with debug serialization information. I think the jvm flag -Dsun.io.serialization.extendedDebugInfo=true should help. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out Reifier at

Re: spark cluster setup

2015-08-03 Thread Sonal Goyal
org.apache.spark.SecurityManager: Changing view acls to: root Thanks. On Mon, Aug 3, 2015 at 11:52 AM, Sonal Goyal sonalgoy...@gmail.com wrote: What do the master logs show? Best Regards, Sonal Founder, Nube Technologies

Re: spark cluster setup

2015-08-02 Thread Sonal Goyal
What do the master logs show? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co Check out

Re: many-to-many join

2015-07-22 Thread Sonal Goyal
If I understand this correctly, you could join area_code_user and area_code_state and then flat map to get user, areacode, state. Then groupby/reduce by user. You can also try some join optimizations like partitioning on area code or broadcasting smaller table depending on size of

Re: ReduceByKey with a byte array as the key

2015-06-11 Thread Sonal Goyal
I think if you wrap the byte[] into an object and implement equals and hashcode methods, you may be able to do this. There will be the overhead of extra object, but conceptually it should work unless I am missing something. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co
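A sketch of such a wrapper in Scala, assuming rdd: RDD[(Array[Byte], Int)]; the Java version is analogous:

    import java.util.Arrays

    // content-based equality so byte[] can serve as a reduceByKey key
    class BytesKey(val bytes: Array[Byte]) extends Serializable {
      override def hashCode: Int = Arrays.hashCode(bytes)
      override def equals(o: Any): Boolean = o match {
        case b: BytesKey => Arrays.equals(bytes, b.bytes)
        case _           => false
      }
    }

    val counts = rdd.map { case (k, v) => (new BytesKey(k), v) }.reduceByKey(_ + _)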

Re: How to process data in chronological order

2015-05-21 Thread Sonal Goyal
Would partitioning your data based on the key and then running mapPartitions help? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, May 21, 2015 at 4:33 AM, roy rp...@njit.edu wrote: I have a key-value RDD, key is a timestamp

Re: dependencies on java-netlib and jblas

2015-05-08 Thread Sonal Goyal
Hi John, I have been using MLLIB without installing the jblas native dependency. Functionally I have not been blocked. I still need to explore if there are any performance hits. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, May 8,

Re: Use with Data justifying Spark

2015-04-01 Thread Sonal Goyal
Maybe check the examples? http://spark.apache.org/examples.html Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Apr 1, 2015 at 8:31 PM, Vila, Didier didier.v...@teradata.com wrote: Good Morning All, I would like to use

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-02-02 Thread Sonal Goyal
/in/sonalgoyal On Sat, Jan 31, 2015 at 4:21 AM, Yifan LI iamyifa...@gmail.com wrote: Yes, I think so, esp. for a pregel application… have any suggestion? Best, Yifan LI On 30 Jan 2015, at 22:25, Sonal Goyal sonalgoy...@gmail.com wrote: Is your code hitting frequent garbage collection

Re: [Graphx Spark] Error of Lost executor and TimeoutException

2015-01-30 Thread Sonal Goyal
Is your code hitting frequent garbage collection? Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Jan 30, 2015 at 7:52 PM, Yifan LI iamyifa...@gmail.com wrote: Hi, I am running my graphx application on Spark 1.2.0(11

Re: MLLib in Production

2014-12-10 Thread Sonal Goyal
You can also serialize the model and use it in other places. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Dec 10, 2014 at 5:32 PM, Yanbo Liang yanboha...@gmail.com wrote: Hi Klaus, There is no ideal method but some

Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Sonal Goyal
Check cogroup. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Nov 13, 2014 at 5:11 PM, Blind Faith person.of.b...@gmail.com wrote: Let us say I have the following two RDDs, with the following key-pair values. rdd1 =

Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Sonal Goyal
Hi Darin, In our case, we were getting the error due to long GC pauses in our app. Fixing the underlying code helped us remove this error. This is also mentioned as point 1 in the link below:

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Sonal Goyal
I believe the Spark Job Server by Ooyala can help you share data across multiple jobs, take a look at http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems to fit closely to what you need. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co

Re: Does spark works on multicore systems?

2014-11-09 Thread Sonal Goyal
Also, the level of parallelism would be affected by how big your input is. Could this be a problem in your case? On Sunday, November 9, 2014, Aaron Davidson ilike...@gmail.com wrote: oops, meant to cc userlist too On Sat, Nov 8, 2014 at 3:13 PM, Aaron Davidson ilike...@gmail.com

Re: Submiting Spark application through code

2014-10-31 Thread Sonal Goyal
What do your worker logs say? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at 11:44 AM, sivarani whitefeathers...@gmail.com wrote: I tried running it but dint work public static final SparkConf batchConf= new

Re: Using a Database to persist and load data from

2014-10-31 Thread Sonal Goyal
I think you can try to use the Hadoop DBOutputFormat Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at 1:00 PM, Kamal Banga ka...@sigmoidanalytics.com wrote: You can also use PairRDDFunctions' saveAsNewAPIHadoopFile

Re: A Spark Design Problem

2014-10-31 Thread Sonal Goyal
Does the following help? JavaPairRDD<bin, key> join with JavaPairRDD<bin, lock> If you partition both RDDs by the bin id, I think you should be able to get what you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at

Re: LinearRegression and model prediction threshold

2014-10-31 Thread Sonal Goyal
You can serialize the model to a local/hdfs file system and use it later when you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Nov 1, 2014 at 12:02 AM, Sean Owen so...@cloudera.com wrote: It sounds like you are asking about

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread Sonal Goyal
Hey Sameer, Wouldn't local[x] run count in parallel across the x threads? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Thu, Oct 30, 2014 at 11:42 PM, Sameer Farooqui same...@databricks.com wrote: Hi Shahab, Are you running Spark

Re: Java api overhead?

2014-10-29 Thread Sonal Goyal
(and by default tries to work with all data in memory) and it's written in Scala, seeing lots of scala Tuple2 is not unexpected. how do these numbers relate to your data size? On Oct 27, 2014 2:26 PM, Sonal Goyal sonalgoy...@gmail.com wrote

Java api overhead?

2014-10-27 Thread Sonal Goyal
Hi, I wanted to understand what kind of memory overheads are expected if at all while using the Java API. My application seems to have a lot of live Tuple2 instances and I am hitting a lot of gc so I am wondering if I am doing something fundamentally wrong. Here is what the top of my heap looks

Re: Rdd of Rdds

2014-10-22 Thread Sonal Goyal
Another approach could be to create artificial keys for each RDD and convert to PairRDDs. So your first RDD becomes JavaPairRDD<Int, String> rdd1 with values 1,1; 1,2 and so on. The second RDD rdd2 becomes 2,a; 2,b; 2,c. You can union the two RDDs, groupByKey, countByKey etc and maybe achieve what

Re: Join with large data set

2014-10-17 Thread Sonal Goyal
Hi Ankur, If your rdds have common keys, you can look at partitioning both your datasets using a custom partitioner based on keys so that you can avoid shuffling and optimize join performance. HTH Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal

Re: key class requirement for PairedRDD ?

2014-10-17 Thread Sonal Goyal
We use our custom classes which are Serializable and have well defined hashcode and equals methods through the Java API. What's the issue you are getting? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 17, 2014 at 12:28 PM, Jaonary

Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Sonal Goyal
Cartesian joins of large datasets are usually going to be slow. If there is a way you can reduce the problem space to make sure you only join subsets with each other, that may be helpful. Maybe if you explain your problem in more detail, people on the list can come up with more suggestions. Best
