Tuning spark job to make count faster.

2021-04-06 Thread Krishna Chakka
Hi, I am working on a Spark job. The count() function alone takes 10 minutes. The question is: how can I make it faster? From the above image, what I understood is that 4,001 tasks are running in parallel. Total tasks are 76,553. Here are the parameters that I am using for
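
No resolution appears in this snippet; purely as a hedged illustration of the knobs usually involved in speeding up a count(), with placeholder values that are not recommendations from this thread:

    from pyspark.sql import SparkSession

    # Illustrative settings only; the thread does not show which parameters
    # the author actually used.
    spark = (SparkSession.builder
             .appName("count-tuning-sketch")
             .config("spark.sql.shuffle.partitions", "4000")
             .getOrCreate())

    df = spark.read.parquet("hdfs:///data/events")   # placeholder input path
    df.cache()              # only pays off if df is reused after the count
    print(df.count())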

unsubscribe

2020-04-24 Thread vijay krishna

unsubscribe

2020-01-17 Thread vijay krishna

connectivity

2019-12-01 Thread Krishna Chandran Nair
Hi Team, Can anyone provide sample code to connect to ADLS on Azure using Azure Key Vault (user-managed key)?
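
No sample code appears in the thread, so here is a hedged sketch assuming ADLS Gen2 (abfss), the hadoop-azure connector on the classpath, and the azure-identity / azure-keyvault-secrets Python packages; every angle-bracketed name is a placeholder, and the property names follow the hadoop-azure ABFS docs, so check them against your version:

    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient
    from pyspark.sql import SparkSession

    # Fetch the service-principal secret from Key Vault instead of hard-coding it.
    vault = SecretClient(vault_url="https://<vault-name>.vault.azure.net",
                         credential=DefaultAzureCredential())
    client_secret = vault.get_secret("<secret-name>").value

    acct = "<storage-account>.dfs.core.windows.net"
    spark = (SparkSession.builder
             .config(f"spark.hadoop.fs.azure.account.auth.type.{acct}", "OAuth")
             .config(f"spark.hadoop.fs.azure.account.oauth.provider.type.{acct}",
                     "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
             .config(f"spark.hadoop.fs.azure.account.oauth2.client.id.{acct}",
                     "<application-id>")
             .config(f"spark.hadoop.fs.azure.account.oauth2.client.secret.{acct}",
                     client_secret)
             .config(f"spark.hadoop.fs.azure.account.oauth2.client.endpoint.{acct}",
                     "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
             .getOrCreate())

    df = spark.read.csv("abfss://<container>@" + acct + "/path/to/data")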

RE: [External]Re: error while connecting to azure blob storage

2019-08-23 Thread Krishna Chandran Nair
I sent you a fake token and key. From: Julien Laurenceau Sent: 23 August 2019 6:06 PM To: Krishna Chandran Nair Cc: user@spark.apache.org Subject: [External]Re: error while connecting to azure blob storage Hi, did you just publicly disclose a real token to your blob storage? If so, please

RE: [External]Re: error while connecting to azure blob storage

2019-08-23 Thread Krishna Chandran Nair
Please find the attached error From: Roland Johann Sent: 23 August 2019 10:51 AM To: Krishna Chandran Nair Cc: user@spark.apache.org Subject: [External]Re: error while connecting to azure blob storage Hi Krishna, there seems to be no attachment. In addition, you should NEVER post private

RE: error while connecting to azure blob storage

2019-08-23 Thread Krishna Chandran Nair
Hi Team, I have written a small code to connect to azure blob storage but go error. I have attached the error log. Please help Calling command -- ./spark-submit stg.py --jars

Failed to create file system watcher service: User limit of inotify instances reached or too many open files

2018-08-22 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hi, when I am running calculations for, say, 700 listIDs, it saves only about 50 rows and then throws seemingly random exceptions. I get the exception below when I try to run calculations on huge data and save it. Please let me know if you have any suggestions. Sample code: I have some

java.nio.file.FileSystemException: /tmp/spark- .._cache : No space left on device

2018-08-17 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hi, I am getting the exception below when I run spark-submit on a Linux machine; can someone give a quick solution with commands? Driver stacktrace: - Job 0 failed: count at DailyGainersAndLosersPublisher.scala:145, took 5.749450 s org.apache.spark.SparkException: Job aborted due to stage failure: Task 4

Scala Partition Question

2018-06-12 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hello, can I do complex data manipulations inside a groupBy? I.e., I want to group my whole dataframe by a column and then do some processing for each group.
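
Two common patterns for per-group processing in PySpark, sketched with made-up data: built-in aggregations when they are enough, and applyInPandas (Spark 3.x; the 2018-era equivalent was a grouped-map pandas UDF, and the second pattern needs pyarrow installed):

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("b", 5.0)], ["key", "value"])

    # 1) Built-in aggregations cover many "process each group" cases.
    df.groupBy("key").agg(F.avg("value").alias("avg"),
                          F.count("*").alias("n")).show()

    # 2) Arbitrary per-group logic: each group arrives as a pandas DataFrame.
    def demean(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf["value"] = pdf["value"] - pdf["value"].mean()
        return pdf

    df.groupBy("key").applyInPandas(demean, schema=df.schema).show()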

Re: Structured Streaming on Kubernetes

2018-04-16 Thread Krishna Kalyan
directories. However, I’m unaware of any specific use of streaming with the Spark on Kubernetes integration right now. Would be curious to get feedback on the failover behavior right now.

Structured Streaming on Kubernetes

2018-04-13 Thread Krishna Kalyan
is a relatively new addition to spark, I was wondering if structured streaming is stable in production. We were also evaluating Apache Beam with Flink. Regards, Krishna

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-09 Thread Gopala Krishna Manchukonda
Hi Junfeng , Is your kafka topic partitioned? Are you referring to the duration or the CPU time spent by the job as being 20% - 50% higher than running in local? Thanks & Regards Gopal > On 09-Apr-2018, at 11:42 AM, Jörn Franke wrote: > > Probably network /

Unable to build spark documentation

2017-01-11 Thread Krishna Kalyan
://gist.github.com/krishnakalyan3/08f00f49a943e43600cbc6b21f307228 Could someone please advise on how to go about resolving this error? Regards, Krishna

Unsubscribe

2016-12-16 Thread krishna ramachandran
Unsubscribe

Contributing to PySpark

2016-10-18 Thread Krishna Kalyan
Hello, I am a master's student. Could someone please let me know how to set up my dev environment to contribute to PySpark? Questions I had were: a) Should I use IntelliJ IDEA or PyCharm? b) How do I test my changes? Regards, Krishna

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
that does the data split and the datasets where they are allocated to. Cheers On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar <ksanka...@gmail.com> wrote: > Hi, > Looks like the test-dataset has different sizes for X & Y. Possible > steps: > >1. W

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi, Looks like the test-dataset has different sizes for X & Y. Possible steps: 1. What is the test-data-size ? - If it is 15,909, check the prediction variable vector - it is now 29,471, should be 15,909 - If you expect it to be 29,471, then the X Matrix is not right.

Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-25 Thread Krishna Sankar
This intrigued me as well. - Just to be sure, I downloaded the 1.6.2 code and recompiled. - spark-shell and pyspark both show 1.6.2 as expected. Cheers On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Another possible explanation is that by

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Krishna Sankar
Thanks Nick. I also ran into this issue. VG, One workaround is to drop the NaN from predictions (df.na.drop()) and then use the dataset for the evaluator. In real life, probably detect the NaN and recommend most popular on some window. HTH. Cheers On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath
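
A minimal PySpark sketch of the workaround described above (dropping NaN predictions before evaluating); the train/test DataFrames and column names are assumptions, not from the thread:

    from pyspark.ml.recommendation import ALS
    from pyspark.ml.evaluation import RegressionEvaluator

    # train/test are assumed DataFrames with userId, movieId, rating columns.
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating")
    model = als.fit(train)
    predictions = model.transform(test)

    # Users/items unseen during training yield NaN predictions, which makes
    # the RMSE itself NaN; drop them before evaluating.
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    rmse = evaluator.evaluate(predictions.na.drop())

Newer Spark releases also expose coldStartStrategy="drop" on ALS for the same effect.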

Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
I want to apply null comparison to a column in sqlcontext.sql, is there any way to achieve this? On Jul 10, 2016 8:55 PM, "Radha krishna" <grkmc...@gmail.com> wrote: > Ok thank you, how to achieve the requirement. > > On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen &

Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
Ok thank you, how to achieve the requirement. On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen <so...@cloudera.com> wrote: > It doesn't look like you have a NULL field, You have a string-value > field with an empty string. > > On Sun, Jul 10, 2016 at 3:19 PM, Radha krishna <grkm

IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
AS| | 16| | | 13| UK| | 14| US| | 20| As| | 15| IN| | 19| IR| | 11| PK| +---+----+ I am expecting the one below; any idea how to apply IS NOT NULL?

+---+----+
|_c0|code|
+---+----+
| 18|  AS|
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---+----+

Thanks & Regards Radha krishna
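
As the replies above point out, the blank values are empty strings rather than SQL NULLs, so IS NOT NULL alone will not drop them; a small sketch (modern PySpark API, column names taken from the thread):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, trim

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(18, "AS"), (16, ""), (13, "UK")],
                               ["_c0", "code"])

    # Filter out both real NULLs and empty/blank strings.
    cleaned = df.filter(col("code").isNotNull() & (trim(col("code")) != ""))

    # Equivalent in SQL:
    df.createOrReplaceTempView("codes")
    spark.sql(
        "SELECT * FROM codes WHERE code IS NOT NULL AND trim(code) <> ''").show()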

Re: Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Radha krishna
Hi Mich, Here I given just a sample data, I have some GB's of files in HDFS and performing left outer joins on those files, and the final result I am going to store in Vertica data base table. There is no duplicate columns in the target table but for the non matching rows columns I want to insert

Re: Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Radha krishna
DataFrame empDF = sqlContext.createDataFrame(empRDD, Emp.class);
        empDF.registerTempTable("EMP");

        sqlContext.sql("SELECT * FROM EMP e LEFT OUTER JOIN DEPT d ON e.deptid = d.deptid").show();

        // empDF.join(deptDF, empDF.col("deptid").equalTo(deptDF.col("deptid")), "leftouter").show();
    }
    catch (Exception e) {
        System.out.println(e);
    }
}

public static Emp getInstance(String[] parts, Emp emp) throws ParseException {
    emp.setId(parts[0]);
    emp.setName(parts[1]);
    emp.setDeptid(parts[2]);
    return emp;
}

public static Dept getInstanceDept(String[] parts, Dept dept) throws ParseException {
    dept.setDeptid(parts[0]);
    dept.setDeptname(parts[1]);
    return dept;
}
}

Input
Emp
1001 aba 10
1002 abs 20
1003 abd 10
1004 abf 30
1005 abg 10
1006 abh 20
1007 abj 10
1008 abk 30
1009 abl 20
1010 abq 10

Dept
10 dev
20 Test
30 IT

Output
+------+-----+----+------+--------+
|deptid|   id|name|deptid|deptname|
+------+-----+----+------+--------+
|    10| 1001| aba|    10|     dev|
|    10| 1003| abd|    10|     dev|
|    10| 1005| abg|    10|     dev|
|    10| 1007| abj|    10|     dev|
|    10| 1010| abq|    10|     dev|
|    20| 1002| abs|  null|    null|
|    20| 1006| abh|  null|    null|
|    20| 1009| abl|  null|    null|
|    30| 1004| abf|  null|    null|
|    30| 1008| abk|  null|    null|
+------+-----+----+------+--------+

Thanks & Regards Radha krishna

Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Radha krishna
Hi All, Please check below for the code, input, and output. I think the output is not correct; am I missing anything? Please guide. Code: public class Test { private static JavaSparkContext jsc = null; private static SQLContext sqlContext = null; private static Configuration hadoopConf = null;

How to write the DataFrame results back to HDFS with other then \n as record separator

2016-06-28 Thread Radha krishna
o hdfs with the same line separator (RS[\u001e]) Thanks & Regards Radha krishna

How to write the DataFrame results back to HDFS with other then \n as record separator

2016-06-24 Thread Radha krishna
) using Java. Can anyone suggest an approach? Note: I need to use something other than \n because my data contains \n as part of the column values. Thanks & Regards Radha krishna

destroyPythonWorker job in PySpark

2016-06-23 Thread Krishna
Hi, I am running a PySpark app with 1000's of cores (partitions is a small multiple of # of cores) and the overall application performance is fine. However, I noticed that, at the end of the job, PySpark initiates job clean-up procedures and as part of this procedure, PySpark executes a job shown

Unsubscribe

2016-06-20 Thread Ram Krishna
Hi Sir, Please unsubscribe me -- Regards, Ram Krishna KT

Thanks For a Job Well Done !!!

2016-06-18 Thread Krishna Sankar
Hi all, Just wanted to thank all for the dataset API - most of the times we see only bugs in these lists ;o). - Putting some context, this weekend I was updating the SQL chapters of my book - it had all the ugliness of SchemaRDD, registerTempTable, take(10).foreach(println) and

spark streaming - how to purge old data files in data directory

2016-06-18 Thread Vamsi Krishna
Hi, I'm on HDP 2.3.2 cluster (Spark 1.4.1). I have a spark streaming app which uses 'textFileStream' to stream simple CSV files and process. I see the old data files that are processed are left in the data directory. What is the right way to purge the old data files in data directory on HDFS?

Re: Error Running SparkPi.scala Example

2016-06-17 Thread Krishna Kalyan
;, "SPARK_SUBMIT" -> "true", "spark.driver.cores" -> "5", "spark.ui.enabled" -> "false", "spark.driver.supervise" -> "true", "spark.app.name" -> "org.SomeClass", "spark.jars" -> "file:

Re: how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Vamsi Krishna
Thanks. It works. On Thu, Jun 16, 2016 at 5:32 PM Hyukjin Kwon <gurwls...@gmail.com> wrote: > It will 'auto-detect' the compression codec by the file extension and then > will decompress and read it correctly. > > Thanks! > > 2016-06-16 20:27 GMT+09:00 Vamsi Krishna

how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Vamsi Krishna
Hi, I'm using Spark 1.4.1 (HDP 2.3.2). As per the spark-csv documentation (https://github.com/databricks/spark-csv), I see that we can write to a csv file in compressed form using the 'codec' option. But, didn't see the support for 'codec' option to read a csv file. Is there a way to read a
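
A sketch of what the reply above describes, using the spark-csv package of that era; sqlContext is assumed to exist and the paths are placeholders:

    # Reading: spark-csv detects the gzip codec from the .gz extension.
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("hdfs:///data/input.csv.gz"))

    # Writing is where the explicit 'codec' option applies.
    (df.write
       .format("com.databricks.spark.csv")
       .option("header", "true")
       .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
       .save("hdfs:///data/output_csv"))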

Error Running SparkPi.scala Example

2016-06-15 Thread Krishna Kalyan
not found - continuing with a stub. Warning:scalac: Class org.jboss.netty.channel.group.ChannelGroup not found - continuing with a stub. Warning:scalac: Class com.google.common.collect.ImmutableMap not found - continuing with a stub. /Users/krishna/Experiment/spark/external/flume-sink/src/main/scala

Re: RBM in mllib

2016-06-14 Thread Krishna Kalyan
Hi Robert, According to the JIRA, the resolution is "Won't Fix". The pull request was closed as it did not merge cleanly with master. (https://github.com/apache/spark/pull/3222) On Tue, Jun 14, 2016 at 4:23 PM, Roberto Pagliari wrote: > Is RBM being developed? > >

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
ecommend a smaller first project building new functionality in > Spark as a good starting point rather than adding a new algorithm right > away, since you learn a lot in the process of making your first > contribution. > > On Friday, June 10, 2016, Ram Krishna <ramkrishna.a...@gmail

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi All, How do I add a new ML algorithm to Spark MLlib? On Fri, Jun 10, 2016 at 12:50 PM, Ram Krishna <ramkrishna.a...@gmail.com> wrote: > Hi All, > > I am new to this field, I want to implement a new ML algo using Spark > MLlib. What is the procedure? > > -- >

Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi All, I am new to this field, and I want to implement a new ML algo using Spark MLlib. What is the procedure? -- Regards, Ram Krishna KT

Re: Dataframe fails for large resultsize

2016-04-29 Thread Krishna
I recently encountered similar network-related errors and was able to fix them by applying the ethtool updates described here [ https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-5085] On Friday, April 29, 2016, Buntu Dev wrote: > Just to provide more details, I

HADOOP_HOME or hadoop.home.dir are not set

2016-03-21 Thread Hari Krishna Dara
I am using Spark 1.5.2 in yarn mode with Hadoop 2.6.0 (cdh5.4.2) and I am consistently seeing the below exception in the map container logs for Spark jobs (full stacktrace at the end of the message): java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set. at

Saving intermediate results in mapPartitions

2016-03-18 Thread Krishna
Hi, I have a situation where the elements output by each partition from mapPartitions don't fit into RAM, even with the lowest number of rows per partition (there is a hard lower limit on this value). What's the best way to address this problem? During the mapPartitions phase, is

Re: Is Spark right for us?

2016-03-06 Thread Krishna Sankar
Good question. It comes down to computational complexity, computational scale, and data volume. 1. If you can store the data on a single server or a small cluster of DB servers (say MySQL), then HDFS/Spark might be overkill. 2. If you can run the computation/process the data on a single

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
Also the cluster centroid I get in streaming mode (some with negative values) do not make sense - if I use the same data and run in batch KMeans.train(sc.parallelize(parsedData), numClusters, numIterations) cluster centers are what you would expect. Krishna On Fri, Feb 19, 2016 at 12:49 PM

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
, 6706.05424139] and monitor. please let know if I missed something Krishna On Fri, Feb 19, 2016 at 10:59 AM, Bryan Cutler <cutl...@gmail.com> wrote: > Can you share more of your code to reproduce this issue? The model should > be updated with each batch, but can't tell what

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
(at least until convergence) But am seeing same centers always for the entire duration - ran the app for several hours with a custom receiver. Yes I am using the latestModel to predict using "labeled" test data. But also like to know where my centers are regards Krishna On Fri, Feb 19, 201
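
For reference, a minimal sketch of the StreamingKMeans API under discussion; the input DStreams and the dimensionality are assumptions. The expectation in this thread is that latestModel() returns different centers as batches arrive:

    from pyspark.mllib.clustering import StreamingKMeans

    # train_stream / test_stream are assumed DStreams of mllib Vectors.
    model = (StreamingKMeans(k=4, decayFactor=1.0)
             .setRandomCenters(dim=3, weight=0.0, seed=42))

    model.trainOn(train_stream)             # updates the model every batch
    predictions = model.predictOn(test_stream)

    def show_centers(rdd):
        # Centers should drift batch to batch until the clusters converge.
        print(model.latestModel().clusterCenters)

    train_stream.foreachRDD(show_centers)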

Re: adding a split and union to a streaming application cause big performance hit

2016-02-18 Thread krishna ramachandran
en see out of memory error regards Krishna On Thu, Feb 18, 2016 at 4:54 AM, Ted Yu <yuzhih...@gmail.com> wrote: > bq. streamingContext.remember("duration") did not help > > Can you give a bit more detail on the above ? > Did you mean the job encountered OOME

streaming application redundant dag stage execution/performance/caching

2016-02-16 Thread krishna ramachandran
no dstream level "unpersist" setting "spark.streaming.unpersist" to true and streamingContext.remember("duration") did not help. Still seeing out of memory errors Krishna

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
e anything, how about storing state in external > NoSQL store such as hbase ? > > On Wed, Jan 27, 2016 at 6:37 PM, Krishna <research...@gmail.com> wrote: > >> Thanks; What I'm looking for is a way to see changes to the state of some >> variable during map(..) phase. >>

Maintain state outside rdd

2016-01-27 Thread Krishna
Hi, I've a scenario where I need to maintain state that is local to a worker that can change during map operation. What's the best way to handle this?

    incr = 0
    def row_index():
        global incr
        incr += 1
        return incr
    out_rdd = inp_rdd.map(lambda x: row_index()).collect()

"out_rdd"

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
ulators. Check out > > http://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka > , > it also tells you how to define your own for custom data types. > > On Wed, Jan 27, 2016 at 7:22 PM, Krishna <research...@gmail.com > <javascript:;>> wrote: >

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
5 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Have you looked at this method ? > >* Zips this RDD with its element indices. The ordering is first based > on the partition index > ... > def zipWithIndex(): RDD[(T, Long)] = withScope { > > On Wed, Jan 27, 2016 at 6:03
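
Pulling the thread's two suggestions into one small sketch: zipWithIndex for per-row indices (what the global counter in the original question was attempting) and an accumulator for worker-side counts that are read back on the driver:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "state-outside-rdd")
    inp_rdd = sc.parallelize(["a", "b", "c", "d"], 2)

    # Per-element indices without shared mutable state.
    indexed = inp_rdd.zipWithIndex().collect()      # [('a', 0), ('b', 1), ...]

    # An accumulator can be updated inside map(), but its value is only
    # reliably readable on the driver after an action has run.
    rows_seen = sc.accumulator(0)

    def tag(x):
        rows_seen.add(1)
        return x

    inp_rdd.map(tag).count()
    print(indexed, rows_seen.value)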

Window range in Spark

2016-01-26 Thread Krishna
Hi, We receive bursts of data with sequential ids and I would like to find the range for each burst-window. What's the best way to find the "window" ranges in Spark? Input --- 1 2 3 4 6 7 8 100 101 102 500 700 701 702 703 704 Output (window start, window end)
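
One way to compute the burst windows described here is to start a new window whenever the gap to the previous id exceeds 1 and then aggregate per window; a hedged PySpark DataFrame sketch using window functions, with the sample ids from the question:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    ids = [1, 2, 3, 4, 6, 7, 8, 100, 101, 102, 500, 700, 701, 702, 703, 704]
    df = spark.createDataFrame([(i,) for i in ids], ["id"])

    w = Window.orderBy("id")
    # Mark the start of a new burst whenever the gap to the previous id is > 1.
    marked = df.withColumn(
        "new_burst",
        F.when(F.col("id") - F.lag("id", 1).over(w) > 1, 1).otherwise(0))
    bursts = marked.withColumn("burst", F.sum("new_burst").over(w))

    # (window start, window end) per burst.
    bursts.groupBy("burst").agg(F.min("id").alias("start"),
                                F.max("id").alias("end")).show()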

Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Krishna Rao
Hi all, Is there a method for reading from s3 without having to hard-code keys? The only 2 ways I've found both require this: 1. Set conf in code e.g.: sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "") sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "") 2. Set keys in URL, e.g.:
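
The usual answer is to let the Hadoop AWS connector resolve credentials itself (environment variables, ~/.aws/credentials, or an EC2/EMR instance profile) so nothing is hard-coded. A hedged sketch using the modern s3a connector; the provider property comes from the hadoop-aws docs and the bucket name is a placeholder:

    from pyspark.sql import SparkSession

    # No keys in code or URLs: the default provider chain checks env vars,
    # ~/.aws/credentials, and the instance profile, in that order.
    spark = (SparkSession.builder
             .appName("s3-without-hardcoded-keys")
             .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                     "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
             .getOrCreate())

    df = spark.read.text("s3a://my-bucket/path/")    # placeholder bucket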

Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
d by query) value. I've experimented with using (abusing) Spark Streaming, by streaming queries and running these against the cached RDD. However, as I say I don't think that this is an intended use-case of Streaming. Cheers, Krishna

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
ttle bit more on the use case? It looks a little bit > like an abuse of Spark in general . Interactive queries that are not > suitable for in-memory batch processing might be better supported by ignite > that has in-memory indexes, concept of hot, warm, cold data etc. or hive on > tez+llap

Merge rows into csv

2015-12-08 Thread Krishna
Hi, what is the most efficient way to perform a group-by operation in Spark and merge rows into CSV? Here is the current RDD (ID, STATE): 1 TX, 1 NY, 1 FL, 2 CA, 2 OH. This is the required output:
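
A minimal sketch of the group-and-merge the question asks for, concatenating each ID's states into a CSV string; the output column name is an assumption:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "TX"), (1, "NY"), (1, "FL"), (2, "CA"), (2, "OH")],
        ["ID", "STATE"])

    # collect_list gathers each group's states; concat_ws joins them as CSV.
    merged = df.groupBy("ID").agg(
        F.concat_ws(",", F.collect_list("STATE")).alias("STATES"))
    merged.show()    # 1 -> TX,NY,FL    2 -> CA,OH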

Re: HDP 2.3 support for Spark 1.5.x

2015-09-28 Thread Krishna Sankar
Thanks Guys. Yep, now I would install 1.5.1 over HDP 2.3, if that works. Cheers On Mon, Sep 28, 2015 at 9:47 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Krishna: > If you want to query ORC files, see the following JIRA: > > [SPARK-10623] [SQL] Fixes ORC predicate p

Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-09-01 Thread Krishna Sangeeth KS
Hi Timothy, I think the driver memory in all your examples is more than what is necessary in usual cases, and the executor memory is quite low. I found this DevOps talk [1] at Spark Summit to be super useful in understanding a few of these configuration details. [1]

Re: Spark MLib v/s SparkR

2015-08-05 Thread Krishna Sankar
A few points to consider: a) SparkR gives the union of R_in_a_single_machine and the distributed_computing_of_Spark: b) It also gives the ability to wrangle with data in R, that is in the Spark eco system c) Coming to MLlib, the question is MLlib and R (not MLlib or R) - depending on the scale,

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread Krishna Sankar
Looks like reduceByKey() should work here. Cheers k/ On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna leonida.gianfa...@gmail.com wrote: Thanks a lot oubrik, I got your point, my consideration is that sum() should be already a built-in function for iterators in python. Anyway I tried

import pyspark.sql.Row gives error in 1.4.1

2015-07-02 Thread Krishna Sankar
Error - ImportError: No module named Row Cheers enjoy the long weekend k/
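
For reference, Row is a class in the pyspark.sql package rather than a submodule, which is the usual cause of this ImportError; importing it from the package works:

    # Fails: ImportError: No module named Row
    # import pyspark.sql.Row

    # Works: Row is a class exported by the pyspark.sql package.
    from pyspark.sql import Row

    person = Row(name="alice", age=30)
    print(person.name, person.age)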

Re: making dataframe for different types using spark-csv

2015-07-01 Thread Krishna Sankar
- use .cast(...).alias('...') after the DataFrame is read. - sql.functions.udf for any domain-specific conversions. Cheers k/ On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi experts! I am using spark-csv to load csv data into a dataframe. By default it
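
A small sketch of the two suggestions above, assuming df is the all-string DataFrame produced by spark-csv and that the column names are placeholders:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # 1) cast + alias for straightforward type conversions.
    typed = df.select(
        F.col("id").cast("int").alias("id"),
        F.col("price").cast("double").alias("price"))

    # 2) A UDF for domain-specific conversions that cast() cannot express,
    #    e.g. turning "12,50" into 12.50.
    to_double = F.udf(lambda s: float(s.replace(",", ".")) if s else None,
                      DoubleType())
    typed2 = df.withColumn("price", to_double(F.col("price")))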

Re: SparkSQL built in functions

2015-06-29 Thread Krishna Sankar
Interesting. Looking at the definitions, sql.functions.pow is defined only for (col,col). Just as an experiment, create a column with value 2 and see if that works. Cheers k/ On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro rcors...@gmail.com wrote: 1.4 and I did set the second parameter. The DSL
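
A one-line sketch of the experiment suggested above, wrapping the constant exponent in lit() so both arguments to pow are columns; df and the column name are placeholders:

    from pyspark.sql import functions as F

    # pow(col, col): wrap the constant 2 in lit() so both arguments are columns.
    squared = df.select(F.pow(F.col("x"), F.lit(2)).alias("x_squared"))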

Writing data to hbase using Sparkstreaming

2015-06-12 Thread Vamshi Krishna
Hi, I am trying to write data that is produced by the Kafka command-line producer for some topic. I am facing a problem and unable to proceed. Below is my code, which I am building into a jar and running through spark-submit on spark-shell. Am I doing something wrong inside foreachRDD()? What is wrong with

Re: Kmeans Labeled Point RDD

2015-05-21 Thread Krishna Sankar
You can predict and then zip it with the points RDD to get approx. same as LP. Cheers k/ On Thu, May 21, 2015 at 6:19 PM, anneywarlord anneywarl...@gmail.com wrote: Hello, New to Spark. I wanted to know if it is possible to use a Labeled Point RDD in org.apache.spark.mllib.clustering.KMeans.
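
A short sketch of the zip approach described above; points is assumed to be the RDD of mllib Vectors that was clustered:

    from pyspark.mllib.clustering import KMeans

    model = KMeans.train(points, k=3, maxIterations=20)

    # Pair each point with its predicted cluster id - roughly the shape of a
    # LabeledPoint, with the cluster id standing in for the label.
    cluster_and_point = model.predict(points).zip(points)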

Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-23 Thread Krishna Sankar
Do you have commons-csv-1.1-bin.jar in your path somewhere ? I had to download and add this. Cheers k/ On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer beancinemat...@gmail.com wrote: Afternoon all, I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: `mvn -Pyarn

Re: Dataset announcement

2015-04-15 Thread Krishna Sankar
Thanks Olivier. Good work. Interesting in more than one ways - including training, benchmarking, testing new releases et al. One quick question - do you plan to make it available as an S3 bucket ? Cheers k/ On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc wrote: Dear Spark

Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi, Thanks for your response. I modified my code as per your suggestion, but now I am getting a runtime error. Here's my code: val df_1 = df.filter( df(event) === 0) . select(country, cnt) val df_2 = df.filter( df(event) === 3) . select(country, cnt)

Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
= df.filter( df(event) === 0) . select(country, cnt).as(a) val df_2 = df.filter( df(event) === 3) . select(country, cnt).as(b) val both = df_2.join(df_1, $a.country === $b.country), left_outer) On Tue, Mar 24, 2015 at 11:57 PM, S Krishna skrishna

Re: IPyhon notebook command for spark need to be updated?

2015-03-20 Thread Krishna Sankar
Yep, the command-option is gone. No big deal, just add the '%pylab inline' command as part of your notebook. Cheers k/ On Fri, Mar 20, 2015 at 3:45 PM, cong yue yuecong1...@gmail.com wrote: Hello: I tried ipython notebook with the following command in my environment.

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Krishna Sankar
Without knowing the data size, computation storage requirements ... : - Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine. Probably 5-10 machines. - Don't go for the most exotic machines, otoh don't go for cheapest ones either. - Find a sweet spot with your

Re: Movie Recommendation tutorial

2015-02-24 Thread Krishna Sankar
number for a recommendation engine ? Cheers k/ On Tue, Feb 24, 2015 at 1:03 AM, Guillaume Charhon guilla...@databerries.com wrote: I am using Spark 1.2.1. Thank you Krishna, I am getting almost the same results as you so it must be an error in the tutorial. Xiangrui, I made some additional

Re: Movie Recommendation tutorial

2015-02-23 Thread Krishna Sankar
1. The RMSE varies a little bit between the versions. 2. Partitioned the training, validation, and test sets like so: - training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6) - validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and (x[3] % 10) < 8) - test =

Re: randomSplit instead of a huge map reduce ?

2015-02-21 Thread Krishna Sankar
- Divide and conquer with reduceByKey (like Ashish mentioned, each pair being the key) would work - looks like a mapReduce with combiners problem. I think reduceByKey would use combiners while aggregateByKey wouldn't. - Could we optimize this further by using combineByKey directly ?

Re: spark-shell working in scala-2.11

2015-01-28 Thread Krishna Sankar
Stephen, Scala 2.11 worked fine for me. Did the dev change and then compile. Not using in production, but I go back and forth between 2.10 2.11. Cheers k/ On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Hey, I recently compiled Spark master against

[no subject]

2015-01-10 Thread Krishna Sankar
Guys, registerTempTable(Employees) gives me the error Exception in thread main scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath

Re: DeepLearning and Spark ?

2015-01-09 Thread Krishna Sankar
I am also looking at this domain. We could potentially use the broadcast capability in Spark to distribute the parameters. Haven't thought thru yet. Cheers k/ On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote: Does it makes sense to use Spark's actor system (e.g. via

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Krishna Sankar
Interestingly Google Chrome translates the materials. Cheers k/ On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas vcsub...@gmail.com wrote: I do not understand Chinese but the diagrams on that page are very helpful. On Tue, Jan 6, 2015 at 9:46 PM, eric wong win19...@gmail.com wrote: A good

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-03 Thread Krishna Sankar
Alec, Good questions. Suggestions: 1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer, Cache, Queue, App Server, App (Interface), App (backend ML) et al. 2. Then slot-in the appropriate technologies - may be even multiple technologies for the same layer and

Re: Calling ALS-MlLib from desktop application/ Training ALS

2014-12-13 Thread Krishna Sankar
a) There is no absolute RMSE - it depends on the domain. Also, RMSE is the error based on what you have seen so far, a snapshot of a slice of the domain. b) My suggestion is to put the system in place, see what happens when users interact with the system, and then you can think of reducing the RMSE as

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
Good point. On the positive side, whether we choose the most efficient mechanism in Scala might not be as important, as the Spark framework mediates the distributed computation. Even if there is some declarative part in Spark, we can still choose an inefficient computation path that is not

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
A very timely article http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/ Cheers k/ P.S: Now reply to ALL. On Sun, Nov 23, 2014 at 7:16 PM, Krishna Sankar ksanka...@gmail.com wrote: Good point. On the positive side, whether we choose the most efficient mechanism in Scala might

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Krishna Sankar
Adding to already interesting answers: - Is there any case where MR is better than Spark? I don't know what cases I should be used Spark by MR. When is MR faster than Spark? - Many. MR would be better (am not saying faster ;o)) for - Very large dataset, - Multistage

JdbcRDD

2014-11-18 Thread Krishna
Hi, Are there any examples of using JdbcRDD in Java available? It's not clear what the last argument is in this example ( https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ): sc = new SparkContext(local, test) val rdd = new JdbcRDD( sc, () =

Re: JdbcRDD

2014-11-18 Thread Krishna
Thanks Kidong. I'll try your approach. On Tue, Nov 18, 2014 at 4:22 PM, mykidong mykid...@gmail.com wrote: I had also same problem to use JdbcRDD in java. For me, I have written a class in scala to get JdbcRDD, and I call this instance from java. for instance, JdbcRDDWrapper.scala like

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread S Krishna
Hi, I am using 1.1.0. I did set my twitter credentials and I am using the full path. I did not paste this in the public post. I am running on a cluster and getting the exception. Are you running in local or standalone mode? Thanks On Oct 15, 2014 3:20 AM, Akhil Das ak...@sigmoidanalytics.com

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done guys. MapReduce sort at that time was a good feat and Spark now has raised the bar with the ability to sort a PB. Like some of the folks in the list, a summary of what worked (and didn't) as well as the monitoring practices would be good. Cheers k/ P.S: What are you folks planning next ?

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread Krishna Sankar
Hi, I am sure you can use the -Pspark-ganglia-lgpl switch to enable Ganglia. This step only adds the support for Hadoop, Yarn, Hive et al in the Spark executable. No need to run if one is not using them. Cheers k/ On Thu, Oct 2, 2014 at 12:29 PM, danilopds danilob...@gmail.com wrote: Hi

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
to be 0.1 or 0.01? Best, Burak - Original Message - From: Krishna Sankar ksanka...@gmail.com To: user@spark.apache.org Sent: Wednesday, October 1, 2014 12:43:20 PM Subject: MLlib Linear Regression Mismatch Guys, Obviously I am doing something wrong. May be 4 points are too small

Re: MLlib 1.2 New Interesting Features

2014-09-29 Thread Krishna Sankar
Thanks Xiangrui. Appreciate the insights. I have uploaded the initial version of my presentation at http://goo.gl/1nBD8N Cheers k/ On Mon, Sep 29, 2014 at 12:17 AM, Xiangrui Meng men...@gmail.com wrote: Hi Krishna, Some planned features for MLlib 1.2 can be found via Spark JIRA: http

MLlib 1.2 New Interesting Features

2014-09-27 Thread Krishna Sankar
Guys, - Need help in terms of the interesting features coming up in MLlib 1.2. - I have a 2 Part, ~3 hr hands-on tutorial at the Big Data Tech Con - The Hitchhiker's Guide to Machine Learning with Python Apache Spark[2] - At minimum, it would be good to take the last 30 min

Re: mllib performance on mesos cluster

2014-09-24 Thread Sudha Krishna
Setting spark.mesos.coarse=true helped reduce the time on the mesos cluster from 17 min to around 6 min. The scheduler delay per task reduced from 40 ms to around 10 ms. thanks On Mon, Sep 22, 2014 at 12:36 PM, Xiangrui Meng men...@gmail.com wrote: 1) MLlib 1.1 should be faster than 1.0 in

Re: Out of any idea

2014-07-19 Thread Krishna Sankar
Probably you have - if not, try a very simple app in the docker container and make sure it works. Sometimes resource contention/allocation can get in the way. This happened to me in the YARN container. Also try single worker thread. Cheers k/ On Sat, Jul 19, 2014 at 2:39 PM, boci

Re: Need help on spark Hbase

2014-07-15 Thread Krishna Sankar
One vector to check is the HBase libraries in the --jars as in : spark-submit --class your class --master master url --jars

Re: Requirements for Spark cluster

2014-07-09 Thread Krishna Sankar
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in all the nodes irrespective of Hadoop/YARN. Cheers k/ On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote: I have a Spark app which runs well on local master. I'm now ready to put it on a

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Krishna Sankar
Konstantin, 1. You need to install the hadoop rpms on all nodes. If it is Hadoop 2, the nodes would have hdfs YARN. 2. Then you need to install Spark on all nodes. I haven't had experience with HDP, but the tech preview might have installed Spark as well. 3. In the end, one should

Re: Spark Processing Large Data Stuck

2014-06-21 Thread Krishna Sankar
Hi, - I have seen similar behavior before. As far as I can tell, the root cause is the out of memory error - verified this by monitoring the memory. - I had a 30 GB file and was running on a single machine with 16GB. So I knew it would fail. - But instead of raising an

Re: Spark streaming RDDs to Parquet records

2014-06-17 Thread Krishna Sankar
Mahesh, - One direction could be: create a Parquet schema, then convert and save the records to HDFS. - This might help: https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala Cheers k/ On Tue, Jun 17, 2014 at 12:52 PM,
