Tuning Spark job to make count() faster

2021-04-05 Thread Krishna Chakka
Hi, I am working on a Spark job. It takes 10 minutes just for the count() function. How can I make it faster? From the above image, what I understood is that 4,001 tasks are running in parallel, out of 76,553 total tasks. Here are the parameters that I am using for
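Without the job details from the post, a common first experiment is to check how many partitions feed the count and reduce a very large task count; a minimal sketch, with a hypothetical input path and an assumed parallelism factor:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-tuning-sketch").getOrCreate()

# Hypothetical input path; the original post's data source is not shown.
df = spark.read.parquet("/data/events")

# Tens of thousands of tiny tasks often dominate count() time;
# coalescing to a few tasks per core is a common first experiment.
print("partitions before:", df.rdd.getNumPartitions())
df = df.coalesce(spark.sparkContext.defaultParallelism * 3)

df.cache()          # only helps if count() (or other actions) run more than once
print(df.count())
```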

unsubscribe

2020-04-24 Thread vijay krishna

unsubscribe

2020-01-17 Thread vijay krishna

connectivity

2019-12-01 Thread Krishna Chandran Nair
Hi Team, Can anyone provide sample code to connect to ADLS on Azure using Azure Key Vault (user-managed key)?

RE: [External]Re: error while connecting to azure blob storage

2019-08-23 Thread Krishna Chandran Nair
I sent you a fake token and key From: Julien Laurenceau Sent: 23 August 2019 6:06 PM To: Krishna Chandran Nair Cc: user@spark.apache.org Subject: [External]Re: error while connecting to azure blob storage Hi Did you just publicly disclose a real token to your blob storage? If so, please be

RE: [External]Re: error while connecting to azure blob storage

2019-08-23 Thread Krishna Chandran Nair
Please find the attached error From: Roland Johann Sent: 23 August 2019 10:51 AM To: Krishna Chandran Nair Cc: user@spark.apache.org Subject: [External]Re: error while connecting to azure blob storage Hi Krishna, there seems to be no attachment. In addition, you should NEVER post private

RE: error while connecting to azure blob storage

2019-08-22 Thread Krishna Chandran Nair
Hi Team, I have written a small piece of code to connect to Azure blob storage but got an error. I have attached the error log. Please help. Calling command -- ./spark-submit stg.py --jars /home/citus/spark/spark-2.3.3-bin-hadoop2.7/jars/hadoop-azure-3.2.0.jar,/home/citus/spark/spark-2.3.3-bin-hadoop2.7/j
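The attached log is not shown here, but a minimal WASB read sketch illustrates the configuration the hadoop-azure connector typically expects; the storage account, container and key below are placeholders, not values from the post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wasb-read-sketch").getOrCreate()

# Placeholders -- substitute a real storage account, container and account key.
account = "mystorageaccount"
container = "mycontainer"
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.%s.blob.core.windows.net" % account,
    "<storage-account-key>")

df = spark.read.csv(
    "wasb://%s@%s.blob.core.windows.net/input/" % (container, account),
    header=True)
df.show(5)
```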

Failed to create file system watcher service: User limit of inotify instances reached or too many open files

2018-08-22 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hi, When I run calculations for, say, 700 listIDs, only about 50 rows are saved and then I get seemingly random exceptions. I get the exception below when I try to run calculations on huge data and save it. Please let me know if you have any suggestions. Sample Code: I have some

java.nio.file.FileSystemException: /tmp/spark- .._cache : No space left on device

2018-08-17 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hi, I am getting the exception below when I run spark-submit on a Linux machine; can someone give a quick solution with commands? Driver stacktrace: - Job 0 failed: count at DailyGainersAndLosersPublisher.scala:145, took 5.749450 s org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in

Scala Partition Question

2018-06-12 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hello, Can I do complex data manipulations inside a groupBy? i.e. I want to group my whole dataframe by a column and then do some processing for each group.
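One common way to express per-group processing is groupBy().agg() with aggregate expressions; a minimal sketch with a hypothetical schema (the original post does not show the DataFrame):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("groupby-sketch").getOrCreate()

# Hypothetical data; substitute the real DataFrame and column names.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 3.0), ("b", 2.0)], ["key", "value"])

# Per-group processing expressed as aggregate expressions.
summary = (df.groupBy("key")
             .agg(F.count("*").alias("n"),
                  F.avg("value").alias("avg_value"),
                  F.collect_list("value").alias("values")))
summary.show()
```

If the per-group work cannot be expressed as aggregates, collecting each group's values with collect_list (as above) and post-processing them, or falling back to an RDD groupByKey, are the usual alternatives.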

Re: Structured Streaming on Kubernetes

2018-04-16 Thread Krishna Kalyan
However, I'm unaware of any specific use of streaming with the Spark on Kubernetes integration right now. Would be curious to get feedback on the failover behavior right now.

Structured Streaming on Kubernetes

2018-04-13 Thread Krishna Kalyan
is a relatively new addition to spark, I was wondering if structured streaming is stable in production. We were also evaluating Apache Beam with Flink. Regards, Krishna

Re: spark application running in yarn client mode is slower than in local mode.

2018-04-08 Thread Gopala Krishna Manchukonda
Hi Junfeng , Is your kafka topic partitioned? Are you referring to the duration or the CPU time spent by the job as being 20% - 50% higher than running in local? Thanks & Regards Gopal > On 09-Apr-2018, at 11:42 AM, Jörn Franke wrote: > > Probably network / shuffling cost? Or broadcast v

Unable to build spark documentation

2017-01-11 Thread Krishna Kalyan
://gist.github.com/krishnakalyan3/08f00f49a943e43600cbc6b21f307228 Could someone please advise on how to go about resolving this error? Regards, Krishna

Unsubscribe

2016-12-16 Thread krishna ramachandran
Unsubscribe

Contributing to PySpark

2016-10-18 Thread Krishna Kalyan
Hello, I am a masters student. Could someone please let me know how to set up my dev environment to contribute to PySpark? Questions I had were: a) Should I use IntelliJ IDEA or PyCharm? b) How do I test my changes? Regards, Krishna

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
does the data split and the datasets where they are allocated to. Cheers On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar wrote: > Hi, > Looks like the test-dataset has different sizes for X & Y. Possible > steps: > >1. What is the test-data-size ? > - If i

Re: Vector size mismatch in logistic regression - Spark ML 2.0

2016-08-21 Thread Krishna Sankar
Hi, Looks like the test-dataset has different sizes for X & Y. Possible steps: 1. What is the test-data-size ? - If it is 15,909, check the prediction variable vector - it is now 29,471, should be 15,909 - If you expect it to be 29,471, then the X Matrix is not right.

Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-25 Thread Krishna Sankar
This intrigued me as well. - Just for sure, I downloaded the 1.6.2 code and recompiled. - spark-shell and pyspark both show 1.6.2 as expected. Cheers On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Another possible explanation is that by accide

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread Krishna Sankar
Thanks Nick. I also ran into this issue. VG, One workaround is to drop the NaN from predictions (df.na.drop()) and then use the dataset for the evaluator. In real life, probably detect the NaN and recommend most popular on some window. HTH. Cheers On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath
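A sketch of the workaround described above (dropping NaN predictions before evaluating); `model` is assumed to be a fitted pyspark.ml ALS model and `test` a DataFrame with the usual rating columns, neither of which is shown in the thread:

```python
from pyspark.ml.evaluation import RegressionEvaluator

# Assumed: `model` is a fitted pyspark.ml.recommendation.ALS model and
# `test` has user/item/rating columns.
predictions = model.transform(test)

# Users or items unseen at training time yield NaN predictions in Spark 2.0;
# drop them so RMSE is computed only over defined predictions.
clean = predictions.na.drop(subset=["prediction"])

rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(clean)
print(rmse)
```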

Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
I want to apply null comparison to a column in sqlcontext.sql, is there any way to achieve this? On Jul 10, 2016 8:55 PM, "Radha krishna" wrote: > Ok thank you, how to achieve the requirement. > > On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen wrote: > >> It doesn'

Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
Ok thank you, how to achieve the requirement. On Sun, Jul 10, 2016 at 8:44 PM, Sean Owen wrote: > It doesn't look like you have a NULL field, You have a string-value > field with an empty string. > > On Sun, Jul 10, 2016 at 3:19 PM, Radha krishna wrote: > > Hi All,IS NOT

IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Radha krishna
AS| | 16| | | 13| UK| | 14| US| | 20| As| | 15| IN| | 19| IR| | 11| PK| -- I am expecting the output below; any idea how to apply IS NOT NULL?

+---+----+
|_c0|code|
+---+----+
| 18|  AS|
| 13|  UK|
| 14|  US|
| 20|  As|
| 15|  IN|
| 19|  IR|
| 11|  PK|
+---+----+

Thanks & Regards Radha krishna
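As the reply in this thread points out, the "missing" codes are empty strings rather than NULLs, so IS NOT NULL alone does not remove them. A minimal sketch filtering both cases (toy data, current API):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("not-null-sketch").getOrCreate()

df = spark.createDataFrame([(18, "AS"), (16, ""), (13, "UK"), (14, "US")],
                           ["_c0", "code"])

# Filter out NULLs and empty/blank strings together.
df.filter(F.col("code").isNotNull() & (F.trim(F.col("code")) != "")).show()

# The same condition in SQL:
df.createOrReplaceTempView("t")
spark.sql("SELECT * FROM t WHERE code IS NOT NULL AND trim(code) <> ''").show()
```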

Re: Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Radha krishna
Hi Mich, I have given just sample data here; I have some GBs of files in HDFS and am performing left outer joins on those files, and the final result I am going to store in a Vertica database table. There are no duplicate columns in the target table, but for the non-matching rows' columns I want to insert "

Re: Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Radha krishna
sqlContext.createDataFrame(empRDD, Emp.class);
empDF.registerTempTable("EMP");

sqlContext.sql("SELECT * FROM EMP e LEFT OUTER JOIN DEPT d ON e.deptid = d.deptid").show();

// empDF.join(deptDF, empDF.col("deptid").equalTo(deptDF.col("deptid")), "leftouter").show();
        }
        catch (Exception e) {
            System.out.println(e);
        }
    }

    public static Emp getInstance(String[] parts, Emp emp) throws ParseException {
        emp.setId(parts[0]);
        emp.setName(parts[1]);
        emp.setDeptid(parts[2]);
        return emp;
    }

    public static Dept getInstanceDept(String[] parts, Dept dept) throws ParseException {
        dept.setDeptid(parts[0]);
        dept.setDeptname(parts[1]);
        return dept;
    }
}

Input
Emp
1001 aba 10
1002 abs 20
1003 abd 10
1004 abf 30
1005 abg 10
1006 abh 20
1007 abj 10
1008 abk 30
1009 abl 20
1010 abq 10

Dept
10 dev
20 Test
30 IT

Output
+------+-----+----+------+--------+
|deptid|   id|name|deptid|deptname|
+------+-----+----+------+--------+
|    10| 1001| aba|    10|     dev|
|    10| 1003| abd|    10|     dev|
|    10| 1005| abg|    10|     dev|
|    10| 1007| abj|    10|     dev|
|    10| 1010| abq|    10|     dev|
|    20| 1002| abs|  null|    null|
|    20| 1006| abh|  null|    null|
|    20| 1009| abl|  null|    null|
|    30| 1004| abf|  null|    null|
|    30| 1008| abk|  null|    null|
+------+-----+----+------+--------+

Thanks & Regards Radha krishna

Spark Left outer Join issue using programmatic sql joins

2016-07-06 Thread Radha krishna
Hi All, Please check below for the code, input and output; I think the output is not correct. Am I missing anything? Please guide. Code public class Test { private static JavaSparkContext jsc = null; private static SQLContext sqlContext = null; private static Configuration hadoopConf = null; pu

How to write the DataFrame results back to HDFS with other then \n as record separator

2016-06-28 Thread Radha krishna
o hdfs with the same line separator (RS[\u001e]) Thanks & Regards Radha krishna

How to write the DataFrame results back to HDFS with other then \n as record separator

2016-06-24 Thread Radha krishna
) using Java. Can anyone suggest? Note: I need to use something other than \n because my data contains \n as part of the column value. Thanks & Regards Radha krishna

destroyPythonWorker job in PySpark

2016-06-23 Thread Krishna
Hi, I am running a PySpark app with 1000's of cores (partitions is a small multiple of # of cores) and the overall application performance is fine. However, I noticed that, at the end of the job, PySpark initiates job clean-up procedures and as part of this procedure, PySpark executes a job shown

Unsubscribe

2016-06-19 Thread Ram Krishna
Hi Sir, Please unsubscribe me -- Regards, Ram Krishna KT

Thanks For a Job Well Done !!!

2016-06-18 Thread Krishna Sankar
Hi all, Just wanted to thank all for the dataset API - most of the times we see only bugs in these lists ;o). - Putting some context, this weekend I was updating the SQL chapters of my book - it had all the ugliness of SchemaRDD, registerTempTable, take(10).foreach(println) and take

spark streaming - how to purge old data files in data directory

2016-06-18 Thread Vamsi Krishna
Hi, I'm on HDP 2.3.2 cluster (Spark 1.4.1). I have a spark streaming app which uses 'textFileStream' to stream simple CSV files and process. I see the old data files that are processed are left in the data directory. What is the right way to purge the old data files in data directory on HDFS? Tha

Re: Error Running SparkPi.scala Example

2016-06-17 Thread Krishna Kalyan
;, "SPARK_SUBMIT" -> "true", "spark.driver.cores" -> "5", "spark.ui.enabled" -> "false", "spark.driver.supervise" -> "true", "spark.app.name" -> "org.SomeClass", "spark.jars" -> "file:

Re: how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Vamsi Krishna
Thanks. It works. On Thu, Jun 16, 2016 at 5:32 PM Hyukjin Kwon wrote: > It will 'auto-detect' the compression codec by the file extension and then > will decompress and read it correctly. > > Thanks! > > 2016-06-16 20:27 GMT+09:00 Vamsi Krishna : > >> Hi, &

how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Vamsi Krishna
Hi, I'm using Spark 1.4.1 (HDP 2.3.2). As per the spark-csv documentation (https://github.com/databricks/spark-csv), I see that we can write to a csv file in compressed form using the 'codec' option. But, didn't see the support for 'codec' option to read a csv file. Is there a way to read a compr
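Per the reply above, the codec is auto-detected from the file extension on read, so a gzipped CSV can be loaded with spark-csv the same way as a plain one; a minimal sketch for Spark 1.x with a hypothetical path:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="gzip-csv-read-sketch")
sqlContext = SQLContext(sc)

# spark-csv (Spark 1.x): the gzip codec is inferred from the .gz extension,
# so no read-side 'codec' option is needed.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/data/sample.csv.gz"))   # hypothetical path
df.show(5)
```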

Error Running SparkPi.scala Example

2016-06-15 Thread Krishna Kalyan
andler not found - continuing with a stub. Warning:scalac: Class org.jboss.netty.channel.group.ChannelGroup not found - continuing with a stub. Warning:scalac: Class com.google.common.collect.ImmutableMap not found - continuing with a stub. /Users/krishna/Experiment/spark/external/flume-sink/src/

Re: RBM in mllib

2016-06-14 Thread Krishna Kalyan
Hi Robert, According to the JIRA, the resolution is "Won't Fix". The pull request was closed as it did not merge cleanly with master. (https://github.com/apache/spark/pull/3222) On Tue, Jun 14, 2016 at 4:23 PM, Roberto Pagliari wrote: > Is RBM being developed? > > This one is marked as resolved,

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
maller first project building new functionality in > Spark as a good starting point rather than adding a new algorithm right > away, since you learn a lot in the process of making your first > contribution. > > On Friday, June 10, 2016, Ram Krishna wrote: > >> Hi All, >>

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi All, How do I add a new ML algorithm to Spark MLlib? On Fri, Jun 10, 2016 at 12:50 PM, Ram Krishna wrote: > Hi All, > > I am new to this field, I want to implement a new ML algo using Spark > MLlib. What is the procedure? > > -- > Regards, > Ram Krishna KT > > >

Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi All, I am new to this field, and I want to implement a new ML algorithm using Spark MLlib. What is the procedure? -- Regards, Ram Krishna KT

Re: Dataframe fails for large resultsize

2016-04-29 Thread Krishna
I recently encountered similar network-related errors and was able to fix them by applying the ethtool updates described here [ https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-5085] On Friday, April 29, 2016, Buntu Dev wrote: > Just to provide more details, I have 200 blocks (parq

HADOOP_HOME or hadoop.home.dir are not set

2016-03-21 Thread Hari Krishna Dara
I am using Spark 1.5.2 in yarn mode with Hadoop 2.6.0 (cdh5.4.2) and I am consistently seeing the below exception in the map container logs for Spark jobs (full stacktrace at the end of the message): java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set. at org.apache.hadoop.util

Saving intermediate results in mapPartitions

2016-03-18 Thread Krishna
Hi, I've a situation where the number of elements output by each partition from mapPartitions doesn't fit into RAM, even with the lowest number of rows in the partition (there is a hard lower limit on this value). What's the best way to address this problem? During the mapPartition phase, is ther
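One way to keep a partition's output from being materialized all at once is to yield results lazily from a generator inside mapPartitions; this only helps when the downstream action also streams the results. A minimal sketch with a placeholder per-row transform:

```python
from pyspark import SparkContext

sc = SparkContext(appName="mappartitions-sketch")

def process(partition):
    # Yield results one at a time instead of building a list,
    # so the whole partition's output never sits in memory at once.
    for row in partition:
        yield row * 2          # placeholder for the real per-row work

rdd = sc.parallelize(range(1000000), 8)
out = rdd.mapPartitions(process)
print(out.take(5))
```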

Re: Is Spark right for us?

2016-03-06 Thread Krishna Sankar
Good question. It comes down to computational complexity, computational scale and data volume. 1. If you can store the data in a single server or a small cluster of db servers (say mysql) then hdfs/Spark might be overkill 2. If you can run the computation/process the data on a single machine

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
Also, the cluster centroids I get in streaming mode (some with negative values) do not make sense - if I use the same data and run in batch mode, KMeans.train(sc.parallelize(parsedData), numClusters, numIterations) cluster centers are what you would expect. Krishna

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
, 6706.05424139] and monitor. please let know if I missed something Krishna On Fri, Feb 19, 2016 at 10:59 AM, Bryan Cutler wrote: > Can you share more of your code to reproduce this issue? The model should > be updated with each batch, but can't tell what is happening from what y

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
least until convergence). But I am seeing the same centers for the entire duration - I ran the app for several hours with a custom receiver. Yes, I am using the latestModel to predict using "labeled" test data. But I would also like to know where my centers are. regards Krishna On Fri, Feb 19, 201
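A minimal sketch of how StreamingKMeans is typically wired up so the model is re-trained (and the centers move) each batch; the dimensions, decay factor and the queue-based toy stream are illustrative, not taken from the thread:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="streaming-kmeans-sketch")
ssc = StreamingContext(sc, batchDuration=5)

# Toy input stream; in the original app this would be the custom receiver.
batches = [sc.parallelize([Vectors.dense([x, x]) for x in range(i, i + 10)])
           for i in range(0, 30, 10)]
stream = ssc.queueStream(batches)

model = (StreamingKMeans(k=2, decayFactor=0.5)     # decay < 1 down-weights old batches
         .setRandomCenters(dim=2, weight=0.0, seed=42))
model.trainOn(stream)

# Print the centers after each batch to confirm they actually move.
stream.foreachRDD(lambda rdd: print(model.latestModel().centers))

ssc.start()
ssc.awaitTerminationOrTimeout(20)
ssc.stop()
```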

Re: adding a split and union to a streaming application cause big performance hit

2016-02-18 Thread krishna ramachandran
en see out of memory error regards Krishna On Thu, Feb 18, 2016 at 4:54 AM, Ted Yu wrote: > bq. streamingContext.remember("duration") did not help > > Can you give a bit more detail on the above ? > Did you mean the job encountered OOME later on ? > > Which Spark re

streaming application redundant dag stage execution/performance/caching

2016-02-16 Thread krishna ramachandran
re is no dstream level "unpersist" setting "spark.streaming.unpersist" to true and streamingContext.remember("duration") did not help. Still seeing out of memory errors Krishna

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
> http://spark.apache.org/docs/latest/programming-guide.html#accumulators , it also tells you how to define your own for custom data types. > On Wed, Jan 27, 2016 at 7:22 PM, Krishna > wrote: > > mapPartitions(...) seems like a good candidate, since

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
oring state in external > NoSQL store such as hbase ? > > On Wed, Jan 27, 2016 at 6:37 PM, Krishna wrote: > >> Thanks; What I'm looking for is a way to see changes to the state of some >> variable during map(..) phase. >> I simplified the scenario in my example by m

Re: Maintain state outside rdd

2016-01-27 Thread Krishna
at 6:25 PM, Ted Yu wrote: > Have you looked at this method ? > >* Zips this RDD with its element indices. The ordering is first based > on the partition index > ... > def zipWithIndex(): RDD[(T, Long)] = withScope { > > On Wed, Jan 27, 2016 at 6:03 PM, Krishna wrote: > &

Maintain state outside rdd

2016-01-27 Thread Krishna
Hi, I have a scenario where I need to maintain state that is local to a worker and can change during a map operation. What's the best way to handle this?

incr = 0
def row_index():
    global incr
    incr += 1
    return incr

out_rdd = inp_rdd.map(lambda x: row_index()).collect()

"out_rdd" i
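As suggested in the replies in this thread, zipWithIndex gives a per-element index without mutable worker-local state, and an accumulator covers the pure counting case; a minimal sketch with toy data:

```python
from pyspark import SparkContext

sc = SparkContext(appName="row-index-sketch")
inp_rdd = sc.parallelize(["a", "b", "c", "d"], 2)

# Per-element indices without mutable state on the workers.
indexed = inp_rdd.zipWithIndex()          # [('a', 0), ('b', 1), ...]
print(indexed.collect())

# If only a global count/sum is needed, an accumulator works too
# (its value is only reliable on the driver, after the action).
counter = sc.accumulator(0)
inp_rdd.foreach(lambda x: counter.add(1))
print(counter.value)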

Window range in Spark

2016-01-26 Thread Krishna
Hi, We receive bursts of data with sequential ids and I would like to find the range for each burst-window. What's the best way to find the "window" ranges in Spark? Input --- 1 2 3 4 6 7 8 100 101 102 500 700 701 702 703 704 Output (window start, window end) ---
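One way to derive the burst windows: flag a new window whenever the id jumps by more than 1 (a gap threshold assumed from the sample data), build a running window id, then take min/max per window. A minimal sketch using window functions:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("burst-window-sketch").getOrCreate()

ids = [1, 2, 3, 4, 6, 7, 8, 100, 101, 102, 500, 700, 701, 702, 703, 704]
df = spark.createDataFrame([(i,) for i in ids], ["id"])

w = Window.orderBy("id")   # single, unpartitioned window; fine for a sketch

# 1 when the gap to the previous id exceeds 1, else 0 (first row counts as 0).
df = df.withColumn(
    "is_new",
    F.when(F.col("id") - F.lag("id", 1).over(w) > 1, 1).otherwise(0))

# Running sum of the flags numbers the windows 0, 1, 2, ...
df = df.withColumn("win", F.sum("is_new").over(w))

df.groupBy("win").agg(F.min("id").alias("window_start"),
                      F.max("id").alias("window_end")).orderBy("win").show()
```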

Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Krishna Rao
Hi all, Is there a method for reading from s3 without having to hard-code keys? The only 2 ways I've found both require this: 1. Set conf in code e.g.: sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "") sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "") 2. Set keys in URL, e.g.:
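One way to avoid hard-coding is to pull the keys from the environment (or, on EC2, to rely on IAM instance roles so no keys are set at all); a minimal sketch of the environment-variable approach, assuming AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are exported and the bucket name is hypothetical:

```python
import os
from pyspark import SparkContext

sc = SparkContext(appName="s3-env-creds-sketch")

# Keys come from the environment, never from source code.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
hadoop_conf.set("fs.s3n.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])

rdd = sc.textFile("s3n://my-bucket/path/")   # hypothetical bucket
print(rdd.take(5))
```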

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
more on the use case? It looks a little bit > like an abuse of Spark in general . Interactive queries that are not > suitable for in-memory batch processing might be better supported by ignite > that has in-memory indexes, concept of hot, warm, cold data etc. or hive on > tez+ll

Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
cified by query) value. I've experimented with using (abusing) Spark Streaming, by streaming queries and running these against the cached RDD. However, as I say I don't think that this is an intended use-case of Streaming. Cheers, Krishna

Merge rows into csv

2015-12-08 Thread Krishna
Hi, what is the most efficient way to perform a group-by operation in Spark and merge rows into CSV? Here is the current RDD:

ID  STATE
1   TX
1   NY
1   FL
2   CA
2   OH

This is the required output: --
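A minimal sketch of one way to do this with plain RDDs: reduceByKey keeps the merge distributed and concatenates the states per ID into a CSV string (note that the order of values within a key is not guaranteed):

```python
from pyspark import SparkContext

sc = SparkContext(appName="merge-rows-sketch")

rows = [(1, "TX"), (1, "NY"), (1, "FL"), (2, "CA"), (2, "OH")]
rdd = sc.parallelize(rows)

# Merge the states for each ID into a single comma-separated string.
merged = rdd.reduceByKey(lambda a, b: a + "," + b)
print(merged.collect())   # e.g. [(1, 'TX,NY,FL'), (2, 'CA,OH')]
```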

Re: No suitable drivers found for postgresql

2015-11-13 Thread Krishna Sangeeth KS
Hi, I have been trying to do this today at work with Impala as the data source. I have been getting the same error as well. I am using the PySpark APIs with Spark 1.3 and I was wondering if there is any workaround for PySpark. I don't think we can use the --jars option in PySpark. Cheer

Re: HDP 2.3 support for Spark 1.5.x

2015-09-28 Thread Krishna Sankar
Thanks Guys. Yep, now I would install 1.5.1 over HDP 2.3, if that works. Cheers On Mon, Sep 28, 2015 at 9:47 AM, Ted Yu wrote: > Krishna: > If you want to query ORC files, see the following JIRA: > > [SPARK-10623] [SQL] Fixes ORC predicate push-down > > which is in the 1.5.1

HDP 2.3 support for Spark 1.5.x

2015-09-22 Thread Krishna Sankar
Guys, - We have HDP 2.3 installed just now. It comes with Spark 1.3.x. The current wisdom is that it will support the 1.4.x train (which is good, need DataFrame et al). - What is the plan to support Spark 1.5.x ? Can we install 1.5.0 on HDP 2.3 ? Or will Spark 1.5.x support be in HD

Re: Spark Effects of Driver Memory, Executor Memory, Driver Memory Overhead and Executor Memory Overhead on success of job runs

2015-09-01 Thread Krishna Sangeeth KS
Hi Timothy, I think the driver memory in all your examples is more than what is necessary in usual cases, and the executor memory is quite low. I found this devops talk[1] at Spark Summit to be super useful in understanding a few of these configuration details. [1] https://.youtube.com/watch?v=l4Z

Re: Spark MLib v/s SparkR

2015-08-05 Thread Krishna Sankar
A few points to consider: a) SparkR gives the union of R_in_a_single_machine and the distributed_computing_of_Spark: b) It also gives the ability to wrangle with data in R, that is in the Spark eco system c) Coming to MLlib, the question is MLlib and R (not MLlib or R) - depending on the scale, dat

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread Krishna Sankar
Looks like reduceByKey() should work here. Cheers On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna < leonida.gianfa...@gmail.com> wrote: > Thanks a lot oubrik, > > I got your point, my consideration is that sum() should be already a > built-in function for iterators in python. > Anyway I trie
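A minimal sketch of the reduceByKey suggestion for summing values per key; for summing the elements seen by a single partition's iterator, Python's built-in sum inside mapPartitions also works:

```python
from pyspark import SparkContext

sc = SparkContext(appName="sum-sketch")

# Per-key sums via reduceByKey.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())   # [('a', 4), ('b', 6)]

# One partial sum per partition, using the partition's iterator directly.
nums = sc.parallelize(range(10), 2)
print(nums.mapPartitions(lambda it: [sum(it)]).collect())
```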

import pyspark.sql.Row gives error in 1.4.1

2015-07-02 Thread Krishna Sankar
Error - ImportError: No module named Row Cheers & enjoy the long weekend
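Row is a class inside the pyspark.sql module rather than a submodule, so "import pyspark.sql.Row" fails with the error above; the working form is:

```python
from pyspark.sql import Row   # Row is an attribute of pyspark.sql, not a module

person = Row(name="alice", age=30)
print(person.name, person.age)
```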

Re: making dataframe for different types using spark-csv

2015-07-01 Thread Krishna Sankar
- use .cast("...").alias('...') after the DataFrame is read. - sql.functions.udf for any domain-specific conversions. Cheers On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid wrote: > Hi experts! > > > I am using spark-csv to load csv data into a dataframe. By default it makes > type of each
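A sketch of the cast/alias advice above after loading with spark-csv (which reads all columns as strings by default); the path and column names are hypothetical:

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

sc = SparkContext(appName="csv-cast-sketch")
sqlContext = SQLContext(sc)

raw = (sqlContext.read.format("com.databricks.spark.csv")
       .option("header", "true")
       .load("/data/input.csv"))   # hypothetical path

# Cast string columns to the wanted types, keeping readable names.
typed = raw.select(col("id").cast("int").alias("id"),
                   col("price").cast("double").alias("price"),
                   col("country"))

# udf for any domain-specific conversion the built-in casts can't express.
normalize = udf(lambda s: s.strip().upper() if s else s, StringType())
typed = typed.withColumn("country", normalize(col("country")))
```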

Re: SparkSQL built in functions

2015-06-29 Thread Krishna Sankar
Interesting. Looking at the definitions, sql.functions.pow is defined only for (col,col). Just as an experiment, create a column with value 2 and see if that works. Cheers On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro wrote: > 1.4 and I did set the second parameter. The DSL works fine but trying
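The "create a column with value 2" experiment suggested above can be done with lit(), so both arguments to pow are columns; a minimal sketch using the current API:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pow, lit, col

spark = SparkSession.builder.appName("pow-sketch").getOrCreate()
df = spark.createDataFrame([(2.0,), (3.0,)], ["x"])

# pow(col, col): wrap the literal exponent in lit() so it is a column too.
df.select(pow(col("x"), lit(2)).alias("x_squared")).show()
```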

Writing data to hbase using Sparkstreaming

2015-06-12 Thread Vamshi Krishna
Hi, I am trying to write data that is produced by the Kafka command-line producer for some topic. I am facing a problem and unable to proceed. Below is my code, which I package as a jar and run with spark-submit. Am I doing something wrong inside foreachRDD()? What is wrong with SparkKafkaD

Re: Kmeans Labeled Point RDD

2015-05-21 Thread Krishna Sankar
You can predict and then zip it with the points RDD to get approx. same as LP. Cheers On Thu, May 21, 2015 at 6:19 PM, anneywarlord wrote: > Hello, > > New to Spark. I wanted to know if it is possible to use a Labeled Point RDD > in org.apache.spark.mllib.clustering.KMeans. After I cluster my d
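A sketch of the predict-then-zip suggestion, pairing each point with its assigned cluster id (roughly analogous to a LabeledPoint); the toy points are illustrative:

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-zip-sketch")
points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.2]])

model = KMeans.train(points, k=2, maxIterations=10)

# Pair each point with its predicted cluster id, similar to a labeled point.
labeled = points.zip(model.predict(points))
print(labeled.collect())   # e.g. [([0.0, 0.0], 0), ..., ([9.1, 9.2], 1)]
```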

Re: Trouble working with Spark-CSV package (error: object databricks is not a member of package com)

2015-04-23 Thread Krishna Sankar
Do you have commons-csv-1.1-bin.jar in your path somewhere ? I had to download and add this. Cheers On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer wrote: > Afternoon all, > > I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via: > > `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipT

Re: Dataset announcement

2015-04-15 Thread Krishna Sankar
Thanks Olivier. Good work. Interesting in more than one ways - including training, benchmarking, testing new releases et al. One quick question - do you plan to make it available as an S3 bucket ? Cheers On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle wrote: > Dear Spark users, > > I would l

Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
. select("country", "cnt").as("b") > val both = df_2.join(df_1, $"a.country" === $"b.country"), "left_outer") > > > > On Tue, Mar 24, 2015 at 11:57 PM, S Krishna wrote: > >> Hi, >> >> Thanks for your

Re: column expression in left outer join for DataFrame

2015-03-24 Thread S Krishna
Hi, Thanks for your response. I modified my code as per your suggestion, but now I am getting a runtime error. Here's my code: val df_1 = df.filter( df("event") === 0) . select("country", "cnt") val df_2 = df.filter( df("event") === 3) . select("country", "cnt
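A sketch of disambiguating the shared column names with aliases, as suggested in the reply above; this is a Python analogue of the Scala snippet, with toy data in place of the original df:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aliased-join-sketch").getOrCreate()

df_1 = spark.createDataFrame([("US", 10), ("IN", 5)], ["country", "cnt"])
df_2 = spark.createDataFrame([("US", 7), ("FR", 2)], ["country", "cnt"])

# Alias both sides so the shared column names can be referenced unambiguously.
a, b = df_1.alias("a"), df_2.alias("b")
both = b.join(a, F.col("a.country") == F.col("b.country"), "left_outer")
both.select(F.col("b.country"),
            F.col("b.cnt").alias("cnt_b"),
            F.col("a.cnt").alias("cnt_a")).show()
```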

Re: IPyhon notebook command for spark need to be updated?

2015-03-20 Thread Krishna Sankar
Yep the command-option is gone. No big deal, just add the '%pylab inline' command as part of your notebook. Cheers On Fri, Mar 20, 2015 at 3:45 PM, cong yue wrote: > Hello : > > I tried ipython notebook with the following command in my enviroment. > > PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVE

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Krishna Sankar
Without knowing the data size, computation & storage requirements ... : - Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine. Probably 5-10 machines. - Don't go for the most exotic machines, otoh don't go for cheapest ones either. - Find a sweet spot with your ve

Re: Movie Recommendation tutorial

2015-02-24 Thread Krishna Sankar
what is a good number for a recommendation engine ? Cheers On Tue, Feb 24, 2015 at 1:03 AM, Guillaume Charhon < guilla...@databerries.com> wrote: > I am using Spark 1.2.1. > > Thank you Krishna, I am getting almost the same results as you so it must > be an error in the tut

Re: Movie Recommendation tutorial

2015-02-23 Thread Krishna Sankar
1. The RMSE varies a little bit between the versions. 2. Partitioned the training, validation and test sets like so: - training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6) - validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and (x[3] % 10) < 8) - test = ratin

Re: randomSplit instead of a huge map & reduce ?

2015-02-21 Thread Krishna Sankar
- Divide and conquer with reduceByKey (like Ashish mentioned, each pair being the key) would work - looks like a "mapReduce with combiners" problem. I think reduceByKey would use combiners while aggregateByKey wouldn't. - Could we optimize this further by using combineByKey directly

Re: spark-shell working in scala-2.11

2015-01-28 Thread Krishna Sankar
Stephen, Scala 2.11 worked fine for me. Did the dev change and then compile. Not using in production, but I go back and forth between 2.10 & 2.11. Cheers On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman < stephen.haber...@gmail.com> wrote: > Hey, > > I recently compiled Spark master against

[no subject]

2015-01-10 Thread Krishna Sankar
Guys, registerTempTable("Employees") gives me the error Exception in thread "main" scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [/Applications/eclipse/plugins/org.scala-lang.scala-library_2.11.4.

Re: DeepLearning and Spark ?

2015-01-09 Thread Krishna Sankar
I am also looking at this domain. We could potentially use the broadcast capability in Spark to distribute the parameters. Haven't thought thru yet. Cheers On Fri, Jan 9, 2015 at 2:56 PM, Andrei wrote: > Does it makes sense to use Spark's actor system (e.g. via > SparkContext.env.actorSystem) t

Re: Re: I think I am almost lost in the internals of Spark

2015-01-06 Thread Krishna Sankar
Interestingly Google Chrome translates the materials. Cheers On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas wrote: > I do not understand Chinese but the diagrams on that page are very helpful. > > On Tue, Jan 6, 2015 at 9:46 PM, eric wong wrote: > >> A good beginning if you are chinese. >> >> h

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-03 Thread Krishna Sankar
Alec, Good questions. Suggestions: 1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer, Cache, Queue, App Server, App (Interface), App (backend ML) et al. 2. Then slot-in the appropriate technologies - may be even multiple technologies for the same layer and then

Re: Calling ALS-MlLib from desktop application/ Training ALS

2014-12-13 Thread Krishna Sankar
a) There is no absolute RMSE - it depends on the domain. Also, RMSE is the error based on what you have seen so far, a snapshot of a slice of the domain. b) My suggestion is to put the system in place, see what happens when users interact with the system, and then you can think of reducing the RMSE as n

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
A very timely article http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/ Cheers P.S: Now reply to ALL. On Sun, Nov 23, 2014 at 7:16 PM, Krishna Sankar wrote: > Good point. > On the positive side, whether we choose the most efficient mechanism in > Scala might

Re: Spark or MR, Scala or Java?

2014-11-23 Thread Krishna Sankar
Good point. On the positive side, whether we choose the most efficient mechanism in Scala might not be as important, as the Spark framework mediates the distributed computation. Even if there is some declarative part in Spark, we can still choose an inefficient computation path that is not apparent

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Krishna Sankar
Adding to already interesting answers: - "Is there any case where MR is better than Spark? I don't know what cases I should be used Spark by MR. When is MR faster than Spark?" - Many. MR would be better (am not saying faster ;o)) for - Very large dataset, - Multistage ma

Re: JdbcRDD

2014-11-18 Thread Krishna
Thanks Kidong. I'll try your approach. On Tue, Nov 18, 2014 at 4:22 PM, mykidong wrote: > I had also same problem to use JdbcRDD in java. > For me, I have written a class in scala to get JdbcRDD, and I call this > instance from java. > > for instance, JdbcRDDWrapper.scala like this: > > ... > >

JdbcRDD

2014-11-18 Thread Krishna
Hi, Are there any examples of using JdbcRDD in Java available? It's not clear what the last argument is in this example ( https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/rdd/JdbcRDDSuite.scala ): sc = new SparkContext("local", "test") val rdd = new JdbcRDD( sc, ()

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread S Krishna
Hi, I am using 1.1.0. I did set my twitter credentials and I am using the full path. I did not paste this in the public post. I am running on a cluster and getting the exception. Are you running in local or standalone mode? Thanks On Oct 15, 2014 3:20 AM, "Akhil Das" wrote: > I just ran the sam

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done guys. MapReduce sort at that time was a good feat and Spark now has raised the bar with the ability to sort a PB. Like some of the folks in the list, a summary of what worked (and didn't) as well as the monitoring practices would be good. Cheers P.S: What are you folks planning next ? O

Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread Krishna Sankar
Hi, I am sure you can use the -Pspark-ganglia-lgpl switch to enable Ganglia. This step only adds the support for Hadoop, YARN, Hive et al in the Spark executable. No need to run it if one is not using them. Cheers On Thu, Oct 2, 2014 at 12:29 PM, danilopds wrote: > Hi tsingfu, > > I want to see me

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
be 0.1 or 0.01? > > Best, > Burak > > - Original Message - > From: "Krishna Sankar" > To: user@spark.apache.org > Sent: Wednesday, October 1, 2014 12:43:20 PM > Subject: MLlib Linear Regression Mismatch > > Guys, >Obviously I am doing some

MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
Guys, Obviously I am doing something wrong. Maybe 4 points are too small a dataset. Can you help me figure out why the following doesn't work? a) This works : data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(10.0, [10.0]), LabeledPoint(20.0, [20.0]), LabeledPoint(30.0, [30.0]) ]
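A sketch of the step-size experiment suggested in the reply above (the default SGD step of 1.0 diverges easily on un-scaled data); it reuses the four labeled points from the post, and the exact step and iteration count are assumptions to illustrate the tuning, not values from the thread:

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

sc = SparkContext(appName="lr-step-sketch")
data = sc.parallelize([LabeledPoint(0.0, [0.0]),
                       LabeledPoint(10.0, [10.0]),
                       LabeledPoint(20.0, [20.0]),
                       LabeledPoint(30.0, [30.0])])

# A smaller step size keeps SGD from diverging on this tiny, un-scaled dataset.
model = LinearRegressionWithSGD.train(data, iterations=200, step=0.01)
print(model.weights, model.intercept)
print(model.predict([40.0]))
```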

Re: MLlib 1.2 New & Interesting Features

2014-09-29 Thread Krishna Sankar
Thanks Xiangrui. Appreciate the insights. I have uploaded the initial version of my presentation at http://goo.gl/1nBD8N Cheers On Mon, Sep 29, 2014 at 12:17 AM, Xiangrui Meng wrote: > Hi Krishna, > > Some planned features for MLlib 1.2 can be found via Spark JIRA: > http://bi

MLlib 1.2 New & Interesting Features

2014-09-27 Thread Krishna Sankar
Guys, - Need help in terms of the interesting features coming up in MLlib 1.2. - I have a 2 Part, ~3 hr hands-on tutorial at the Big Data Tech Con - "The Hitchhiker's Guide to Machine Learning with Python & Apache Spark"[2] - At minimum, it would be good to take the last 30 mi

Re: mllib performance on mesos cluster

2014-09-24 Thread Sudha Krishna
Setting spark.mesos.coarse=true helped reduce the time on the mesos cluster from 17 min to around 6 min. The scheduler delay per task reduced from 40 ms to around 10 ms. thanks On Mon, Sep 22, 2014 at 12:36 PM, Xiangrui Meng wrote: > 1) MLlib 1.1 should be faster than 1.0 in general. What's th

Re: Spark webUI - application details page

2014-08-29 Thread Sudha Krishna
I specified as follows: spark.eventLog.dir /mapr/spark_io We use mapr fs for sharing files. I did not provide an ip address or port number - just the directory name on the shared filesystem. On Aug 29, 2014 8:28 AM, "Brad Miller" wrote: > How did you specify the HDFS path? When i put > > spark

Re: Spark as a application library vs infra

2014-07-27 Thread Krishna Sankar
- IMHO, #2 is preferred as it could work in any environment (Mesos, Standalone et al). While Spark needs HDFS (for any decent distributed system) YARN is not required at all - Meson is a lot better. - Also managing the app with appropriate bootstrap/deployment framework is more flexi

Re: Out of any idea

2014-07-19 Thread Krishna Sankar
Probably you have - if not, try a very simple app in the docker container and make sure it works. Sometimes resource contention/allocation can get in the way. This happened to me in the YARN container. Also try single worker thread. Cheers On Sat, Jul 19, 2014 at 2:39 PM, boci wrote: > Hi guys
