Re: Equally split a RDD partition into two partition at the same node

2017-01-14 Thread Rishi Yadav
Can you provide some more details: 1. How many partitions does the RDD have? 2. How big is the cluster? On Sat, Jan 14, 2017 at 3:59 PM Fei Hu wrote: > Dear all, > > I want to equally divide an RDD partition into two partitions. That means, > the first half of elements in the
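
A minimal sketch (not from the thread; made-up data) of one way to roughly split every partition in two by doubling the partition count. Note that repartition shuffles, so elements are not guaranteed to stay on the same node:

    val rdd = sc.parallelize(1 to 100, 4)                   // hypothetical RDD with 4 partitions
    println(rdd.partitions.length)                          // 4
    val doubled = rdd.repartition(rdd.partitions.length * 2)
    println(doubled.partitions.length)                      // 8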

Re: Error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-16 Thread Rishi Yadav
Try --jars rather than --class to submit the jar. On Fri, Aug 14, 2015 at 6:19 AM, Stephen Boesch java...@gmail.com wrote: NoClassDefFoundError differs from ClassNotFoundException: it indicates an error while initializing that class, but the class is found on the classpath. Please
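
As a hedged illustration (paths and class names below are placeholders, not from the thread): dependency jars go on --jars, while --class only names the application's main class.

    // Hypothetical launch command:
    //   spark-submit --class com.example.MyHBaseJob \
    //     --jars /path/to/hbase-common.jar,/path/to/hbase-client.jar \
    //     myapp.jar
    // With the HBase jars on the classpath, the class initializes normally:
    import org.apache.hadoop.hbase.HBaseConfiguration
    val hbaseConf = HBaseConfiguration.create()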

Re: Spark can't fetch application jar after adding it to HTTP server

2015-08-16 Thread Rishi Yadav
Can you tell us more about your environment? I understand you are running it on a single machine, but is a firewall enabled? On Sun, Aug 16, 2015 at 5:47 AM, t4ng0 manvendra.tom...@gmail.com wrote: Hi, I am new to Spark and trying to run a standalone application using spark-submit. Whatever I could

Re: Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Rishi Yadav
Why are you expecting the footprint of the DataFrame to be lower when it contains more information (RDD + schema)? On Sat, Aug 15, 2015 at 6:35 PM, Todd bit1...@163.com wrote: Hi, With the following code snippet, I cached the raw RDD (which is already in memory, but just for illustration) and its
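
A minimal sketch of the kind of comparison being discussed (made-up schema; Spark 1.x API): cache both, run an action to materialize them, then compare sizes on the Storage tab of the web UI.

    case class Record(id: Int, name: String)                    // hypothetical schema
    val rdd = sc.parallelize(1 to 1000000).map(i => Record(i, "name-" + i))
    rdd.cache().count()                                          // materialize the raw RDD

    val df = sqlContext.createDataFrame(rdd)                     // RDD + schema
    df.cache().count()                                           // materialize the DataFrame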

Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Rishi Yadav
Can you explain which transformation is failing? Here's a simple example: http://www.infoobjects.com/spark-calculating-correlation-using-rdd-of-vectors/ On Thu, Jul 23, 2015 at 5:37 AM, saif.a.ell...@wellsfargo.com wrote: I tried with an RDD[DenseVector] but RDDs are not transformable, so T+
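
Along the lines of the linked post, a minimal sketch with made-up data:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    val vectors = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 21.0),
      Vectors.dense(3.0, 29.0)))
    val corr = Statistics.corr(vectors, "pearson")   // correlation Matrix over the columns
    println(corr)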

Re: No suitable driver found for jdbc:mysql://

2015-07-22 Thread Rishi Yadav
Try setting --driver-class-path. On Wed, Jul 22, 2015 at 3:45 PM, roni roni.epi...@gmail.com wrote: Hi All, I have a cluster with Spark 1.4. I am trying to save data to MySQL but getting the error: Exception in thread "main" java.sql.SQLException: No suitable driver found for
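
A hedged sketch (Spark 1.4-era API; connection details and table names are made up) of saving a DataFrame to MySQL once the connector jar is on the driver classpath, e.g. launched with spark-submit --driver-class-path /path/to/mysql-connector-java-5.1.34-bin.jar:

    val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
    val props = new java.util.Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "secret")
    props.setProperty("driver", "com.mysql.jdbc.Driver")
    df.write.jdbc("jdbc:mysql://dbhost:3306/mydb", "mytable", props)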

Re: How to use DataFrame with MySQL

2015-03-23 Thread Rishi Yadav
For me, it's only working if I set --driver-class-path to the MySQL connector library. On Sun, Mar 22, 2015 at 11:29 PM, gavin zhang gavin@gmail.com wrote: OK, I found what the problem is: it couldn't work with mysql-connector-5.0.8. I updated the connector version to 5.1.34 and it worked. -- View
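
A hedged sketch of the read side (Spark 1.3-era API; connection details are made up), again with the connector jar supplied via --driver-class-path:

    val people = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=dbuser&password=secret",
      "dbtable" -> "people",
      "driver"  -> "com.mysql.jdbc.Driver"))
    people.show()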

Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rishi Yadav
Can you share some sample data? On Sun, Mar 15, 2015 at 8:51 PM, Rohit U rjupadhy...@gmail.com wrote: Hi, I am trying to run LogisticRegressionWithSGD on an RDD of LabeledPoints loaded using loadLibSVMFile: val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
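
For reference, a minimal sketch (made-up file path) of the flow from the quoted code; as far as I recall, the built-in input validation for binary logistic regression expects labels to be 0 or 1:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    val points: RDD[LabeledPoint] =
      MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")   // hypothetical path
    val model = LogisticRegressionWithSGD.train(points, 100)      // 100 iterations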

Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Rishi Yadav
Programmatically specifying a schema needs import org.apache.spark.sql.types._ for StructType and StructField to resolve. On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote: Yes, I think this was already just fixed by: https://github.com/apache/spark/pull/4977 a .toDF() is
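
A minimal sketch of programmatically specifying a schema with that import (field names are made up):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val rowRDD = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.printSchema()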

Re: Define size partitions

2015-01-30 Thread Rishi Yadav
If you are only concerned about a partition being too big, you can specify the number of partitions as an additional parameter while loading files from HDFS. On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser kras...@gmail.com wrote: You can also use your InputFormat/RecordReader in Spark, e.g. using
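
For example (made-up path), the optional second argument to textFile asks for a minimum number of partitions, so a large file is split into more, smaller pieces up front:

    val lines = sc.textFile("hdfs:///data/big-file.txt", 200)   // request at least 200 partitions
    println(lines.partitions.length)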

RangePartitioner

2015-01-20 Thread Rishi Yadav
I am joining two tables as below; the program stalls at the log line below and never proceeds. What might be the issue and a possible solution? INFO SparkContext: Starting job: RangePartitioner at Exchange.scala:79 Table 1 has ~450 columns, Table 2 has ~100 columns, and both tables have a few million

Re: Problem with StreamingContext - getting SPARK-2243

2015-01-08 Thread Rishi Yadav
You can also access the SparkConf using sc.getConf in the Spark shell, though for the StreamingContext you can refer to sc directly, as Akhil suggested. On Sun, Dec 28, 2014 at 12:13 AM, Akhil Das ak...@sigmoidanalytics.com wrote: In the shell you could do: val ssc = new StreamingContext(sc, Seconds(1)) as
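
A minimal sketch of both points from the Spark shell:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = sc.getConf                            // inspect the shell's SparkConf if needed
    val ssc  = new StreamingContext(sc, Seconds(1))  // reuse the shell's SparkContext directly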

Re: JavaRDD (Data Aggregation) based on key

2015-01-08 Thread Rishi Yadav
One approach is to first transform this RDD into a PairRDD by taking the field you are going to aggregate on as the key. On Tue, Dec 23, 2014 at 1:47 AM, sachin Singh sachin.sha...@gmail.com wrote: Hi, I have a csv file having fields a,b,c. I want to do aggregation (sum, average, ...) based
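
A minimal sketch in Scala (the CSV layout a,b,c and the choice of key are assumptions): key by one field, then aggregate another.

    val lines = sc.textFile("hdfs:///data/input.csv")            // hypothetical path
    val pairs = lines.map { line =>
      val f = line.split(",")
      (f(0), f(2).toDouble)                                       // (key, value to aggregate)
    }
    val sums = pairs.reduceByKey(_ + _)                           // e.g. sum per key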

Re: Profiling a spark application.

2015-01-08 Thread Rishi Yadav
As per my understanding, RDDs do not get replicated by default; the underlying data does if it's in HDFS. On Thu, Dec 25, 2014 at 9:04 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I want to find the time taken for replicating an RDD in a Spark cluster along with the computation time on the

Re: Implement customized Join for SparkSQL

2015-01-08 Thread Rishi Yadav
Hi Kevin, Say A has 10 ids, so you are pulling data from B's data source only for these 10 ids? What if you load A and B as separate SchemaRDDs and then do the join? Spark will optimize the execution path anyway when an action is fired. On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin yun...@ebay.com wrote: Hi,
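
A hedged sketch of that approach (Spark 1.2-era API; sources and column names are made up): register both sides and let the optimizer plan the join when an action runs.

    val a = sqlContext.jsonFile("hdfs:///data/a.json")
    val b = sqlContext.jsonFile("hdfs:///data/b.json")
    a.registerTempTable("a")
    b.registerTempTable("b")
    val joined = sqlContext.sql("SELECT a.id, b.value FROM a JOIN b ON a.id = b.id")
    joined.take(10).foreach(println)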

Re: sparkContext.textFile does not honour the minPartitions argument

2015-01-01 Thread Rishi Yadav
Hi Aniket, The optional number-of-partitions value is there to increase the number of partitions, not reduce it from the default value. On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am trying to read a file into a single partition but it seems like sparkContext.textFile
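
For illustration (made-up path): to end up with a single partition, coalesce after reading rather than relying on the minPartitions argument.

    val rdd = sc.textFile("hdfs:///data/file.txt", 1)   // may still produce more than 1 partition
    val single = rdd.coalesce(1)
    println(single.partitions.length)                    // 1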

Re: Cached RDD

2014-12-30 Thread Rishi Yadav
Without caching, each action triggers recomputation of the lineage. So, assuming rdd2 and rdd3 result in separate actions, the answer is yes. On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet cjno...@gmail.com wrote: If I have 2 RDDs which depend on the same RDD like the following: val rdd1 = ... val rdd2 = rdd1.groupBy()...
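
A minimal sketch of the situation with made-up data: without rdd1.cache(), both actions recompute rdd1; with it, the second action reuses the cached blocks.

    val rdd1 = sc.parallelize(1 to 1000000).map(i => (i % 10, i))
    // rdd1.cache()                      // uncomment to compute rdd1 only once
    val rdd2 = rdd1.groupByKey()
    val rdd3 = rdd1.mapValues(_ * 2)
    rdd2.count()                         // action 1: computes rdd1
    rdd3.count()                         // action 2: recomputes rdd1 unless it was cached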

Re: reduceByKey and empty output files

2014-11-30 Thread Rishi Yadav
How big is your input dataset? On Thursday, November 27, 2014, Praveen Sripati praveensrip...@gmail.com wrote: Hi, When I run the program below, I see two files in HDFS because the number of partitions is 2. But one of the files is empty. Why is that? Is the work not distributed
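
One possible explanation, as a hedged sketch with made-up data: with 2 partitions, reduceByKey hash-partitions its output by key, so if every key lands in the same partition the other part file comes out empty.

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3)), 2)
    val reduced = pairs.reduceByKey(_ + _)                 // only one distinct key
    reduced.saveAsTextFile("hdfs:///out/reduced")          // one of part-00000/part-00001 is empty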

Re: optimize multiple filter operations

2014-11-28 Thread Rishi Yadav
You can try (Scala version; you can convert it to Python): val set = initial.groupBy(x => if (x == something) key1 else key2). This would do one pass over the original data. On Fri, Nov 28, 2014 at 8:21 AM, mrm ma...@skimlinks.com wrote: Hi, My question is: I have multiple filter operations where I
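
Expanded slightly, a sketch with a made-up predicate:

    val initial = sc.parallelize(1 to 100)                         // hypothetical input
    val grouped = initial.groupBy(x => if (x % 2 == 0) "even" else "odd")
    // or keep it as a pair RDD if further per-key aggregation is needed:
    val tagged = initial.map(x => (if (x % 2 == 0) "even" else "odd", x))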

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-24 Thread Rishi Yadav
We keep conf as a symbolic link so that an upgrade is as simple as a drop-in replacement. On Monday, November 24, 2014, riginos samarasrigi...@gmail.com wrote: OK, thank you very much for that! On 23 Nov 2014 21:49, Denny Lee [via Apache Spark User List] [hidden email]

Re: Declaring multiple RDDs and efficiency concerns

2014-11-14 Thread Rishi Yadav
How about using the fluent style of Scala programming? On Fri, Nov 14, 2014 at 8:31 AM, Simone Franzini captainfr...@gmail.com wrote: Let's say I have to apply a complex sequence of operations to a certain RDD. In order to make the code more modular/readable, I would typically have something like
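
For example, a minimal sketch of the fluent style (made-up transformations):

    val result = sc.textFile("hdfs:///data/input.txt")     // hypothetical path
      .map(_.split(","))
      .filter(_.length > 2)
      .map(f => (f(0), f(2).toDouble))
      .reduceByKey(_ + _)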

Re: Assigning input files to spark partitions

2014-11-13 Thread Rishi Yadav
If your data is in HDFS, you are reading it as textFile, and each file is smaller than the block size, my understanding is that you would always get one partition per file. On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io wrote: Would it make sense to read each file in as a separate
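
A quick way to check this (made-up directory of small files):

    val rdd = sc.textFile("hdfs:///data/many-small-files/")
    println(rdd.partitions.length)    // roughly one partition per file smaller than a block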

Re: Question about textFileStream

2014-11-12 Thread Rishi Yadav
Yes, you can always specify a minimum number of partitions, and that would force some parallelism (assuming you have enough cores). On Wed, Nov 12, 2014 at 9:36 AM, Saiph Kappa saiph.ka...@gmail.com wrote: What if the window is of 5 seconds, and the file takes longer than 5 seconds to be
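
Since textFileStream itself takes only a directory, a related sketch (made-up path) is to repartition each batch of the stream to force more parallelism downstream:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc   = new StreamingContext(sc, Seconds(5))
    val lines = ssc.textFileStream("hdfs:///data/incoming").repartition(8)
    lines.count().print()
    ssc.start()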

Re: join 2 tables

2014-11-12 Thread Rishi Yadav
Please use explicit JOIN syntax. On Wed, Nov 12, 2014 at 8:57 AM, Franco Barrientos franco.barrien...@exalitica.com wrote: I have 2 tables in a Hive context, and I want to select one field of each table where the ids of each table are equal. For example, val tmp2 = sqlContext.sql("select
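
For example (table and column names are made up):

    val tmp2 = sqlContext.sql(
      "SELECT t1.colA, t2.colB FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id")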

Re: S3 table to spark sql

2014-11-11 Thread Rishi Yadav
Simple Scala: val date = new java.text.SimpleDateFormat("mmdd").parse(fechau3m) should work. Replace "mmdd" with the format fechau3m is in. If you want to do it at the case-class level: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) // HiveContext is always a good idea import
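
A hedged sketch of the case-class route (the "yyyyMMdd" pattern and field layout are assumptions; use whatever format fechau3m actually has):

    case class Record(id: String, fechau3m: java.sql.Timestamp)
    val fmt = new java.text.SimpleDateFormat("yyyyMMdd")
    val records = sc.textFile("s3n://bucket/path/").map { line =>
      val f = line.split(",")
      Record(f(0), new java.sql.Timestamp(fmt.parse(f(1)).getTime))
    }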

Re: Spark SQL : how to find element where a field is in a given set

2014-11-02 Thread Rishi Yadav
Did you create a SQLContext? On Sat, Nov 1, 2014 at 7:51 PM, abhinav chowdary abhinav.chowd...@gmail.com wrote: I have the same requirement of passing a list of values to an IN clause; when I try to do it I get the error below: scala> val longList = Seq[Expression](a, b) <console>:11: error: type
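
For reference, a minimal sketch of the SQL route using a HiveContext so the IN clause is handled by the HiveQL parser (table and column names are made up, and the table is assumed to be registered):

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val matched = sqlContext.sql(
      "SELECT * FROM events WHERE category IN ('a', 'b', 'c')")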

Re: Bug in Accumulators...

2014-10-25 Thread Rishi Yadav
Works fine for me on Spark 1.1.0 in the REPL. On Sat, Oct 25, 2014 at 1:41 PM, octavian.ganea octavian.ga...@inf.ethz.ch wrote: There is for sure a bug in the Accumulators code. More specifically, the following code works well as expected: def main(args: Array[String]) { val conf = new

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Rishi Yadav
Hi Tridib, I changed SQLContext to HiveContext and it started working. These are the steps I used: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val person = sqlContext.jsonFile("json/person.json") person.printSchema() person.registerTempTable("person") val address =
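
Continuing those steps as a hedged sketch (the address file and the join columns are made up):

    val address = sqlContext.jsonFile("json/address.json")
    address.registerTempTable("address")
    sqlContext.cacheTable("person")
    sqlContext.cacheTable("address")
    val joined = sqlContext.sql(
      "SELECT p.name, a.city FROM person p JOIN address a ON p.id = a.id")
    joined.collect().foreach(println)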

Re: How to write a RDD into One Local Existing File?

2014-10-19 Thread Rishi Yadav
Write to HDFS and then get one file locally by using hdfs dfs -getmerge... On Friday, October 17, 2014, Sean Owen so...@cloudera.com wrote: You can save to a local file. What are you trying, and what doesn't work? You can output one file by repartitioning to 1 partition, but this is probably
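
For example (made-up paths), save to HDFS and then merge the part files into one local file:

    val rdd = sc.parallelize(1 to 100)                   // hypothetical RDD
    rdd.saveAsTextFile("hdfs:///out/result")
    // then, outside Spark:  hdfs dfs -getmerge /out/result /local/path/result.txt
    // alternatively, a single part file (all data flows through one task):
    rdd.coalesce(1).saveAsTextFile("hdfs:///out/result-single")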

Re: Spark Streaming Twitter Example Error

2014-08-21 Thread Rishi Yadav
Please add the following three libraries to your classpath: spark-streaming-twitter_2.10-1.0.0.jar, twitter4j-core-3.0.3.jar, twitter4j-stream-3.0.3.jar. On Thu, Aug 21, 2014 at 1:09 PM, danilopds danilob...@gmail.com wrote: Hi! I'm beginning development with Spark Streaming, and I'm
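
With those jars on the classpath, a minimal sketch along the lines of the bundled example (Twitter credentials omitted; they would normally be supplied as twitter4j system properties):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    val ssc    = new StreamingContext(sc, Seconds(10))
    val tweets = TwitterUtils.createStream(ssc, None)    // None -> read auth from system properties
    tweets.map(_.getText).print()
    ssc.start()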