Re: Tuning Resource Allocation during runtime

2018-04-27 Thread jogesh anand
rk Job containing two tasks: the first task needs many executors to > run fast. The second task has many input and output operations and > shuffling, so it needs few executors, otherwise it takes a long time to > finish. > Does anyone know if that is possible in YARN? > > > Thank you. > Donni > -- *Regards,* *Jogesh Anand*
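On YARN the usual mechanism for this is dynamic allocation, which lets the executor count grow and shrink between stages based on pending tasks. A minimal sketch of the relevant settings (the min/max values are illustrative placeholders, not recommendations from this thread):

```scala
import org.apache.spark.SparkConf

// Hedged sketch: enable dynamic allocation so YARN can scale executors per stage.
val conf = new SparkConf()
  .setAppName("two-stage-job")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")     // external shuffle service is required
  .set("spark.dynamicAllocation.minExecutors", "2") // placeholder value
  .set("spark.dynamicAllocation.maxExecutors", "50") // placeholder value
```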

Kafka 010 Spark 2.2.0 Streaming / Custom checkpoint strategy

2017-10-13 Thread Anand Chandrashekar
Greetings! I would like to accomplish a custom Kafka checkpoint strategy (instead of HDFS, I would like to use Redis). Is there a strategy I can use to change this behavior? Any advice will help. Thanks! Regards, Anand.
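A common pattern for keeping offsets outside of Spark's own checkpoints is to read them with HasOffsetRanges inside foreachRDD and persist them yourself. A rough sketch against the Kafka 0.10 direct stream, assuming `stream` is the direct stream and using Jedis as the Redis client (host and key layout are hypothetical):

```scala
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
import redis.clients.jedis.Jedis

stream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here, then record the offsets in Redis instead of an HDFS checkpoint ...
  val jedis = new Jedis("redis-host", 6379)   // hypothetical Redis endpoint
  try {
    ranges.foreach { r =>
      jedis.hset("kafka:offsets:" + r.topic, r.partition.toString, r.untilOffset.toString)
    }
  } finally {
    jedis.close()
  }
}
```

On restart the application would read these hashes back and pass them as starting offsets to the direct stream.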

How to read large size files from a directory ?

2017-05-09 Thread ashwini anand
just 50 lines of each file. Please find the code at below link https://gist.github.com/ashwini-anand/0e468da9b4ab7863dff14833d34de79e The size of each file of the directory can be very large in my case, and for this reason the use of the wholeTextFiles API will be inefficient in this case. Right now
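One workaround, assuming only the first 50 lines of each file are needed, is to parallelize the file paths and open each file lazily on the executors instead of materializing whole files with wholeTextFiles; a sketch (paths are hypothetical, `sc` is the SparkContext):

```scala
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val paths: Seq[String] = Seq("/data/dir/file1", "/data/dir/file2")  // hypothetical listing

val headRdd = sc.parallelize(paths).map { p =>
  val fs = FileSystem.get(new Configuration())
  val reader = new BufferedReader(new InputStreamReader(fs.open(new Path(p))))
  try {
    // read only the first 50 lines instead of the whole (possibly huge) file
    val lines = Iterator.continually(reader.readLine()).takeWhile(_ != null).take(50).toList
    (p, lines)
  } finally {
    reader.close()
  }
}
```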

How to read large size files from a directory ?

2017-05-09 Thread ashwini anand
I am reading each file of a directory using wholeTextFiles. After that I am calling a function on each element of the rdd using map. The whole program uses just 50 lines of each file. The code is as below: def processFiles(fileNameContentsPair): fileName = fileNameContentsPair[0] result =

How does partitioning happen for binary files in spark ?

2017-04-06 Thread ashwini anand
By looking into the source code, I found that for textFile(), the partitioning is computed by the computeSplitSize() function in FileInputFormat class. This function takes into consideration the minPartitions value passed by user. As per my understanding , the same thing for binaryFiles() is

Insert a JavaPairDStream into multiple cassandra table on the basis of key.

2016-11-03 Thread Abhishek Anand
Hi All, I have a JavaPairDStream. I want to insert this dstream into multiple cassandra tables on the basis of key. One approach is to filter each key and then insert it into cassandra table. But this would call filter operation "n" times depending on the number of keys. Is there any better
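If the set of keys is small and known, a common compromise is to cache the RDD once and then filter per key, so the upstream work is not recomputed "n" times. A sketch with the DataStax connector, shown in Scala (the Java API is analogous); the value type and the key-to-table mapping are hypothetical:

```scala
import com.datastax.spark.connector._

case class Event(id: String, payload: String)            // hypothetical value type
val tableForKey = Map("clicks" -> "clicks_table",
                      "views"  -> "views_table")          // hypothetical key -> table mapping

pairDStream.foreachRDD { rdd =>                           // pairDStream: DStream[(String, Event)]
  rdd.cache()                                             // avoid recomputing upstream work per key
  tableForKey.foreach { case (key, table) =>
    rdd.filter(_._1 == key).map(_._2).saveToCassandra("my_ks", table)
  }
  rdd.unpersist()
}
```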

Re: spark infers date to be timestamp type

2016-10-26 Thread Anand Viswanathan
Hi, you can use the customSchema (for DateType) and specify dateFormat in .option(), or, on the Spark dataframe side, you can cast the timestamp column to date. Thanks and regards, Anand Viswanathan > On Oct 26, 2016, at 8:07 PM, Koert Kuipers <ko...@tresata.com&
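A sketch of both suggestions, assuming a Spark 2.x SparkSession and hypothetical column and path names:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// 1) supply a custom schema with DateType and a matching dateFormat
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("event_date", DateType)))
val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd")   // format of the date column in the source file
  .schema(schema)
  .csv("/path/to/data.csv")             // hypothetical path

// 2) or cast an inferred timestamp column back to date afterwards
val fixed = df.withColumn("event_date", col("event_date").cast("date"))
```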

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Anand Viswanathan
> I guess my assumption that "default resources (memory and cores) can handle any application" is wrong. Thanks and regards, Anand Viswanathan > On Sep 19, 2016, at 6:56 PM, Mich Talebzadeh <mich.talebza...@gmail.com> > wrote: > > If you make your driver memory

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Anand Viswanathan
Thank you so much, Kevin. My data size is around 4GB. I am not using collect(), take() or takeSample(). At the final job, the number of tasks grows up to 200,000. Still the driver crashes with OOM with the default --driver-memory 1g, but the job succeeds if I specify 2g. Thanks and regards, Anand Viswanathan

driver OOM - need recommended memory for driver

2016-09-19 Thread Anand Viswanathan
-memory…? Please suggest. Thanks and regards, Anand.

Re: Finding unique across all columns in dataset

2016-09-19 Thread Abhishek Anand
<sauravsinh...@gmail.com> wrote: > >> You can use distinct over your data frame or rdd >> >> rdd.distinct >> >> It will give you distinct across your rows. >> >> On Mon, Sep 19, 2016 at 2:35 PM, Abhishek Anand <abhis.anan...@gmail.com>

Finding unique across all columns in dataset

2016-09-19 Thread Abhishek Anand
I have an rdd which contains 14 different columns. I need to find the distinct across all the columns of the rdd and write it to hdfs. How can I achieve this? Is there any distributed data structure that I can use and keep on updating it as I traverse the new rows? Regards, Abhi
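If the rows are flat collections of values, one simple approach is to flatten and deduplicate in a distributed way; a sketch assuming the rdd holds arrays of strings and a hypothetical output path:

```scala
// rdd: RDD[Array[String]] with 14 columns per row (assumed shape)
val distinctValues = rdd
  .flatMap(row => row)   // emit every column value of every row
  .distinct()            // distributed de-duplication across all values
distinctValues.saveAsTextFile("hdfs:///tmp/distinct-values")   // hypothetical output path
```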

Re: Concatenate the columns in dataframe to create new columns using Java

2016-07-18 Thread Abhishek Anand
.length]; > for (int i = 0; i < concatColumns.length; i++) { > concatColumns[i]=df.col(array[i]); > } > > return functions.concat(concatColumns).alias(fieldName); > } > > > > On Mon, Jul 18, 2016 at 2:14 PM, Abhishek Anand <abhis.anan...@g

Re: Concatenate the columns in dataframe to create new columns using Java

2016-07-18 Thread Abhishek Anand
for (int i = 0; i < columns.length; i++) { > selectColumns[i]=df.col(columns[i]); > } > > > selectColumns[columns.length]=functions.concat(df.col("firstname"), > df.col("lastname")); > > df.select(selectColumns).sh

Concatenate the columns in dataframe to create new columns using Java

2016-07-18 Thread Abhishek Anand
Hi, I have a dataframe say having C0,C1,C2 and so on as columns. I need to create interaction variables to be taken as input for my program. For eg - I need to create I1 as concatenation of C0,C3,C5 Similarly, I2 = concat(C4,C5) and so on .. How can I achieve this in my Java code for
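A sketch of the concat approach discussed in the replies, shown in Scala for brevity; the Java equivalent is functions.concat(df.col(...)).alias(...), as in the quoted code above:

```scala
import org.apache.spark.sql.functions.{col, concat}

// build interaction columns by concatenating existing ones (column names from the question)
val withInteractions = df
  .withColumn("I1", concat(col("C0"), col("C3"), col("C5")))
  .withColumn("I2", concat(col("C4"), col("C5")))
```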

Change spark dataframe to LabeledPoint in Java

2016-06-30 Thread Abhishek Anand
Hi , I have a dataframe which i want to convert to labeled point. DataFrame labeleddf = model.transform(newdf).select("label","features"); How can I convert this to a LabeledPoint to use in my Logistic Regression model. I could do this in scala using val trainData = labeleddf.map(row =>
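A sketch of the same conversion, assuming the features column holds an mllib Vector (the usual case for 1.x pipelines) and the label is a Double; the Java version would map over labeleddf.javaRDD() instead:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

val trainData = labeleddf.rdd.map { row =>
  LabeledPoint(row.getDouble(0), row.getAs[Vector]("features"))
}
```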

Re: spark.hadoop.dfs.replication parameter not working for kafka-spark streaming

2016-05-31 Thread Abhishek Anand
I also tried jsc.sparkContext().sc().hadoopConfiguration().set("dfs.replication", "2") But still it's not working. Any ideas why it's not working? Abhi On Tue, May 31, 2016 at 4:03 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > My spark streaming

spark.hadoop.dfs.replication parameter not working for kafka-spark streaming

2016-05-31 Thread Abhishek Anand
My spark streaming checkpoint directory is being written to HDFS with the default replication factor of 3. In my streaming application, where I am listening from kafka and setting dfs.replication = 2 as below, the files are still being written with replication factor=3 SparkConf sparkConfig = new

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Abhishek Anand
features (columns of type string) will be one-hot > encoded automatically. > So pre-processing like `as.factor` is not necessary, you can directly feed > your data to the model training. > > Thanks > Yanbo > > 2016-05-30 2:06 GMT-07:00 Abhishek Anand <abhis.anan...@gma

Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Abhishek Anand
Hi , I want to run glm variant of sparkR for my data that is there in a csv file. I see that the glm function in sparkR takes a spark dataframe as input. Now, when I read a file from csv and create a spark dataframe, how could I take care of the factor variables/columns in my data ? Do I need

Unable to write stream record to cassandra table with multiple columns

2016-05-10 Thread Anand N Ilkal
I am trying to write incoming stream data to database. Following is the example program, this code creates a thread to listen to incoming stream of data which is csv data. this data needs to be split with delimiter and the array of data needs to be pushed to database as separate columns in the
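A minimal sketch of the split-and-save step with the DataStax connector, assuming `lines` is the incoming DStream[String] and a hypothetical three-column layout:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

case class Record(id: String, name: String, amount: Double)   // hypothetical column layout

lines.map { line =>
  val f = line.split(",")                  // split the csv record on the delimiter
  Record(f(0), f(1), f(2).toDouble)
}.saveToCassandra("my_ks", "my_table")     // hypothetical keyspace and table
```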

Calculating log-loss for the trained model in Spark ML

2016-05-03 Thread Abhishek Anand
I am building a ML pipeline for logistic regression. val lr = new LogisticRegression() lr.setMaxIter(100).setRegParam(0.001) val pipeline = new Pipeline().setStages(Array(geoDimEncoder,clientTypeEncoder, devTypeDimIdEncoder,pubClientIdEncoder,tmpltIdEncoder,
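There is no built-in log-loss evaluator in this API, but it can be computed from the probability column that the pipeline's transform output carries. A sketch, where the column names and the mllib Vector type are assumptions that depend on the Spark version:

```scala
// predictions = model.transform(validationData), assumed to have "label" and "probability"
val eps = 1e-15
val logLoss = predictions.select("label", "probability").rdd.map { row =>
  val y  = row.getDouble(0)
  val p  = row.getAs[org.apache.spark.mllib.linalg.Vector](1)(1)  // P(class = 1)
  val pc = math.max(eps, math.min(1 - eps, p))                    // clip to avoid log(0)
  -(y * math.log(pc) + (1 - y) * math.log(1 - pc))
}.mean()
```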

Re: removing header from csv file

2016-05-03 Thread Abhishek Anand
You can use this function to remove the header from your dataset(applicable to RDD) def dropHeader(data: RDD[String]): RDD[String] = { data.mapPartitionsWithIndex((idx, lines) => { if (idx == 0) { lines.drop(1) } lines }) } Abhi On Wed, Apr 27, 2016 at
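A slightly more explicit version of the same idea, which drops the first line only in partition 0 (where the header is assumed to live); with the csv data source available, .option("header", "true") is the simpler route:

```scala
import org.apache.spark.rdd.RDD

def dropHeader(data: RDD[String]): RDD[String] =
  data.mapPartitionsWithIndex { (idx, lines) =>
    if (idx == 0) lines.drop(1) else lines   // header is assumed to sit in partition 0
  }
```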

Clear Threshold in Logistic Regression ML Pipeline

2016-05-03 Thread Abhishek Anand
Hi All, I am trying to build a logistic regression pipeline in ML. How can I clear the threshold, which is 0.5 by default? In mllib I am able to clear the threshold to get the raw predictions using the model.clearThreshold() function. Regards, Abhi
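The ML pipeline API does not expose clearThreshold() the way mllib does (as of the versions discussed here); the usual workaround is to read the rawPrediction/probability columns that transform already emits, or to move the cutoff explicitly. A sketch, assuming `pipeline`, `lr`, `training`, and `test` from the surrounding setup; the threshold value is illustrative:

```scala
val model = pipeline.fit(training)
val predictions = model.transform(test)

// raw scores and class probabilities are available regardless of the 0.5 cutoff
predictions.select("rawPrediction", "probability", "prediction").show(5)

// if a different cutoff for the "prediction" column is wanted, set it on the estimator:
// lr.setThreshold(0.3)   // illustrative value
```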

Fwd: Facing Unusual Behavior with the executors in spark streaming

2016-04-05 Thread Abhishek Anand
Hi, I need inputs for a couple of issues that I am facing in my production environment. I am using spark version 1.4.0 spark streaming. 1) It so happens that the worker is lost on a machine and the executor still shows up in the executors tab in the UI. Even when I kill a worker using kill -9

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-04-01 Thread Abhishek Anand
(SingleThreadEventExecutor.java:116) ... 1 more Cheers !! Abhi On Fri, Apr 1, 2016 at 9:04 AM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > This is what I am getting in the executor logs > > 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while > revert

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Abhishek Anand
> > On Thu, Mar 31, 2016 at 11:32 AM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> >> Hi, >> >> Why is it so that when my disk space is full on one of the workers then >> the executor on that worker becomes unresponsive and the jobs on

Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Abhishek Anand
Hi, Why is it so that when my disk space is full on one of the workers, the executor on that worker becomes unresponsive and the jobs on that worker fail with the exception 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file

Output the data to external database at particular time in spark streaming

2016-03-08 Thread Abhishek Anand
I have a spark streaming job where I am aggregating the data by doing reduceByKeyAndWindow with an inverse function. I am keeping the data in memory for up to 2 hours, and in order to output the reduced data to an external storage I conditionally need to push the data to the DB, say at every 15th minute of
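One way to express the condition is the two-argument foreachRDD, which exposes the batch time; a sketch, assuming `reducedStream` is the windowed stream and using a placeholder write:

```scala
reducedStream.foreachRDD { (rdd, time) =>
  val minute = (time.milliseconds / 60000) % 60
  if (minute % 15 == 0) {              // only push on batches that land on a 15-minute mark
    rdd.foreachPartition { records =>
      // open a connection here and write the reduced records to the external store
      records.foreach(record => println(record))   // placeholder for the actual write
    }
  }
}
```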

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-29 Thread Abhishek Anand
<shixi...@databricks.com > wrote: > Sorry that I forgot to tell you that you should also call `rdd.count()` > for "reduceByKey" as well. Could you try it and see if it works? > > On Sat, Feb 27, 2016 at 1:17 PM, Abhishek Anand <abhis.anan...@gmail.com> >

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-29 Thread Abhishek Anand
tall the snappy native library in your new machines? > > On Fri, Feb 26, 2016 at 11:05 PM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> Any insights on this ? >> >> On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand <abhis.anan...@gmail.com> >> w

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-27 Thread Abhishek Anand
; } > }); > return stateDStream.stateSnapshots(); > > > On Mon, Feb 22, 2016 at 12:25 PM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> Hi Ryan, >> >> Reposting the code. >> >> Basically my use case is something like - I am re

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-26 Thread Abhishek Anand
Any insights on this ? On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > On changing the default compression codec which is snappy to lzf the > errors are gone !! > > How can I fix this using snappy as the codec ? > > Is there any downsid

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-25 Thread Abhishek Anand
On changing the default compression codec which is snappy to lzf the errors are gone !! How can I fix this using snappy as the codec ? Is there any downside of using lzf as snappy is the default codec that ships with spark. Thanks !!! Abhi On Mon, Feb 22, 2016 at 7:42 PM, Abhishek Anand

Query Kafka Partitions from Spark SQL

2016-02-23 Thread Abhishek Anand
Is there a way to query the json (or any other format) data stored in kafka using spark sql by providing the offset range on each of the brokers? I just want to be able to query all the partitions in a SQL manner. Thanks! Abhi
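With the 0.8 direct API this can be done in batch by building an RDD from explicit offset ranges and handing the payloads to Spark SQL; a sketch in which the topic, broker, and offsets are placeholders and the values are assumed to be JSON strings:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val offsetRanges = Array(
  OffsetRange("events", 0, 1000L, 2000L),   // topic, partition, fromOffset, untilOffset
  OffsetRange("events", 1, 1000L, 2000L))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

val kafkaRdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

val df = sqlContext.read.json(kafkaRdd.map(_._2))
df.registerTempTable("kafka_json")
sqlContext.sql("SELECT count(*) FROM kafka_json").show()
```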

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Abhishek Anand
ry 10 batches. > However, there is a known issue that prevents mapWithState from > checkpointing in some special cases: > https://issues.apache.org/jira/browse/SPARK-6847 > > On Mon, Feb 22, 2016 at 5:47 AM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> Any Insi

java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-22 Thread Abhishek Anand
Hi , I am getting the following exception on running my spark streaming job. The same job has been running fine since long and when I added two new machines to my cluster I see the job failing with the following exception. 16/02/22 19:23:01 ERROR Executor: Exception in task 2.0 in stage

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-22 Thread Abhishek Anand
Any Insights on this one ? Thanks !!! Abhi On Mon, Feb 15, 2016 at 11:08 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > I am now trying to use mapWithState in the following way using some > example codes. But, by looking at the DAG it does not seem to checkpoint > the

Re: Worker's BlockManager Folder not getting cleared

2016-02-17 Thread Abhishek Anand
Looking for an answer to this. Is it safe to delete the older files using find . -type f -cmin +200 -name "shuffle*" -exec rm -rf {} \; For a window duration of 2 hours, how old must files be before we can delete them? Thanks. On Sun, Feb 14, 2016 at 12:34 PM, Abhishek Anand <abhis.anan...@gmail.com&

Re: Saving Kafka Offsets to Cassandra at beginning of each batch in Spark Streaming

2016-02-16 Thread Abhishek Anand
raightforward to just use the normal cassandra client > to save them from the driver. > > On Tue, Feb 16, 2016 at 1:15 AM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> I have a kafka rdd and I need to save the offsets to cassandra table at >> the b

Saving Kafka Offsets to Cassandra at beginning of each batch in Spark Streaming

2016-02-15 Thread Abhishek Anand
I have a kafka rdd and I need to save the offsets to a cassandra table at the beginning of each batch. Basically I need to write the offsets of the type Offsets below, which I am getting inside foreachRDD, to cassandra. The javafunctions api to write to cassandra needs a rdd. How can I create a rdd
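One way to satisfy the connector's RDD requirement is to parallelize the offset ranges themselves; the reply above also notes it is simpler to write them from the driver with a plain Cassandra client. A sketch in Scala (the Java path via javaFunctions is analogous); keyspace, table, and column names are hypothetical:

```scala
import com.datastax.spark.connector._
import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.sparkContext
    .parallelize(ranges.map(r => (r.topic, r.partition, r.untilOffset)))
    .saveToCassandra("my_ks", "kafka_offsets",
      SomeColumns("topic", "partition_id", "offset"))   // hypothetical table layout
  // ... then process rdd as usual ...
}
```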

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-15 Thread Abhishek Anand
p.m., "Ted Yu" <yuzhih...@gmail.com> wrote: > >> mapWithState supports checkpoint. >> >> There has been some bug fix since release of 1.6.0 >> e.g. >> SPARK-12591 NullPointerException using checkpointed mapWithState with >> KryoSerializer

Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-13 Thread Abhishek Anand
there. Is there any other work around ? Cheers!! Abhi On Fri, Feb 12, 2016 at 3:33 AM, Sebastian Piu <sebastian@gmail.com> wrote: > Looks like mapWithState could help you? > On 11 Feb 2016 8:40 p.m., "Abhishek Anand" <abhis.anan...@gmail.com> > wrote: > >&g

Re: Worker's BlockManager Folder not getting cleared

2016-02-13 Thread Abhishek Anand
Hi All, Any ideas on this one ? The size of this directory keeps on growing. I can see there are many files from a day earlier too. Cheers !! Abhi On Tue, Jan 26, 2016 at 7:13 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > Hi Adrian, > > I am running spark in

Stateful Operation on JavaPairDStream Help Needed !!

2016-02-11 Thread Abhishek Anand
Hi All, I have a use case like the following in my production environment, where I am listening from kafka with a slideInterval of 1 min and a windowLength of 2 hours. I have a JavaPairDStream where for each key I am getting the same key but with different value, which might appear in the same batch or
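The replies in this thread converge on mapWithState (Spark 1.6+). A minimal sketch in Scala (the Java API mirrors it via StateSpec.function) that keeps the latest value per key and exposes the state as a snapshot stream; the "keep latest" semantics are an assumption, since the original use case is truncated above:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// keep the most recent value seen for each key (assumed semantics)
val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Int]) =>
  value.foreach(state.update)                 // overwrite state with the newest value
  (key, state.getOption.getOrElse(0))
}
val stateStream = pairStream.mapWithState(spec)   // pairStream: DStream[(String, Int)]
val snapshots  = stateStream.stateSnapshots()     // full key -> state view, refreshed each batch
```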

Re: Repartition taking place for all previous windows even after checkpointing

2016-02-01 Thread Abhishek Anand
Any insights on this ? On Fri, Jan 29, 2016 at 1:08 PM, Abhishek Anand <abhis.anan...@gmail.com> wrote: > Hi All, > > Can someone help me with the following doubts regarding checkpointing : > > My code flow is something like follows -> > > 1) create direct stre

Repartition taking place for all previous windows even after checkpointing

2016-01-28 Thread Abhishek Anand
Hi All, Can someone help me with the following doubts regarding checkpointing : My code flow is something like follows -> 1) create direct stream from kafka 2) repartition kafka stream 3) mapToPair followed by reduceByKey 4) filter 5) reduceByKeyAndWindow without the inverse function 6)

Re: Worker's BlockManager Folder not getting cleared

2016-01-26 Thread Abhishek Anand
be hitting > https://issues.apache.org/jira/browse/SPARK-10975 > With spark >= 1.6: > https://issues.apache.org/jira/browse/SPARK-12430 > and also be aware of: > https://issues.apache.org/jira/browse/SPARK-12583 > > > On 25/01/2016 07:14, Abhishek Anand wrote: > > Hi

Worker's BlockManager Folder not getting cleared

2016-01-24 Thread Abhishek Anand
Hi All, How long are the shuffle files and data files stored in the block manager folder of the workers? I have a spark streaming job with a window duration of 2 hours and a slide interval of 15 minutes. When I execute the following command in my block manager path find . -type f -cmin +150 -name

Getting kafka offsets at beginning of spark streaming application

2016-01-11 Thread Abhishek Anand
Hi, Is there a way so that I can fetch the offsets from where the spark streaming starts reading from Kafka when my application starts? What I am trying is to create an initial RDD with offsets at a particular time passed as input from the command line, and the offsets from where my spark

Error on using updateStateByKey

2015-12-18 Thread Abhishek Anand
I am trying to use updateStateByKey but receiving the following error. (Spark Version 1.4.0) Can someone please point out what might be the possible reason for this error. *The method updateStateByKey(Function2) in the type JavaPairDStream is

Re: Unable to use "Batch Start Time" on worker nodes.

2015-11-30 Thread Abhishek Anand
version > of transform that allows you specify a function with two params - the > parent RDD and the batch time at which the RDD was generated. > > TD > > On Thu, Nov 26, 2015 at 1:33 PM, Abhishek Anand <abhis.anan...@gmail.com> > wrote: > >> Hi , >> >

Unable to use "Batch Start Time" on worker nodes.

2015-11-26 Thread Abhishek Anand
Hi, I need to use the batch start time in my spark streaming job. I need the value of the batch start time inside one of the functions that is called within a flatmap function in java. Please suggest how this can be done. I tried to use the StreamingListener class and set the value of a variable
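Per TD's reply above, the two-argument transform exposes the batch time to the closure; a sketch in Scala (the Java API has the same transform overload), where process(...) is a hypothetical helper:

```scala
// hypothetical per-record helper that needs the batch time
def process(record: String, batchMs: Long): Seq[String] = Seq(s"$batchMs:$record")

val withBatchTime = dstream.transform { (rdd, batchTime) =>
  val batchMs = batchTime.milliseconds   // captured on the driver for this batch
  rdd.flatMap(record => process(record, batchMs))
}
```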

External Table not getting updated from parquet files written by spark streaming

2015-11-19 Thread Abhishek Anand
Hi , I am using spark streaming to write the aggregated output as parquet files to the hdfs using SaveMode.Append. I have an external table created like : CREATE TABLE if not exists rolluptable USING org.apache.spark.sql.parquet OPTIONS ( path "hdfs:" ); I had an impression that in case

unsubscribe

2015-11-18 Thread VJ Anand
-- *VJ Anand* *Founder * *Sankia* vjan...@sankia.com 925-640-1340 www.sankia.com

SparkSQL on hive error

2015-10-27 Thread Anand Nalya
Runner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Any pointers, what might be wrong here? Regards, Anand

No suitable Constructor found while compiling

2015-10-18 Thread VJ Anand
I am trying to extend RDD in java, and when I call the parent constructor, it gives the error: no suitable constructor found for RDD (SparkContext, Seq, ClassTag). Here is the snippet of the code: class QueryShard extends RDD { sc (sc, (Seq)new ArrayBuffer,

Re: Join Order Optimization

2015-10-11 Thread VJ Anand
t; *Subject:* Join Order Optimization > > > > Hello, > > Does Spark-SQL support join order optimization as of the 1.5.1 release ? > From the release notes, I did not see support for this feature, but figured > will ask the users-list to be sure. > > Thanks > > Raajay

Custom RDD for Proprietary MPP database

2015-10-05 Thread VJ Anand
Hi, I need to build an RDD that supports a custom-built database (which is sharded) across several nodes. I need to build an RDD that can support and provide the partitions specific to this database. I would like to do this in Java - I see there are JavaRDD, and other specific RDD available - my

Re: Checkpoint file not found

2015-08-03 Thread Anand Nalya
(notFilter).checkpoint(interval) toNotUpdate.foreachRDD(rdd => pending = rdd ) Thanks On 3 August 2015 at 13:09, Tathagata Das t...@databricks.com wrote: Can you tell us more about streaming app? DStream operation that you are using? On Sun, Aug 2, 2015 at 9:14 PM, Anand Nalya anand.na

Checkpoint file not found

2015-08-02 Thread Anand Nalya
Hi, I'm writing a Streaming application in Spark 1.3. After running for some time, I'm getting the following exception. I'm sure that no other process is modifying the hdfs file. Any idea what might be the cause of this? 15/08/02 21:24:13 ERROR scheduler.DAGSchedulerEventProcessLoop:

Re: updateStateByKey schedule time

2015-07-21 Thread Anand Nalya
I also ran into a similar use case. Is this possible? On 15 July 2015 at 18:12, Michel Hubert mich...@vsnsystemen.nl wrote: Hi, I want to implement a time-out mechanism in the updateStateByKey(…) routine. But is there a way to retrieve the time of the start of the batch

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-10 Thread Anand Nalya
toNotUpdate = joined.filter(mynotfilter).map(mymap) base = base.union(toUpdate).reduceByKey(_+_, 2) current = toNotUpdate if(time.isMultipleOf(duration)){ base.checkpoint() current.checkpoint() } println(toUpdate.count()) // to persistence }) Thanks, Anand On 10 July

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Anand Nalya
it? On Thu, Jul 9, 2015 at 3:17 AM, Anand Nalya anand.na...@gmail.com wrote: Thats from the Streaming tab for Spark 1.4 WebUI. On 9 July 2015 at 15:35, Michel Hubert mich...@vsnsystemen.nl wrote: Hi, I was just wondering how you generated the second image with the charts. What product

Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Anand Nalya
illustrate the problem: Job execution: https://i.imgur.com/GVHeXH3.png?1 Delays: https://i.imgur.com/1DZHydw.png?1 Is there some pattern that I can use to avoid this? Regards, Anand

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Anand Nalya
Thats from the Streaming tab for Spark 1.4 WebUI. On 9 July 2015 at 15:35, Michel Hubert mich...@vsnsystemen.nl wrote: Hi, I was just wondering how you generated the second image with the charts. What product? *From:* Anand Nalya [mailto:anand.na...@gmail.com] *Sent:* donderdag 9

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Anand Nalya
, Anand On 9 July 2015 at 18:16, Dean Wampler deanwamp...@gmail.com wrote: Is myRDD outside a DStream? If so are you persisting on each batch iteration? It should be checkpointed frequently too. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product

[no subject]

2015-07-07 Thread Anand Nalya
... dstream.foreachRDD{ rdd => myRDD = myRDD.union(rdd.filter(myfilter)) } My question is: for how long will spark keep the RDDs underlying the dstream around? Is there some configuration knob that can control that? Regards, Anand

Split RDD into two in a single pass

2015-07-06 Thread Anand Nalya
of doing this in a single pass over the RDD so that when f returns true, the element goes to rdd1 and to rdd2 otherwise. Regards, Anand
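There is no built-in predicate split that produces two RDDs in one pass; the usual compromise is to cache once and filter twice, so the source is only computed once. A sketch, assuming `rdd` and a predicate `f`:

```scala
val cached      = rdd.cache()                  // materialize the source once
val matching    = cached.filter(f)             // elements where f(x) is true -> "rdd1"
val nonMatching = cached.filter(x => !f(x))    // everything else             -> "rdd2"
```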

Array fields in dataframe.write.jdbc

2015-07-02 Thread Anand Nalya
) } } Is there some way of getting arrays working for now? Thanks, Anand

Re: Cassandra Connection Issue with Spark-jobserver

2015-04-27 Thread Anand
I was able to fix the issues by providing the right version of cassandra-all and thrift libraries -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-Connection-Issue-with-Spark-jobserver-tp22587p22664.html Sent from the Apache Spark User List mailing

Re: spark1.3.1 using mysql error!

2015-04-25 Thread Anand Mohan
and beeline Things would be better once SPARK-6966 is merged into 1.4.0 when you can use 1. use the --jars parameter for spark-shell, spark-sql, etc or 2. sc.addJar to add the driver after starting spark-shell. Good Luck, Anand Mohan -- View this message in context: http://apache-spark-user

Cassandra Connection Issue with Spark-jobserver

2015-04-21 Thread Anand
the $EXTRA_JAR variable to my cassandra-spark-connector-assembly. Regards, Anand* -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cassandra-Connection-Issue-with-Spark-jobserver-tp22587.html Sent from the Apache Spark User List mailing list archive at Nabble.com

OutOfMemory error in Spark Core

2015-01-15 Thread Anand Mohan
We have our Analytics App built on Spark 1.1 Core, Parquet, Avro and Spray. We are using Kryo serializer for the Avro objects read from Parquet and we are using our custom Kryo registrator (along the lines of ADAM

Re: AVRO specific records

2014-11-05 Thread Anand Iyer
You can also use the Kite SDK to read/write Avro records: https://github.com/kite-sdk/kite-examples/tree/master/spark - Anand On Wed, Nov 5, 2014 at 2:24 PM, Laird, Benjamin benjamin.la...@capitalone.com wrote: Something like this works and is how I create an RDD of specific records. val

Re: Spark SQL Percentile UDAF

2014-10-09 Thread Anand Mohan Tumuluri
Filed https://issues.apache.org/jira/browse/SPARK-3891 Thanks, Anand Mohan On Thu, Oct 9, 2014 at 7:13 PM, Michael Armbrust mich...@databricks.com wrote: Please file a JIRA:https://issues.apache.org/jira/browse/SPARK/ https://www.google.com/url?q=https%3A%2F%2Fissues.apache.org%2Fjira

Spark SQL HiveContext Projection Pushdown

2014-10-08 Thread Anand Mohan
to work and it ends up reading the whole Parquet data for each query (which slows down a lot). Please see attached the screenshot of this. Hive itself doesn't seem to have any issues with the projection pushdown, so this is weird. Is this due to any configuration problem? Thanks in advance, Anand Mohan

Re: SparkContext startup time out

2014-07-26 Thread Anand Avati
I am bumping into this problem as well. I am trying to move to akka 2.3.x from 2.2.x in order to port to Scala 2.11 - only akka 2.3.x is available in Scala 2.11. All 2.2.x akka works fine, and all 2.3.x akka give the following exception in new SparkContext. Still investigating why..

Re: Spark on other parallel filesystems

2014-04-04 Thread Anand Avati
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.comwrote: As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won't expose