Best practices on how to handle multiple Spark sessions

2018-09-16 Thread unk1102
Hi, I have an application which serves as an ETL job, and I have hundreds of such ETL jobs which run daily. As of now I have just one Spark session which is shared by all these jobs, and sometimes all of these jobs run at the same time, causing the Spark session to die, mostly due to memory issues. Is this a
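
A minimal Java sketch of one option, SparkSession.newSession(), assuming Spark 2.x; the app name, path, conf value and query are placeholders. Note that newSession() isolates SQL configuration, temporary views and UDFs per job but still shares the same SparkContext and executor memory, so it does not by itself prevent out-of-memory failures:

    import org.apache.spark.sql.SparkSession;

    public class EtlSessionIsolation {
        public static void main(String[] args) {
            // One shared SparkContext/base session for the whole application.
            SparkSession base = SparkSession.builder()
                    .appName("shared-etl")
                    .enableHiveSupport()
                    .getOrCreate();

            // Each ETL job gets its own session: same executors, but isolated
            // SQL conf, temp views and registered functions.
            SparkSession jobSession = base.newSession();
            jobSession.conf().set("spark.sql.shuffle.partitions", "400");

            jobSession.read().parquet("/data/etl/input")
                      .createOrReplaceTempView("input");
            jobSession.sql("SELECT count(*) FROM input").show();
        }
    }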

Is the Spark DataFrame limit function an action or a transformation?

2018-05-31 Thread unk1102
Is the Spark DataFrame limit function an action or a transformation? It returns a DataFrame, so it should be a transformation, but it executes the entire DAG, so I think it is an action. The same goes for the persist function. Please guide. Thanks in advance.
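
A short Java sketch illustrating the lazy behaviour, assuming Spark 2.x; the path is a placeholder:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class LimitIsLazy {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("limit-demo").getOrCreate();
            Dataset<Row> df = spark.read().parquet("/data/input"); // placeholder path

            // limit() only adds a node to the logical plan; no job runs here.
            Dataset<Row> top10 = df.limit(10);

            // A job is only submitted when an action is called.
            top10.show();                      // action
            System.out.println(top10.count()); // another action, another job
        }
    }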

Spark horizontal scaling is not supported in which cluster mode?

2018-05-21 Thread unk1102
Hi, I came across a Spark question about which Spark cluster manager does not support horizontal scalability. The answer options were Mesos, YARN, Standalone and local mode. I believe all cluster managers are horizontally scalable; please correct me if I am wrong. And I think the answer is local mode. Is

Best practices to keep multiple versions of a schema in Spark

2018-04-30 Thread unk1102
Hi, I have a couple of datasets where the schema keeps changing, and I store them as Parquet files. I use the mergeSchema option while loading these different-schema Parquet files into a DataFrame, and it all works fine. Now I have a requirement to maintain the difference between schemas over time, basically
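
A minimal Java sketch of the mergeSchema load described above, assuming Spark 2.x; the paths are placeholders. Printing (or persisting) the schema after each load is one simple way to record how the merged schema evolves over time:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MergeSchemaLoad {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("merge-schema").getOrCreate();

            // mergeSchema reconciles Parquet files written with different schema
            // versions; the result is the union of all column sets.
            Dataset<Row> df = spark.read()
                    .option("mergeSchema", "true")
                    .parquet("/data/events/v1", "/data/events/v2"); // placeholder paths

            df.printSchema();
        }
    }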

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Thanks much Nicolas, really appreciate it.

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the guidance, it was very useful information. If you can push that code to GitHub and share the URL it would be a great help. Looking forward to it. If you can find time to push it early it would be an even greater help, as I have to finish a POC on this use case ASAP.

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the reply. Do you have any sample code somewhere? Do you just keep the PDF in Avro binary all the time? How often do you parse it into text using PDFBox? Is it on an on-demand basis, or do you always parse to text and keep the PDF binary in Avro just as an interim state?

Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi, I need guidance on dealing with a large number of PDF files when using Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and then convert them to text using PDF parsers like Apache Tika or PDFBox, or should I convert them into text using these parsers and store them as text files? But in doing
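
A rough Java sketch of the binaryFiles plus parser route, assuming the PDFBox 2.x API; the paths are placeholders, and Tika would slot in the same way:

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.input.PortableDataStream;

    public class PdfToText {
        public static void extract(JavaSparkContext sc) {
            // Each element is (file path, lazily opened stream of the PDF bytes).
            JavaPairRDD<String, PortableDataStream> pdfs =
                    sc.binaryFiles("hdfs:///data/pdfs"); // placeholder path

            // Parse each PDF on the executors and keep (path, extracted text).
            JavaPairRDD<String, String> texts = pdfs.mapValues(stream -> {
                try (PDDocument doc = PDDocument.load(stream.toArray())) {
                    return new PDFTextStripper().getText(doc);
                }
            });

            texts.saveAsTextFile("hdfs:///data/pdf-text"); // placeholder output
        }
    }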

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi Vadim, thanks. I use the Hortonworks package. I don't think there are any seg faults; the DataFrame I am trying to write is very small in size. Can it still create a seg fault?

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, thanks Vadim, you are right. I saw line 468 already; I don't see any code there, it is just a comment. Yes, I am sure I am using all spark-* jars built for Spark 2.2.0 and Scala 2.11. I am also stuck, unfortunately, with these errors, not sure how to solve them.

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, thanks for the reply. I only see NPE and 'Task failed while writing rows' all over the place; I don't see any other errors except the SparkException 'job aborted' followed by the two exceptions I pasted earlier.

org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, I am getting the following exception when I try to write a DataFrame using the following code. Please guide. I am using Spark 2.2.0. df.write.format("parquet").mode(SaveMode.Append); org.apache.spark.SparkException: Task failed while writing rows at
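
The snippet above stops at mode(SaveMode.Append); mode() and format() only configure the writer, so a complete call needs a terminal action such as save(path). A Java sketch for comparison, assuming Spark 2.2; the path is a placeholder:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class ParquetWrite {
        // mode() and format() are builder calls; save() is what actually
        // launches the write job.
        public static void write(Dataset<Row> df) {
            df.write()
              .format("parquet")
              .mode(SaveMode.Append)
              .save("hdfs:///data/output/parquet"); // placeholder path
        }
    }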

Parquet error while saving in HDFS

2017-07-24 Thread unk1102
Hi, I am getting the following error, not sure why; it seems like a race condition, but I don't use any threads, just one thread which owns the Spark context and writes to HDFS with one Parquet partition. I am using Scala 2.10 and Spark 1.5.1. Please guide. Thanks in advance. java.io.IOException: The file

Re: How to give a name to Spark jobs shown in the Spark UI

2016-07-27 Thread unk1102
Thanks Rahul, but I think you didn't read the question properly. I have one main Spark job which I name using the approach you described. As part of the main Spark job I create multiple threads which essentially become child Spark jobs, and those jobs have no direct way of being named. On Jul 27, 2016 11:17,

How to give a name to Spark jobs shown in the Spark UI

2016-07-23 Thread unk1102
Hi, I have multiple child Spark jobs running at a time. Is there any way to name these child Spark jobs so I can identify slow-running ones, e.g. xyz_saveAsTextFile(), abc_saveAsTextFile(), etc.? Please guide. Thanks in advance.
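
A minimal Java sketch of labelling the work each child thread triggers via setJobGroup, assuming the jobs are launched from separate driver threads; the RDD, group name and output path are placeholders. The group id and description are thread-local properties, so each thread's jobs appear under their own label in the Spark UI:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class NamedChildJob {
        public static Thread launch(JavaSparkContext sc, JavaRDD<String> data,
                                    String name, String outputPath) {
            Thread t = new Thread(() -> {
                // Label every job submitted from this thread.
                sc.setJobGroup(name, "child job: " + name);
                data.saveAsTextFile(outputPath);
            });
            t.start();
            return t;
        }
    }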

How to change Spark DataFrame groupby("col1",..,"coln") into reduceByKey()?

2016-05-22 Thread unk1102
Hi, I have a Spark job which does a group by, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. Now my job shuffles huge amounts of data and slows down because of the shuffling and the group by. One reason I see is that my data is skewed; some of my group
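
A rough Java sketch of expressing such an aggregation with reduceByKey instead of groupBy, assuming Spark 2.x and an algebraic aggregate like sum; the column names col1, col2 and amount are placeholders. reduceByKey combines values map-side within each partition before the shuffle, which is usually the point of the exercise:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import scala.Tuple2;

    public class GroupByAsReduceByKey {
        // Rough equivalent of df.groupBy("col1","col2").sum("amount").
        public static JavaPairRDD<String, Double> aggregate(Dataset<Row> df) {
            return df.toJavaRDD()
                    // Build a composite key from the grouping columns.
                    .mapToPair(row -> new Tuple2<>(
                            row.getAs("col1") + "|" + row.getAs("col2"),
                            ((Number) row.getAs("amount")).doubleValue()))
                    // Partial aggregation happens per partition before shuffling.
                    .reduceByKey(Double::sum);
        }
    }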

Does DataFrame have something like set hive.groupby.skewindata=true?

2016-05-21 Thread unk1102
Hi, I have a DataFrame with hugely skewed data, in terms of TBs, and I am doing a group by on 8 fields which I can't avoid, unfortunately. I am looking to optimize this. I have found that Hive has set hive.groupby.skewindata=true; I don't use Hive, I have a Spark DataFrame. Can we achieve the above in Spark? Please guide.
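
Spark has no direct equivalent of that Hive flag, but one common workaround is a two-stage "salted" aggregation, sketched below in Java for Spark 2.x. It only works for re-aggregatable functions such as sum/count/min/max, and the column names, salt bucket count and aggregate are placeholders, not from the original post:

    import static org.apache.spark.sql.functions.rand;
    import static org.apache.spark.sql.functions.sum;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class SaltedGroupBy {
        public static Dataset<Row> skewSafeSum(Dataset<Row> df) {
            int saltBuckets = 32; // tuning knob

            // Stage 1: spread each hot key over `saltBuckets` reducers.
            Dataset<Row> partial = df
                    .withColumn("salt", rand().multiply(saltBuckets).cast("int"))
                    .groupBy("col1", "col2", "salt")
                    .agg(sum("amount").alias("partial_sum"));

            // Stage 2: combine the partial sums back to one row per real key.
            return partial
                    .groupBy("col1", "col2")
                    .agg(sum("partial_sum").alias("amount_sum"));
        }
    }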

How to avoid empty unavoidable group by keys in DataFrame?

2016-05-21 Thread unk1102
Hi, I have a Spark job which does a group by, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. Now my job shuffles huge amounts of data and slows down because of the shuffling and the group by. One reason I see is that my data is skewed; some of my group

Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread unk1102
Hi, I have a Spark job which I run on YARN, and sometimes it behaves in a weird manner: it shows a negative number of tasks in a few executors, and I keep losing executors. I also see that the number of executors is more than I requested. My job is highly tuned, not hitting OOM or any other problem. It is just that YARN behaves in a

jssc.textFileStream(directory): how to ensure it reads all incoming files entirely

2016-02-09 Thread unk1102
Hi, my actual use case is streaming text files from an HDFS directory and sending them to Kafka; please let me know if there is any existing solution for this. Anyway, I have the following code. // let's assume the directory contains one file a.txt and it has 100 lines JavaDStream<String> logData =

Re: how to send JavaDStream RDD using foreachRDD in Java

2016-02-09 Thread unk1102
Hi Sachin, how did you write to Kafka from Spark? I can't find the following methods, sendString and sendDataAsString, in KafkaUtils; can you please guide? KafkaUtil.sendString(p,topic,result.get(0)); KafkaUtils.sendDataAsString(MTP,topicName, result.get(0));
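
Spark's KafkaUtils only covers the consumer side, so sendString/sendDataAsString are presumably helpers from Sachin's own code. A common pattern is to use the plain Kafka producer inside foreachPartition, sketched below in Java under the assumption that kafka-clients is on the classpath; the broker list and topic are placeholders:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.spark.streaming.api.java.JavaDStream;

    public class DStreamToKafka {
        public static void send(JavaDStream<String> lines, String brokers, String topic) {
            lines.foreachRDD(rdd -> rdd.foreachPartition(records -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", brokers);
                props.put("key.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer",
                        "org.apache.kafka.common.serialization.StringSerializer");

                // One producer per partition per batch; pool it for real workloads.
                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    while (records.hasNext()) {
                        producer.send(new ProducerRecord<>(topic, records.next()));
                    }
                }
            }));
        }
    }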

Spark Streaming with Druid?

2016-02-06 Thread unk1102
Hi, has anybody tried Spark Streaming with Druid as a low-latency store? The combination seems powerful; is it worth trying both together? Please guide and share your experience. I am after building the best low-latency streaming analytics.

Spark MLlib: ideal way to convert categorical features into a LabeledPoint RDD?

2016-02-01 Thread unk1102
Hi, I have a dataset which is completely categorical; it does not contain even one numerical column. Now I want to apply classification using Naive Bayes; I have to predict whether a given alert is actionable or not (YES/NO). I have the following example of my dataset

Can we use localIterator when we need to process data in one partition?

2016-01-14 Thread unk1102
Hi, I have a special requirement where I need to process data in one partition at the end, after doing a lot of filtering, updating etc. in a DataFrame. Currently, to process data in one partition I am using coalesce(1), which is killing performance and is painfully slow; my jobs hang for hours, even 5-6 hours, and I don't
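
One alternative, sketched below in Java and assuming the final step really must be sequential: instead of coalesce(1), pull the already-filtered rows back to the driver with toLocalIterator(), which fetches one partition at a time, so the driver only needs memory for a single partition and the upstream work keeps its parallelism:

    import java.util.Iterator;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class SequentialFinalStep {
        public static void process(Dataset<Row> filtered) {
            // Triggers one job per partition, streaming rows to the driver.
            Iterator<Row> it = filtered.toJavaRDD().toLocalIterator();
            while (it.hasNext()) {
                handle(it.next()); // the sequential, single-threaded final step
            }
        }

        private static void handle(Row row) {
            System.out.println(row); // placeholder for the real per-row logic
        }
    }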

Re: How to optimize and make this code faster using coalesce(1) and mapPartitionsWithIndex

2016-01-13 Thread unk1102
Hi, thanks for the reply. Actually I can't share details, as it is classified and pretty complex to understand; it is not a general problem I am trying to solve, it is related to dynamic database SQL order of execution. I need to use Spark because my other jobs, which don't use coalesce, use Spark. My source data is

How to optimize and make this code faster using coalesce(1) and mapPartitionsWithIndex

2016-01-12 Thread unk1102
Hi, I have the following code which I run as part of a thread that becomes a child job of my main Spark job. It takes hours to run for large data, around 1-2 GB, because of coalesce(1); if the data is in MB/KB then it finishes faster, and with larger dataset sizes it sometimes does not complete at all. Please

Do we need to enable Tungsten sort in Spark 1.6?

2016-01-08 Thread unk1102
Hi, I was using Spark 1.5 with Tungsten sort, and now that I am using Spark 1.6 I don't see any difference; I was expecting Spark 1.6 to be faster. Anyway, do we need to enable the Tungsten and unsafe options, or are they enabled by default? I see in the documentation that the default sort manager is sort; I thought it

What should be the ideal value (unit) for spark.memory.offHeap.size?

2016-01-06 Thread unk1102
Hi, as part of the Spark 1.6 release, what should be the ideal value or unit for spark.memory.offHeap.size? I have set it to 5000; I assume that will be 5 GB, is that correct? Please guide.

Why has this job been running for an hour?

2016-01-06 Thread unk1102
Hi, I have one main Spark job which spawns multiple child Spark jobs. One of the child Spark jobs has been running for an hour and keeps hanging there. I have taken a snapshot, please see

Spark on Apache Ignite?

2016-01-05 Thread unk1102
Hi, has anybody tried and had success with Spark on Apache Ignite? It seems promising: https://ignite.apache.org/

coalesce(1).saveAsTextFile() takes forever?

2016-01-05 Thread unk1102
Hi, I am trying to save many partitions of a DataFrame into one CSV file and it takes forever for large datasets of around 5-6 GB. sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("codec","gzip").save("/path/hadoop") For small data the above code works well, but for large data it hangs

How to find the cause (waiting threads etc.) of a job hanging for 7 hours?

2016-01-01 Thread unk1102
Hi, I have a Spark job which hangs for around 7 hours or more, until the job is killed by Autosys because of a timeout. The data is not huge. I am sure it gets stuck because of GC, but I can't find the source code which causes the GC. I am reusing almost all variables, trying to minimize creating local objects

Re: How to find the cause (waiting threads etc.) of a job hanging for 7 hours?

2016-01-01 Thread unk1102
Sorry, please see the attached waiting-thread log.

Spark DataFrame callUdf does not compile?

2015-12-28 Thread unk1102
Hi, I am trying to invoke a Hive UDF using dataframe.select(callUdf("percentile_approx",col("C1"),lit(0.25))) but it does not compile; however, the same call works in the Spark Scala console. I don't understand

How to call mapPartitions on DataFrame?

2015-12-23 Thread unk1102
Hi, I have the following code where I use mapPartitions on an RDD, but then I need to convert it back into a DataFrame. Why do I need to convert the DataFrame into an RDD and back into a DataFrame just to call mapPartitions? Why can't I call it directly on the DataFrame? sourceFrame.toJavaRDD().mapPartitions(new
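
The round trip itself looks roughly like the Java sketch below, written against the Spark 2.x API where the flat-map function returns an Iterator (on 1.x the type is DataFrame and the function returns an Iterable). Spark later added mapPartitions directly on Dataset with an Encoder, but for Row-to-Row transformations the RDD detour remains common:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;

    public class MapPartitionsRoundTrip {
        public static Dataset<Row> transform(SparkSession spark, Dataset<Row> source) {
            StructType schema = source.schema(); // reused when rebuilding the DataFrame

            JavaRDD<Row> mapped = source.toJavaRDD().mapPartitions((Iterator<Row> rows) -> {
                // Per-partition setup (connections, buffers, ...) would go here.
                List<Row> out = new ArrayList<>();
                while (rows.hasNext()) {
                    out.add(rows.next()); // placeholder: apply the real per-row change
                }
                return out.iterator();
            });

            return spark.createDataFrame(mapped, schema);
        }
    }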

How to make this Spark 1.5.2 code fast and shuffle less data

2015-12-10 Thread unk1102
Hi, I have a Spark job which reads Hive ORC data, processes it and generates a CSV file at the end. These ORC files are Hive partitions, and I have around 2000 partitions to process every day. The total size of these Hive partitions is around 800 GB in HDFS. I have the following method, which I call from a

No support for saving a DataFrame into an existing database table using DataFrameWriter.jdbc()

2015-12-06 Thread unk1102
Hi, I would like to store/save a DataFrame into a database table which is already created, and I always want to insert into it without creating the table every time. Unfortunately the Spark API forces me to create the table every time. I have seen the Spark source code; the following calls use the same method underneath. If you

Why does a Spark job get stuck waiting only for the last tasks to finish?

2015-12-03 Thread unk1102
Hi, I have a Spark job where I keep a queue of 12 Spark jobs to execute in parallel. Now I see the job is almost completed and only one task is pending, and because of that last task the job keeps on waiting, as I can see in the UI. Please see the attached snaps. Please help me with how to stop Spark jobs from waiting for the last

How to increase active job count to make spark job faster?

2015-10-27 Thread unk1102
Hi, I have a long-running Spark job which processes Hadoop ORC files and creates Hive partitions. Even though I have created an ExecutorService thread pool and use a pool of 15 threads, I see the active job count as always 1, which makes the job slow. How do I increase the active job count in the UI? I remember earlier it

Spark 1.5.1 hadoop 2.4 does not clear hive staging files after job finishes

2015-10-26 Thread unk1102
Hi, I have a Spark job which creates Hive table partitions. I have switched to Spark 1.5.1, and Spark 1.5.1 creates many Hive staging files and doesn't delete them after the job finishes. Is it a bug, or do I need to disable something to prevent the Hive staging files from getting created, or at least

Spark can't create ORC files properly using 1.5.1 / Hadoop 2.6

2015-10-23 Thread unk1102
Hi, I am having a weird issue. I have a Spark job with a bunch of hiveContext.sql() calls which creates ORC files as part of Hive tables with partitions, and it runs fine on 1.4.1 and Hadoop 2.4. Now I tried to move to Spark 1.5.1/Hadoop 2.6 and the Spark job does not work as expected; it does not create the ORC

How to calculate percentile of a column of DataFrame?

2015-10-09 Thread unk1102
Hi, how do I calculate the percentile of a column in a DataFrame? I can't find any percentile_approx function among the Spark aggregation functions. For example, in Hive we have percentile_approx and we can use it in the following way: hiveContext.sql("select percentile_approx(mycol,0.25) from myTable"); I can see

How to tune unavoidable group by query?

2015-10-09 Thread unk1102
Hi, I have the following group by query which I have tried both through the DataFrame API and through hiveContext.sql(), but both shuffle huge amounts of data and are slow. I have around 8 fields passed in as group by fields. sourceFrame.select("blabla").groupBy("col1","col2","col3",..."col8").agg("bla bla"); OR

How to increase Spark partitions for the DataFrame?

2015-10-08 Thread unk1102
Hi, I have the following code where I read ORC files from HDFS; it loads a directory which contains 12 ORC files. Since the HDFS directory contains 12 files, it creates 12 partitions by default. This directory is huge, and when the ORC files get decompressed it becomes around 10 GB. How do I
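
A short Java sketch of one option, repartition(), assuming Spark 2.x; the path and partition count are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class RepartitionOrcInput {
        public static Dataset<Row> load(SparkSession spark) {
            Dataset<Row> orc = spark.read().format("orc").load("hdfs:///data/orc-dir");

            // 12 files -> 12 partitions is too few once the data decompresses to ~10 GB;
            // repartition adds a shuffle but spreads the work across all cores.
            return orc.repartition(200); // partition count is a tuning knob
        }
    }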

Why dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) hangs for long time?

2015-10-08 Thread unk1102
Hi, as recommended I am caching my Spark job's DataFrame with dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER), but what I see in the Spark job UI is that this persist stage runs for very long, showing 10 GB of shuffle read and 5 GB of shuffle write; it takes too long to finish, and because of that sometimes my

How to avoid Spark shuffle spill memory?

2015-10-06 Thread unk1102
Hi, I have a Spark job which runs for around 4 hours; it shares a SparkContext and runs many child jobs. When I look at each job in the UI I see shuffle spill of around 30 to 40 GB, and because of that executors often get lost for using physical memory beyond limits. How do I avoid shuffle

ORC files created by Spark job can't be accessed using hive table

2015-10-06 Thread unk1102
Hi, I have a Spark job which creates ORC files in partitions using the following code: dataFrame.write().mode(SaveMode.Append).partitionBy("entity","date").format("orc").save("baseTable"); The above code successfully creates ORC files which are readable in a Spark DataFrame, but when I try to load the ORC

How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread unk1102
Hi, I have a couple of Spark jobs which use group by queries fired from hiveContext.sql(). Now, I know group by is evil, but in my use case I can't avoid it; I have around 7-8 fields on which I need to group by. I am also using df1.except(df2), which also seems like a heavy operation, and

Can we use Spark Streaming to stream data from Hive table partitions?

2015-10-03 Thread unk1102
Hi, I have a couple of Spark jobs which read Hive table partition data and process it independently in different threads in a driver. Now the data to process is huge, in terms of TBs, and my jobs are not scaling and are running slowly. So I am thinking of using Spark Streaming as and when data is added into the Hive

How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread unk1102
Hi, I have registered my Hive UDF using the following code: hiveContext.udf().register("MyUDF", new UDF1<String, String>() { public String call(String o) throws Exception { // bla bla } }, DataTypes.StringType); Now I want to use the above MyUDF in a DataFrame. How do we use it? I know how to use it in SQL, and
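
A minimal Java sketch of using the registered name from the DataFrame API via callUDF (written against Spark 2.x; in 1.4 the method was spelled callUdf, and the column names here are placeholders):

    import static org.apache.spark.sql.functions.callUDF;
    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class UseRegisteredUdf {
        public static Dataset<Row> apply(Dataset<Row> df) {
            // "MyUDF" is the name passed to hiveContext.udf().register(...).
            return df.withColumn("myUdfResult", callUDF("MyUDF", col("someCol")));
        }
    }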

How to save a DataFrame as a table in HBase?

2015-10-01 Thread unk1102
Hi, has anybody tried to save a DataFrame in HBase? I have processed data in a DataFrame which I need to store in HBase so that my web UI can access it. Please guide. Thanks in advance.

Hive ORC malformed error while loading into Spark DataFrame

2015-09-29 Thread unk1102
Hi, I have a Spark job which creates Hive tables in ORC format with partitions. It works well; I can read the data back into a Hive table using the Hive console. But if I try to process the ORC files generated by the Spark job further, by loading them into a DataFrame, then I get the following exception. Caused by:

Best practices for calling small Spark jobs as part of a REST API

2015-09-29 Thread unk1102
Hi, I would like to know any best practices for calling Spark jobs from a REST API. My Spark jobs return results as JSON, and that JSON can be used by a UI application. Should we even have a direct HDFS/Spark backend layer in the UI for on-demand queries? Please guide. Thanks much.

Best practices for scheduling Spark jobs on "shared" YARN cluster using Autosys

2015-09-25 Thread unk1102
Hi, I have 5 Spark jobs which need to be run in parallel to speed up the process; together they take around 6-8 hours. I have 93 container nodes with 8 cores each and a memory capacity of around 2.8 TB. Now I run each job with around 30 executors with 2 cores and 20 GB each. Each of my jobs processes around 1

How to use a Hive UDF in a Spark DataFrame?

2015-09-13 Thread unk1102
Hi, I am using a UDF in a hiveContext.sql("") query; inside, it uses a group by which forces a huge data shuffle read of around 30 GB. I am thinking of converting the above query to the DataFrame API so that I avoid using group by. How do we use a Hive UDF in a Spark DataFrame? Please guide. Thanks much.

Best way to merge final output part files created by Spark job

2015-09-13 Thread unk1102
Hi, I have a Spark job which creates around 500 part files inside each directory I process, and I have thousands of such directories. So I need to merge these small 500 part files. I am using spark.sql.shuffle.partitions as 500, and my final small files are ORC files. Is there a way to merge ORC
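
Since ORC part files cannot simply be concatenated, one common approach is a follow-up compaction pass that reads a directory back and rewrites it with fewer partitions, sketched below in Java for Spark 2.x; the paths and file count are placeholders. Lowering spark.sql.shuffle.partitions for the final stage of the original job is the other way to avoid producing 500 files in the first place:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CompactOrcDirectory {
        public static void compact(SparkSession spark, String inputDir,
                                   String outputDir, int numFiles) {
            Dataset<Row> small = spark.read().format("orc").load(inputDir);

            // coalesce(n) narrows the partition count without a full shuffle.
            small.coalesce(numFiles)
                 .write()
                 .format("orc")
                 .mode(SaveMode.Overwrite)
                 .save(outputDir);
        }
    }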

Why is my Spark job slow and throwing OOM, which leads to YARN killing executors?

2015-09-12 Thread unk1102
Hi, I have the following Spark driver program/job which reads ORC files (i.e. Hive partitions as HDFS directories), processes them in a DataFrame and uses them as a table in hiveContext.sql(). The job runs fine and gives correct results, but it hits the physical memory limit after an hour or so and YARN kills

How to create broadcast variable from Java String array?

2015-09-12 Thread unk1102
Hi, I have a Java String array which contains 45 strings, which is basically a schema: String[] fieldNames = {"col1","col2",...}; Currently I am storing the above array of Strings in a static field of the driver. My job is running slow, so I am trying to refactor the code; I am using the String array when creating a DataFrame
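
The broadcast itself is a one-liner; a small Java sketch with placeholder column names (for an array this small the gain is modest, since the data is shipped once per executor instead of once per task closure):

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;

    public class BroadcastFieldNames {
        public static Broadcast<String[]> create(JavaSparkContext sc) {
            String[] fieldNames = {"col1", "col2", "col3"}; // placeholder schema columns

            // Read it on the executors with bc.value()[i] inside any function.
            Broadcast<String[]> bc = sc.broadcast(fieldNames);
            return bc;
        }
    }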

How to enable Tungsten in Spark 1.5 for Spark SQL?

2015-09-10 Thread unk1102
Hi, Spark 1.5 looks promising; how do we enable Project Tungsten for Spark SQL, or is it enabled by default? Please guide.

Spark rdd.mapPartitionsWithIndex() hits physical memory limit after huge data shuffle

2015-09-09 Thread unk1102
Hi, I have the following Spark code which involves huge data shuffling even though I am using mapPartitionsWithIndex() with shuffle set to false. I have 2 TB of skewed data to process; I then convert the RDD into a DataFrame and use it as a table in hiveContext.sql(). I am using 60 executors with 20 GB memory and 4

NPE while reading ORC file using Spark 1.4 API

2015-09-08 Thread unk1102
Hi, I read many ORC files in Spark and process them; those files are basically Hive partitions. Most of the time the processing goes well, but for a few files I get the following exception and I don't know why. These files work fine in Hive using Hive queries. Please guide. Thanks in advance. DataFrame df

Why is huge data shuffled in Spark when using union()/coalesce(1,false) on a DataFrame?

2015-09-04 Thread unk1102
Hi, I have a Spark job which does some processing on ORC data and stores the ORC data back using the DataFrameWriter save() API introduced in Spark 1.4.0. I have the following piece of code, which is using heavy shuffle memory. How do I optimize the code below? Is there anything wrong with it? It is working fine

Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

2015-09-02 Thread unk1102
Hi, I have a Spark DataFrame which I want to save as a Hive table with partitions. I tried the following two statements, but they don't work; I don't see any ORC files in the HDFS directory, it is empty. I can see baseTable is there in the Hive console, but obviously it is empty because there are no files inside HDFS. The

What should be the optimal value for spark.sql.shuffle.partitions?

2015-09-01 Thread unk1102
Hi, I am using Spark SQL, actually hiveContext.sql(), which uses group by queries, and I am running into OOM issues. So I am thinking of increasing the value of spark.sql.shuffle.partitions from the default of 200 to 1000, but it is not helping. Please correct me if I am wrong: these partitions will share the data shuffle

Spark executor OOM issue on YARN

2015-08-31 Thread unk1102
Hi, I have a Spark job whose executors hit OOM issues after some time, and my job hangs because of it, followed by a couple of IOException, Rpc client disassociated, shuffle not found, etc. errors. I have tried almost everything and don't know how to solve this OOM issue; please guide, I am fed up now. Here is what I

Spark YARN executors are not launching when using +UseG1GC

2015-08-23 Thread unk1102
Hi, I am hitting an issue of long GC pauses in my Spark job, and because of it YARN is killing executors one by one and the Spark job becomes slower and slower. I came across an article where they mentioned using G1GC; I tried to use the same command, but something seems wrong

Spark executor lost because of GC overhead limit exceeded even though using 20 executors with 25 GB each

2015-08-18 Thread unk1102
Hi, this GC overhead limit error is making me crazy. I have 20 executors using 25 GB each; I don't understand at all how it can throw GC overhead errors, and I also don't have that big a dataset. Once this GC error occurs in an executor, that executor gets lost, and slowly other executors get lost because of IOException,

Re: Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-17 Thread unk1102
val numStreams = 4 val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) } In Java, in a for loop, you would create four streams using KafkaUtils.createStream() so that each receiver runs in a different thread. For more information please visit

Calling hiveContext.sql("insert into table xyz...") in multiple threads?

2015-08-17 Thread unk1102
Hi, I have around 2000 Hive source partitions to process, inserting data into the same table but different partitions. For example, I have the following query: hiveContext.sql("insert into table myTable partition(mypartition=somepartition) bla bla"). If I call the above query in the Spark driver program it runs fine,

Spark executor lost because of timeout even after setting quite a long timeout value of 1000 seconds

2015-08-16 Thread unk1102
Hi, I have written a Spark job which seems to work fine for almost an hour, and after that executors start getting lost because of a timeout. I see the following log statement: 15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no recent heartbeats: 1051638 ms exceeds

Example code to spawn multiple threads in driver program

2015-08-16 Thread unk1102
Hi, I have a Spark driver program which has one loop that iterates around 2000 times, and for those two thousand iterations it executes jobs on YARN. Since the loop will do the work serially, I want to introduce parallelism. If I create 2000 tasks/runnables/callables in my Spark driver program, will they get executed
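
A minimal Java sketch of a bounded pool submitting independent actions from the driver; Spark's scheduler accepts jobs from multiple driver threads. Written against the Spark 2.x API (with 1.x, HiveContext.sql plays the same role); the pool size, staging table name and 12-hour wait are placeholders, and spark.scheduler.mode=FAIR can be set if the jobs should share the cluster instead of running FIFO:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.spark.sql.SparkSession;

    public class ParallelDriverJobs {
        public static void run(SparkSession spark, List<String> partitions)
                throws InterruptedException {
            // Bound the concurrency instead of creating 2000 threads at once.
            ExecutorService pool = Executors.newFixedThreadPool(8);

            for (String p : partitions) {
                pool.submit(() ->
                        spark.sql("INSERT INTO TABLE myTable PARTITION (mypartition='" + p + "') "
                                + "SELECT * FROM staging_table WHERE mypartition='" + p + "'"));
            }

            pool.shutdown();
            pool.awaitTermination(12, TimeUnit.HOURS);
        }
    }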

How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread unk1102
Hi, I have my own custom Hadoop InputFormat which I want to use with a DataFrame. How do we do that? I know I can use sc.hadoopFile(..), but then how do I convert it into a DataFrame? JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd =

How to create DataFrame from a binary file?

2015-08-08 Thread unk1102
Hi, how do we create a DataFrame from a binary file stored in HDFS? I was thinking to use JavaPairRDD<String, PortableDataStream> pairRdd = javaSparkContext.binaryFiles("/hdfs/path/to/binfile"); JavaRDD<PortableDataStream> javaRdd = pairRdd.values(); I can see that PortableDataStream has a method called

Best practices to call hiveContext in DataFrame.foreach in executor program or how to have a for loop in driver program

2015-08-05 Thread unk1102
Hi, I have the following code which fires hiveContext.sql() most of the time. My task is to create a few tables and insert values into them after processing, for all Hive table partitions. So I first fire 'show partitions' and, using its output, in a for loop I call a few methods which create the table if not

How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread unk1102
Hi, I have my own custom Hadoop InputFormat which I need to use in creating a DataFrame. I tried the following: JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd = jsc.hadoopFile("hdfs://tmp/data/myformat.xyz", MyInputFormat.class, Void.class, MyRecordWritable.class); JavaRDD<MyRecordWritable>

How to keep Spark executors from getting lost when using YARN client mode?

2015-07-30 Thread unk1102
Hi, I have one Spark job which runs fine locally with less data, but when I schedule it on YARN to execute, I keep on getting the following ERROR, and slowly all executors get removed from the UI and my job fails: 15/07/30 10:18:13 ERROR cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc

Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread unk1102
Hi, I have Spark Streaming code which streams from a Kafka topic. It used to work fine, but suddenly it started throwing the following exception: Exception in thread main org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at

Spark DataFrame created from JavaRDD<Row> copies all columns' data into the first column

2015-07-22 Thread unk1102
Hi, I have a DataFrame which I need to convert into a JavaRDD and back to a DataFrame. I have the following code: DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file"); // I do an order by on the above sourceFrame and then I convert it into a JavaRDD JavaRDD<Row> modifiedRDD =

SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread unk1102
Hi, I could successfully install the SparkR package into my RStudio, but I could not execute anything against sc or sqlContext. I did the following: Sys.setenv(SPARK_HOME="/path/to/spark-1.4.1") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths())) library(SparkR) The above code installs

Re: SparkR sqlContext or sc not found in RStudio

2015-07-21 Thread unk1102
Hi, thanks for the reply. I did download it from GitHub and build it, and it is working fine; I can use spark-submit etc. When I use it in RStudio I don't know why it says sqlContext not found. When I do the following, sqlContext <- sparkRSQL.init(sc), I get: Error: object sqlContext not found. If I do the

What is the correct syntax of using Spark streamingContext.fileStream()?

2015-07-20 Thread unk1102
Hi, I am trying to find the correct way to use the Spark Streaming API streamingContext.fileStream(String, Class<K>, Class<V>, Class<F>). I tried to find an example but could not find one anywhere in the Spark documentation. I have to stream files from HDFS which are in a custom Hadoop format.

Re: Store DStreams into Hive using Hive Streaming

2015-07-17 Thread unk1102
Hi, I have a similar use case. Did you find a solution to this problem of loading DStreams into Hive using Spark Streaming? Please guide. Thanks.

How to maintain multiple JavaRDDs created within another method like javaStreamRDD.foreachRDD

2015-07-14 Thread unk1102
I use Spark Streaming, where messages read from Kafka topics are stored into a JavaDStream<String>; this rdd contains the actual data. Now, after going through the documentation and other help, I have found that we traverse a JavaDStream using foreachRDD: javaDStreamRdd.foreachRDD(new Function<JavaRDD<String>, Void>() {

Does Spark Streaming support streaming from a database table?

2015-07-13 Thread unk1102
Hi, I did Kafka streaming through Spark Streaming. I have a use case where I would like to stream data from a database table. I see JDBCRDD is there, but that is not what I am looking for; I need continuous streaming, like JavaSparkStreaming, which continuously runs and listens to changes in a database