concurrent.RejectedExecutionException

2016-01-23 Thread Yasemin Kaya
Hi all, I'm using Spark 1.5 and getting this error. Could you help? I can't understand it. 16/01/23 10:11:59 ERROR TaskSchedulerImpl: Exception in statusUpdate java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$2@62c72719 rejected from

Re: concurrent.RejectedExecutionException

2016-01-23 Thread Ted Yu
This seems related: SPARK-8029 (ShuffleMapTasks must be robust to concurrent attempts on the same executor). Mind trying out 1.5.3 or a later release? Cheers On Sat, Jan 23, 2016 at 12:51 AM, Yasemin Kaya wrote: > Hi all, > > I'm using Spark 1.5 and getting this error. Could

Re: Does filter on an RDD scan every data item ?

2016-01-23 Thread nir
"I don't think you could avoid this in general, right, in any system? " Really? nosql databases do efficient lookups(and scan) based on key and partition. look at cassandra, hbase -- View this message in context:

Re: [Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-23 Thread Yash Sharma
Hi Raju, Could you please explain the behavior you expect from the DStream? The DStream will have events only from the 'fromOffsets' that you provided in createDirectStream (which I think is the expected behavior). As for the smaller files, you will have to deal with them if you intend to
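
For reference, a minimal sketch of starting a direct stream from explicit offsets with the Spark 1.x Kafka direct API (broker address, topic, partitions, and offsets are placeholders; an existing StreamingContext ssc is assumed):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
// Start each topic partition from an explicitly chosen offset.
val fromOffsets = Map(
  TopicAndPartition("events", 0) -> 12345L,
  TopicAndPartition("events", 1) -> 67890L)

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

Note that when a StreamingContext is recovered from a checkpoint, the DStream graph is restored along with its checkpointed offsets, so a fromOffsets map like this only takes effect on a fresh start.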

Spark not saving data to Hive

2016-01-23 Thread Akhilesh Pathodia
Hi, I am trying to write data from Spark to a Hive partitioned table: DataFrame dataFrame = sqlContext.createDataFrame(rdd, schema); dataFrame.write().partitionBy("YEAR","MONTH","DAY").saveAsTable(tableName); The data is not being written to the Hive table (HDFS location: /user/hive/warehouse//),
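
A hedged sketch of a common alternative when saveAsTable does not land where expected: insert into a pre-created partitioned Hive table through HiveContext with dynamic partitioning enabled (database/table names are placeholders; rdd and schema are the ones from the snippet above):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Dynamic partitioning must be enabled for partitioned inserts.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

val dataFrame = hiveContext.createDataFrame(rdd, schema)
// insertInto targets an existing Hive table; the partition columns
// (YEAR, MONTH, DAY) must be the trailing columns of the frame.
dataFrame.write.mode(SaveMode.Append).insertInto("mydb.mytable")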

Re: Does filter on an RDD scan every data item ?

2016-01-23 Thread nir
Looks like this has been supported since the 1.4 release :) https://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions
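
A minimal sketch of the range scan those docs describe: once sortByKey has installed a RangePartitioner, filterByRange can skip partitions whose key range cannot overlap the requested bounds (local-mode setup shown only to keep the example self-contained):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("range-scan").setMaster("local[*]"))
val pairs = sc.parallelize(1 to 1000000).map(i => (i, s"row-$i"))
val sorted = pairs.sortByKey()              // installs a RangePartitioner
val slice = sorted.filterByRange(100, 200)  // prunes partitions outside [100, 200]
println(slice.count())                      // 101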

How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

2016-01-23 Thread Nirav Patel
The problem is that I have an RDD of about 10M rows, and it keeps growing. Every time we want to query and compute on a subset of the data, we have to use filter and then some aggregation. As I understand it, filter goes through every partition and every row of the RDD, which may not be efficient at all. Spark

Re: [Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-23 Thread Raju Bairishetti
Thanks for the quick reply. I am creating the Kafka DStream by passing an offsets map. I have pasted the code snippet in my earlier mail. Let me know if I am missing something. I want to use the Spark checkpoint for handling only driver/executor failures. On Jan 22, 2016 10:08 PM, "Cody Koeninger"

Clarification on Data Frames joins

2016-01-23 Thread Madabhattula Rajesh Kumar
Hi, I have a big database table (1 million+ records) in Oracle. I need to query records based on input numbers. For this use case, I am doing the steps below. I am creating two data frames. DF1 = computed using a SQL query; it has 1 million+ records. DF2 = I have a list of
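
A hedged sketch of that setup: put the input numbers in a small frame and join it against the big one, broadcasting the small side so the million-row side is not shuffled (table and column names are illustrative; an existing SQLContext is assumed):

import org.apache.spark.sql.functions.broadcast
import sqlContext.implicits._

val df1 = sqlContext.sql("SELECT id, payload FROM big_oracle_mirror")  // 1M+ rows
val df2 = Seq(42L, 7L, 1001L).map(Tuple1.apply).toDF("id")             // input numbers

// broadcast() hints the planner to ship df2 to every executor.
val matched = df1.join(broadcast(df2), "id")
matched.show()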

Re: python - list objects in HDFS directory

2016-01-23 Thread Ted Yu
Is the 'hadoop' / 'hdfs' command accessible to your python script? If so, you can call 'hdfs dfs -ls' from python. Cheers On Sat, Jan 23, 2016 at 4:08 AM, Andrew Holway < andrew.hol...@otternetworks.de> wrote: > Hello, > > I would like to make a list of files (parquet or json) in a specific >
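
The same listing can also be done in-process through the Hadoop FileSystem API rather than shelling out; a sketch (in Scala, for consistency with the other examples here; the directory path is a placeholder):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// List one directory; filter on extension if only parquet/json is wanted.
fs.listStatus(new Path("/data/parquet"))
  .filter(_.isFile)
  .foreach(status => println(status.getPath))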

Spark not writing data in Hive format

2016-01-23 Thread Akhilesh Pathodia
Hi, I am trying to write data from Spark to a Hive partitioned table. The job is running without any error, but it is not writing the data to the correct location. job-executor-0] parquet.ParquetRelation (Logging.scala:logInfo(59)) - Listing

Concatenating tables

2016-01-23 Thread Andrew Holway
Is there a data frame operation to do this?

+---------+
| A B C D |
+---------+
| 1 2 3 4 |
| 5 6 7 8 |
+---------+

+---------+
| A B C D |
+---------+
| 3 5 6 8 |
| 0 0 0 0 |
+---------+

+---------+
| A B C D |
+---------+
| 8 8 8 8 |
| 1 1 1 1 |
+---------+

Concatenated together to make this.

Re: Concatenating tables

2016-01-23 Thread Ted Yu
How about this operation:

/**
 * Returns a new [[DataFrame]] containing union of rows in this frame and another frame.
 * This is equivalent to `UNION ALL` in SQL.
 * @group dfops
 * @since 1.3.0
 */
def unionAll(other: DataFrame): DataFrame = withPlan {

FYI

On Sat, Jan 23, 2016 at

Re: Concatenating tables

2016-01-23 Thread Deenar Toraskar
On 23 Jan 2016 9:18 p.m., "Deenar Toraskar" < deenar.toras...@thinkreactive.co.uk> wrote: > df.unionAll(df2).unionAll(df3) > On 23 Jan 2016 9:02 p.m., "Andrew Holway" > wrote: > >> Is there a data frame operation to do this? >> >> +-+ >> | A B C D | >>
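
A runnable sketch of that chaining, using the sample rows from the original question (SQLContext setup assumed):

import sqlContext.implicits._

val df1 = Seq((1, 2, 3, 4), (5, 6, 7, 8)).toDF("A", "B", "C", "D")
val df2 = Seq((3, 5, 6, 8), (0, 0, 0, 0)).toDF("A", "B", "C", "D")
val df3 = Seq((8, 8, 8, 8), (1, 1, 1, 1)).toDF("A", "B", "C", "D")

// unionAll requires matching schemas and appends by column position.
val all = df1.unionAll(df2).unionAll(df3)
all.show()  // six rows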

Spark RDD DAG behaviour understanding in case of checkpointing

2016-01-23 Thread gaurav sharma
Hi Tathagata/Cody, I am facing a challenge in production with DAG behaviour during checkpointing in Spark Streaming. Step 1: Read data from Kafka every 15 min - call this KafkaStreamRDD, ~100 GB of data. Step 2: Repartition KafkaStreamRDD from 5 to 100 partitions to parallelise processing -
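
For context, a hedged sketch of those two steps (batch interval, checkpoint path, and the stream itself are placeholders; kafkaStream stands in for the DStream created in Step 1):

import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Minutes(15))  // Step 1 cadence
ssc.checkpoint("hdfs:///checkpoints/my-app")     // enables DStream checkpointing

// Step 2: widen from 5 to 100 partitions before the heavy computation.
val repartitioned = kafkaStream.repartition(100)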

Debug what is replication Level of which RDD

2016-01-23 Thread gaurav sharma
Hi All, I have enabled replication for my RDDs. The Storage tab of the Spark UI mentions the replication level, 2x or 1x, but the names shown are MappedRDD, ShuffledRDD, so I am not able to tell which of my RDDs is 2x replicated and which one is 1x. Please help. Regards, Gaurav
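
One way to make the Storage tab readable, as a minimal sketch: name each RDD explicitly before persisting, so the UI shows that name next to its storage level (rdd and other are placeholders for your own RDDs):

import org.apache.spark.storage.StorageLevel

// setName returns the same RDD, so it chains with persist.
val twiceReplicated = rdd.setName("parsed-events-2x")
  .persist(StorageLevel.MEMORY_ONLY_2)  // replicated on two nodes
val onceReplicated = other.setName("lookup-table-1x")
  .persist(StorageLevel.MEMORY_ONLY)    // single copy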

Re: Spark Dataset doesn't have api for changing columns

2016-01-23 Thread Milad khajavi
How can I request this API? See this closed issue: https://issues.apache.org/jira/browse/SPARK-12863 On Tue, Jan 19, 2016 at 10:12 PM, Michael Armbrust wrote: > In Spark 2.0 we are planning to combine DataFrame and Dataset so that all > the methods will be available
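
Until such an API lands, a commonly used workaround (a sketch; names are illustrative): drop to DataFrame, rename there, and map back to a typed Dataset:

import sqlContext.implicits._

case class Event(newName: String)

// withColumnRenamed renames a single column on the DataFrame side.
val renamed = ds.toDF().withColumnRenamed("oldName", "newName")
val typed = renamed.as[Event]  // back to a typed Dataset (Spark 1.6)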