Hi all,
I'm using Spark 1.5 and getting this error. Could you help? I can't
understand it.
16/01/23 10:11:59 ERROR TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task
org.apache.spark.scheduler.TaskResultGetter$$anon$2@62c72719 rejected from
This seems related:
SPARK-8029 ShuffleMapTasks must be robust to concurrent attempts on the
same executor
Mind trying out 1.5.3 or a later release?
Cheers
On Sat, Jan 23, 2016 at 12:51 AM, Yasemin Kaya wrote:
> Hi all,
>
> I'm using spark 1.5 and getting this error. Could
"I don't think you could avoid this
in general, right, in any system? "
Really? NoSQL databases do efficient lookups (and scans) based on key and
partition. Look at Cassandra, HBase.
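Spark can do something similar when an RDD has a known partitioner: lookup()
only searches the partition the key maps to. A minimal sketch (names assumed):

  import org.apache.spark.HashPartitioner

  val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    .partitionBy(new HashPartitioner(8))   // install a known partitioner
    .cache()

  // With a partitioner set, lookup(2) searches only the owning partition
  // instead of scanning the whole RDD.
  val values = pairs.lookup(2)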
Hi Raju,
Could you please explain your expected behavior with the DStream? The
DStream will have events only from the 'fromOffsets' that you provided in
createDirectStream (which I think is the expected behavior).
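For reference, a minimal sketch of that call (broker, topic, offsets, and
types are assumed; ssc is your StreamingContext), using the 1.5-era direct
stream API:

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // hypothetical
  val fromOffsets = Map(TopicAndPartition("events", 0) -> 12345L)  // hypothetical

  // The stream starts exactly at the offsets given in fromOffsets.
  val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))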
As for the smaller files, you will have to deal with them if you
intend to
Hi,
I am trying to write data from Spark to a Hive partitioned table:
DataFrame dataFrame = sqlContext.createDataFrame(rdd, schema);
dataFrame.write().partitionBy("YEAR","MONTH","DAY").saveAsTable(tableName);
The data is not being written to the Hive table (HDFS location:
/user/hive/warehouse//),
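One thing worth checking is that the write goes through a HiveContext with an
explicit save mode and format. A sketch under those assumptions (names as in
the snippet above):

  import org.apache.spark.sql.SaveMode
  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  val dataFrame = hiveContext.createDataFrame(rdd, schema)
  dataFrame.write
    .mode(SaveMode.Append)                 // or Overwrite, depending on intent
    .format("parquet")
    .partitionBy("YEAR", "MONTH", "DAY")
    .saveAsTable(tableName)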
Looks like this has been supported since the 1.4 release :)
https://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions
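Assuming the reference is to filterByRange on OrderedRDDFunctions, a minimal
sketch: after a sortByKey the RDD carries a RangePartitioner, so the filter
only scans the partitions that can contain the requested keys.

  val pairs = sc.parallelize(1 to 10000000).map(i => (i, i))
  val sorted = pairs.sortByKey()                 // installs a RangePartitioner
  val slice = sorted.filterByRange(1000, 2000)   // scans only matching partitions
  slice.count()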
The problem is I have an RDD of about 10M rows and it keeps growing. Every
time we want to run a query and compute on a subset of the data, we have to
use filter and then some aggregation. As I understand it, filter goes through
every partition and every row of the RDD, which may not be efficient at all.
Spark
Thanks for the quick reply.
I am creating the Kafka DStream by passing an offsets map. I have pasted a
code snippet in my earlier mail. Let me know if I am missing something.
I want to use Spark checkpointing for handling only driver/executor failures.
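For driver failures specifically, the usual pattern is
StreamingContext.getOrCreate, so a restarted driver rebuilds the DAG and the
stored offsets from the checkpoint. A sketch (path, app name, and batch
interval are assumed):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val checkpointDir = "hdfs:///tmp/app-checkpoint"     // hypothetical path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-app") // hypothetical name
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDir)
    // build the Kafka direct stream and the rest of the pipeline here
    ssc
  }

  // The first run builds a fresh context; after a driver failure the context,
  // including stored offsets, is recovered from the checkpoint directory.
  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()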
On Jan 22, 2016 10:08 PM, "Cody Koeninger"
Hi,
I have a big database table (1 million plus records) in Oracle. I need to
query records based on input numbers. For this use case, I am doing the
steps below.
I am creating two data frames.
DF1 = I am computing this using a SQL query. It has one million plus
records.
DF2 = I have a list of
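One way to structure this (a sketch; the connection details, column names,
and input list are all assumed) is to load DF1 over JDBC and broadcast the
small DF2 so the million-row side is not shuffled:

  import org.apache.spark.sql.functions.broadcast

  val df1 = sqlContext.read.format("jdbc").options(Map(
    "url"     -> "jdbc:oracle:thin:@//dbhost:1521/SVC",   // hypothetical
    "dbtable" -> "(SELECT id, payload FROM big_table) t", // hypothetical
    "driver"  -> "oracle.jdbc.OracleDriver"
  )).load()

  val df2 = sqlContext.createDataFrame(Seq(101L, 202L, 303L).map(Tuple1.apply))
    .toDF("id")                                           // hypothetical inputs

  // broadcast() (available since 1.5) hints a broadcast hash join.
  val matched = df1.join(broadcast(df2), "id")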
Is the 'hadoop' / 'hdfs' command accessible to your Python script?
If so, you can call 'hdfs dfs -ls' from python.
Cheers
On Sat, Jan 23, 2016 at 4:08 AM, Andrew Holway <
andrew.hol...@otternetworks.de> wrote:
> Hello,
>
> I would like to make a list of files (parquet or json) in a specific
>
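If shelling out is awkward, the Hadoop FileSystem API can do the same listing
on the JVM side; a sketch in Scala (the directory path is assumed):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)
  val files = fs.listStatus(new Path("/data/parquet"))   // hypothetical path
    .filter(_.isFile)
    .map(_.getPath.toString)
  files.foreach(println)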
Hi,
I am trying to write data from Spark to a Hive partitioned table. The job
runs without any error, but it is not writing the data to the correct
location.
job-executor-0] parquet.ParquetRelation (Logging.scala:logInfo(59)) -
Listing
Is there a data frame operation to do this?
+---+---+---+---+
| A | B | C | D |
+---+---+---+---+
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
+---+---+---+---+

+---+---+---+---+
| A | B | C | D |
+---+---+---+---+
| 3 | 5 | 6 | 8 |
| 0 | 0 | 0 | 0 |
+---+---+---+---+

+---+---+---+---+
| A | B | C | D |
+---+---+---+---+
| 8 | 8 | 8 | 8 |
| 1 | 1 | 1 | 1 |
+---+---+---+---+
Concatenated together to make this.
How about this operation:

/**
 * Returns a new [[DataFrame]] containing union of rows in this frame and
 * another frame.
 * This is equivalent to `UNION ALL` in SQL.
 * @group dfops
 * @since 1.3.0
 */
def unionAll(other: DataFrame): DataFrame = withPlan {
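So, with hypothetical names df1, df2, and df3 for the three frames above,
chaining does the row-wise concatenation:

  val combined = df1.unionAll(df2).unionAll(df3)

Note that unionAll, like UNION ALL in SQL, keeps duplicate rows and matches
columns by position.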
FYI
On Sat, Jan 23, 2016 at
On 23 Jan 2016 9:18 p.m., "Deenar Toraskar" <
deenar.toras...@thinkreactive.co.uk> wrote:
> df.unionAll(df2).unionAll(df3)
> On 23 Jan 2016 9:02 p.m., "Andrew Holway"
> wrote:
>
>> Is there a data frame operation to do this?
>>
>> +---+---+---+---+
>> | A | B | C | D |
>>
Hi Tathagata/Cody,
I am facing a challenge in production with DAG behaviour during
checkpointing in Spark Streaming:
Step 1: Read data from Kafka every 15 min - call this KafkaStreamRDD, ~100
GB of data.
Step 2: Repartition KafkaStreamRDD from 5 to 100 partitions to parallelise
processing -
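For reference, the repartition step itself is a single DStream call (a
sketch, with the stream name assumed):

  val repartitioned = kafkaStreamRDD.repartition(100)   // 5 -> 100 partitions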
Hi All,
I have enabled replication for my RDDs.
The Storage tab of the Spark UI shows the replication level, 2x or 1x.
But the names given are mappedRDD, shuffledRDD, and so on, so I am not able
to tell which of my RDDs is 2x replicated and which one is 1x.
Please help.
Regards,
Gaurab
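One way to make the Storage tab attributable (a sketch, assuming the RDDs
are persisted explicitly) is to name each RDD before persisting; the given
name then appears in the UI instead of the generic mappedRDD/shuffledRDD:

  import org.apache.spark.storage.StorageLevel

  val critical = rdd.setName("criticalRDD").persist(StorageLevel.MEMORY_ONLY_2)  // 2x
  val regular  = other.setName("regularRDD").persist(StorageLevel.MEMORY_ONLY)   // 1x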
How can I request this API?
See this closed issue: https://issues.apache.org/jira/browse/SPARK-12863
On Tue, Jan 19, 2016 at 10:12 PM, Michael Armbrust
wrote:
> In Spark 2.0 we are planning to combine DataFrame and Dataset so that all
> the methods will be available