Please check the node manager logs to see why the container is killed.
On Mon, Aug 3, 2015 at 11:59 PM, Umesh Kacha umesh.ka...@gmail.com wrote:
Hi all, any help will be much appreciated. My Spark job runs fine, but in the
middle it starts losing executors because of a MetadataFetchFailed exception.
Yes... union would be one solution. I am not doing any aggregation, hence
reduceByKey would not be useful. If I use groupByKey, messages with the same
key would be obtained in a partition. But groupByKey is a very expensive
operation as it involves a shuffle. My ultimate goal is to write
the
It cannot translate the number back to the word unless you store the mapping
in a map yourself.
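A minimal sketch of keeping that mapping yourself, assuming MLlib's HashingTF (the vocabulary below is made up, and hash collisions can make the map lossy):

import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF()
val vocabulary = Seq("spark", "kafka", "streaming")  // hypothetical vocabulary
// Build your own index -> word map; HashingTF itself is one-way.
val index2word: Map[Int, String] =
  vocabulary.map(w => tf.indexOf(w) -> w).toMap
// Later, translate a hashed index back to a word (if it was in the vocabulary):
val maybeWord = index2word.get(tf.indexOf("spark"))  // Some("spark")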
2015-07-31 1:45 GMT+08:00 hans ziqiu li thenewh...@gmail.com:
Hello spark users!
I am having some trouble with TF-IDF in MLlib and was wondering if
anyone can point me in the right direction.
The
sorry, can't disclose info about my prod cluster
nothing jumps into my mind regarding your config
we don't use lz4 compression, and I don't know what spark.deploy.spreadOut is
(there is no documentation for it)
If you are sure that you don't have a memory leak in your business logic, I
would try
It looks like the predicted result is just the opposite of what was expected,
so could you check whether the labels are right?
Or could you share some sample data that would help reproduce this output?
2015-08-03 19:36 GMT+08:00 Barak Gitsis bar...@similarweb.com:
hi,
I've run into some poor RF behavior,
Hi everyone,
I'm working with Spark Streaming, and I need to perform some offline
performance measures.
What I'd like to have is a CSV file that reports something like this:
*Batch number/timestamp | Input Size | Total Delay*
which is in fact similar to what the UI outputs.
I tried
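One way to get those numbers, as a sketch (not from the original thread; assumes Spark's StreamingListener API and that ssc is your StreamingContext):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class CsvMetricsListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // batch timestamp (ms), input size, total delay (ms; -1 if not yet known)
    println(s"${info.batchTime.milliseconds},${info.numRecords},${info.totalDelay.getOrElse(-1L)}")
  }
}
ssc.addStreamingListener(new CsvMetricsListener)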
Hi Clark,
the problem is that in this dataset null values are represented by an NA
marker. spark-csv doesn't have a configurable null-value marker (I've
made a PR for it some time ago:
https://github.com/databricks/spark-csv/pull/76).
So one option for you is to do post filtering, something like
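The original snippet is cut off here; a minimal sketch of such post-filtering (the column name is made up):

// Drop rows where the value is the literal string "NA"
val cleaned = df.filter(df("someColumn") !== "NA")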
Hello,
I am trying to manage NA values in this dataset. I import my dataset with the
com.databricks.spark.csv package.
When I do this: allyears2k.na.drop() I get no result.
Can you help me please ?
Regards,
--- The dataset ---
dataset:
Hi there!
If you missed our webinar on IoT data ingestion in Spark with KaaIoT, see the
video and slides here: http://goo.gl/VMyQ1M
We recorded our webinar on “IoT data ingestion in Spark Streaming using Kaa”
for those who couldn’t see it live or who would like to refresh what they have
Hi Sadaf,
I'm currently struggling with Twitter streaming as well. I can't get it working
using the simple setup below. I use Spark 1.2 and I replaced twitter4j v3 with
v4. Am I doing something wrong? How are you doing this?
twitter4j.conf.Configuration conf = new
Hi Sean,
Thanks for the information and clarification.
On Tue, Aug 4, 2015 at 1:04 PM, Sean Owen so...@cloudera.com wrote:
It won't affect you if you're not actually running Hadoop. But it's
mainly things like Snappy/LZO compression which are implemented as
native libraries under the hood.
If you want to do it through the streaming API you have to pay for Gnip; it's
not free. You can go through the non-streaming Twitter API and convert it to a
stream yourself, though.
On 4 Aug 2015, at 09:29, Sadaf sa...@platalytics.com wrote:
Hi
Is there any way to get all old tweets since when the
If you are using Kafka, then you can basically push an entire file as a
message to Kafka. In that case, in your DStream you will receive a single
message which is the contents of the file, and it can of course span
multiple lines.
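For illustration, a rough sketch of the producer side (the topic name, broker address, and file path are made up; assumes the Kafka 0.8.2+ producer API):

import java.nio.file.{Files, Paths}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Read the whole (possibly multi-line) file and send it as ONE Kafka message.
val contents = new String(Files.readAllBytes(Paths.get("/tmp/input.txt")), "UTF-8")
producer.send(new ProducerRecord[String, String]("file-topic", contents))
producer.close()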
Thanks
Best Regards
On Mon, Aug 3, 2015 at 8:27 PM, Spark
Hi all!
I want to skip the first n rows of a DataFrame. In standard SQL this is done
using the OFFSET keyword. How can we achieve this in Spark SQL?
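One workaround, as a sketch (not from the thread; it assumes the DataFrame's row order is meaningful):

val n = 10L
// Emulate SQL's OFFSET n: index every row, then drop the first n.
val skipped = df.rdd.zipWithIndex()
  .filter { case (_, idx) => idx >= n }
  .map { case (row, _) => row }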
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/giving-offset-in-spark-sql-tp24130.html
Sent from the Apache
Hi,
Does Spark SQL support Hive 0.14? The documentation refers to Hive 0.13. Is
there a way to compile Spark with Hive 0.14?
Currently we are using Spark 1.3.1.
Thanks
--
View this message in context:
Can you elaborate on what this native library covers? You mentioned
accelerated compression as one example.
It would be very helpful if you could give a useful link where I can read more
about it.
On Tue, Aug 4, 2015 at 12:56 PM, Sean Owen so...@cloudera.com wrote:
You can ignore it entirely. It
Hi,
I am trying to read data from Kafka and process it using Spark.
I have attached my source code and error log.
For integrating Kafka, I have added this dependency in pom.xml:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.10</artifactId>
You can ignore it entirely. It just means you haven't installed and
configured native libraries for things like accelerated compression,
but it has no negative impact otherwise.
On Tue, Aug 4, 2015 at 8:11 AM, Deepesh Maheshwari
deepesh.maheshwar...@gmail.com wrote:
Hi,
When I run Spark
It won't affect you if you're not actually running Hadoop. But it's
mainly things like Snappy/LZO compression which are implemented as
native libraries under the hood. Spark doesn't necessarily use these
anyway; it's from the Hadoop libs.
On Tue, Aug 4, 2015 at 8:30 AM, Deepesh Maheshwari
One approach would be to use a job server in between and create SparkContexts
in it. Let's say you create two: one configured to run coarse-grained and
another set to fine-grained. Let the high-priority jobs hit the coarse-grained
SparkContext and the other jobs use the fine-grained one.
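A hedged sketch of the two configurations such a job server might hold (each SparkContext has to live in its own JVM/process; the app names are made up):

import org.apache.spark.SparkConf

val coarseConf = new SparkConf()
  .setAppName("high-priority-jobs")
  .set("spark.mesos.coarse", "true")   // coarse-grained Mesos mode

val fineConf = new SparkConf()
  .setAppName("low-priority-jobs")
  .set("spark.mesos.coarse", "false")  // fine-grained Mesos mode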
On Mon, Aug 3, 2015 at 9:00 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
Recently, I met some problems with scheduler delay in PySpark. I worked on
this problem for several days without success, so I am coming here to ask for
help.
I have a key-value pair RDD like rdd[(key, list[dict])]
Sim, did you find anything? :)
On Sun, Jul 26, 2015 at 9:31 AM, sim s...@swoop.com wrote:
The schema merging
http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging
section of the Spark SQL documentation shows an example of schema evolution
in a partitioned table.
Is
Hi,
When I run Spark locally on Windows, it gives the Hadoop library error below.
I am using the following Spark dependency:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.4.1</version>
</dependency>
2015-08-04 12:22:23,463
Hi
Is there any way to get all old tweets, going back to when the account was
created, using Spark Streaming and Twitter's API? Currently my connector only
shows tweets posted after the program starts. I've done this task using Spark
Streaming and a custom receiver built on the Twitter user API.
Just to add: rdd.take(1) won't trigger the entire computation; it will just
pull out the first record. You need to do an rdd.count() or rdd.saveAs*Files
to trigger the complete pipeline. How many partitions do you see in the
last stage?
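To illustrate the difference (parseLine is a hypothetical function):

val parsed = sc.textFile("hdfs:///data/input").map(parseLine)
parsed.take(1)  // computes only as many partitions as needed for one record
parsed.count()  // evaluates every partition, triggering the full pipeline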
Thanks
Best Regards
On Tue, Aug 4, 2015 at 7:10 AM, ayan guha
Hi,
I am new to Apache Spark and am exploring Spark + Kafka integration to process
data using Spark, which I did earlier with MongoDB aggregation.
I am not able to figure out how to handle my use case.
Mongo document:
{
  "_id" : ObjectId("55bfb3285e90ecbfe37b25c3"),
  "url" :
How many machines are there in your standalone cluster?
I am not using Tachyon.
GC cannot help me... Can anyone help?
my configuration:
spark.deploy.spreadOut false
spark.eventLog.enabled true
spark.executor.cores 24
spark.ui.retainedJobs 10
spark.ui.retainedStages 10
(removing dev from the to: as not relevant)
it would be good to see some sample data and the Cassandra schema to get a
more concrete idea of the problem space.
Some thoughts: reduceByKey could still be used to 'pick' one element. An
example of arbitrarily choosing the first one: reduceByKey{case
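The snippet above is cut off; a minimal sketch of that idea (pairs is a hypothetical key-value RDD):

// Keep one arbitrary element per key (here, whichever comes first in the reduce)
val picked = pairs.reduceByKey { case (first, _) => first }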
Hi, I had the same problem and I didn't find a solution. I used Word2Vec
instead.
I am interested in a solution to this problem of how to go back from the
TF-IDF hashing to the word.
Regards,
Clark
On Tuesday, August 4, 2015 at 1:03 PM, Yanbo Liang yblia...@gmail.com wrote:
It cannot
Yes I upgraded but I would still like to set an overall stage timeout. Does
that exist?
On Fri, Jul 31, 2015 at 1:13 PM, Ted Yu yuzhih...@gmail.com wrote:
The referenced bug has been fixed in 1.4.0, are you able to upgrade ?
Cheers
On Fri, Jul 31, 2015 at 10:01 AM, William Kinney
It should be safe for Spark 1.4.1 and later versions.
Now Spark SQL adds a job-wise UUID to output file names to distinguish
files written by different write jobs. So those two write jobs you gave
should play well with each other. And the job committed later will
generate a summary file for
You need to import org.apache.spark.sql.SaveMode
Cheng
On 7/31/15 6:26 AM, satyajit vegesna wrote:
Hi,
I am new to using Spark and Parquet files.
Below is what I am trying to do in spark-shell:
val df =
sqlContext.parquetFile("/data/LM/Parquet/Segment/pages/part-m-0.gz.parquet")
Have
Those jobs will still be created for each valid time; they just may not
have many messages in them.
On Mon, Aug 3, 2015 at 11:11 PM, Shushant Arora shushantaror...@gmail.com
wrote:
1. In Spark 1.3 (non-receiver) - if my batch interval is 1 sec and I don't
set
w.r.t. spark.deploy.spreadOut, here is the scaladoc:
// As a temporary workaround before better ways of configuring memory, we
allow users to set
// a flag that will perform round-robin scheduling across the nodes
(spreading out each app
// among all the nodes) instead of trying to
You will have to write your own consumer for pulling your custom feeds, and
then you can do a union (customfeedDstream.union(twitterStream)) with the
Twitter stream API.
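Roughly like this, as a sketch (MyCustomFeedReceiver is a hypothetical Receiver[twitter4j.Status], so the element types line up for union):

import org.apache.spark.streaming.twitter.TwitterUtils

val twitterStream = TwitterUtils.createStream(ssc, None)
val customfeedDstream = ssc.receiverStream(new MyCustomFeedReceiver())
val combined = customfeedDstream.union(twitterStream)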
Thanks
Best Regards
On Tue, Aug 4, 2015 at 2:28 PM, Sadaf Khan sa...@platalytics.com wrote:
Thanks a lot :)
One more thing
Maybe try reducing spark.executor.cores.
Perhaps your tasks have a large off-heap overhead, and it would be better to
have fewer tasks running in parallel.
Is it a streaming job?
On Tue, Aug 4, 2015 at 2:14 PM Igor Berman igor.ber...@gmail.com wrote:
sorry, can't disclose info about my prod cluster
nothing jumps
Yes it does; in fact, it's probably going to be one of the more expensive
shuffles you could trigger.
On Mon, Aug 3, 2015 at 12:56 PM, Meihua Wu rotationsymmetr...@gmail.com
wrote:
Does RDD.cartesian involve shuffling?
Thanks!
Hi Guru,
It was a no-brainer issue. I had to create the HDFS user ec2-user to make it
work. It worked like a charm after that.
Thanks
Upender
On Mon, Aug 3, 2015 at 10:27 PM, Guru Medasani gdm...@gmail.com wrote:
Hi Upen,
Did you deploy the client configs after assigning the gateway roles? You
One option is to use the coalesce method in the RDD class.
Mohammed
From: Brandon White [mailto:bwwintheho...@gmail.com]
Sent: Tuesday, August 4, 2015 7:23 PM
To: user
Subject: Combining Spark Files with saveAsTextFile
What is the best way to make saveAsTextFile save as only a single file?
Just to further clarify, you can first call coalesce with argument 1 and then
call saveAsTextFile. For example,
rdd.coalesce(1).saveAsTextFile(...)
Mohammed
From: Mohammed Guller
Sent: Tuesday, August 4, 2015 9:39 PM
To: 'Brandon White'; user
Subject: RE: Combining Spark Files with
Hi,
Does anyone know how I could control the number of reducers for an operation
such as groupBy on a DataFrame?
I can set spark.sql.shuffle.partitions in SQL, but I am not sure how to do it
with the df.groupBy(XX) API.
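For reference (not from the thread), a sketch of the setConf route in Spark 1.x, which also applies to DataFrame aggregations:

sqlContext.setConf("spark.sql.shuffle.partitions", "16")  // 16 is illustrative
val counts = df.groupBy("XX").count()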
Thanks,
Mike
How do you turn off gz compression when saving as text files? Right now, I am
reading .gz files and it is saving them as .gz. I would love to not
compress them when I save.
1) DStream.saveAsTextFiles() //no compression
2) RDD.saveAsTextFile() //no compression
Any ideas?
The .gz extension indicates that the file is compressed with gzip. Choose a
different extension (e.g. .txt) when you save them.
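For reference, with the RDD API compression is opt-in via the codec overload; a minimal sketch (paths made up):

import org.apache.hadoop.io.compress.GzipCodec

rdd.saveAsTextFile("/out/plain")                        // no codec: uncompressed part files
rdd.saveAsTextFile("/out/gzipped", classOf[GzipCodec])  // compressed only when a codec is passed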
On Tue, Aug 4, 2015 at 7:00 PM, Brandon White bwwintheho...@gmail.com
wrote:
How do you turn off gz compression when saving as text files? Right now, I
am reading .gz
The old MLlib API uses RandomForest.trainClassifier() to train a
RandomForestModel;
the new MLlib API (a.k.a. ML) uses RandomForestClassifier.train() to train
a RandomForestClassificationModel.
They will produce the same result for a given dataset.
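For reference, a sketch of the old-API call (data is an RDD[LabeledPoint]; all parameter values are illustrative):

import org.apache.spark.mllib.tree.RandomForest

val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)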
2015-07-31 1:34 GMT+08:00 Bryan Cutler
And also https://issues.apache.org/jira/browse/SPARK-3106
This one is still open.
On Tue, Aug 4, 2015 at 6:12 PM, Jim Green openkbi...@gmail.com wrote:
*Symptom:*
Even a sample job fails:
$ MASTER=spark://xxx:7077 run-example org.apache.spark.examples.SparkPi 10
Pi is roughly 3.140636
ERROR
Hi Spark users and developers,
I have been trying to use spark-ec2. After I launched the Spark cluster
(1.4.1) with ephemeral HDFS (using Hadoop 2.4.0), I tried to execute a job
where the data is stored in the ephemeral HDFS. No matter what I tried to do,
there was no data locality at
This should have been fixed by:
[SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes
multi-database support
The fix is in the upcoming 1.5.0
FYI
On Tue, Aug 4, 2015 at 11:45 AM, Mohammed Guller moham...@glassbeam.com
wrote:
Hi -
I am running the Thrift JDBC/ODBC
The code is very simple, just a couple of lines. When I launch it, it runs
locally but not on the cluster.
sc = SparkContext("local", "Tech Companies Feedback")
beginning_time = datetime.now()
time.sleep(60)
print datetime.now() - beginning_time
sc.stop()
Thanks for your interest,
Gonzalo
On Fri,
Elkhan,
Thank you for the response. This was a great answer.
On Tue, Aug 4, 2015 at 1:47 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:
Hi Connor,
Spark creates a cached thread pool in Executor
Think it may be needed on Windows, certainly if you start trying to work with
local files.
On 4 Aug 2015, at 00:34, Sean Owen so...@cloudera.com wrote:
It won't affect you if you're not actually running Hadoop. But it's
mainly things like Snappy/LZO compression which are implemented as
I'll add that while Spark SQL 1.5 compiles against Hive 1.2.1, it has
support for reading from metastores for Hive 0.12 - 1.2.1
On Tue, Aug 4, 2015 at 9:59 AM, Steve Loughran ste...@hortonworks.com
wrote:
Spark 1.3.1 and 1.4 only support Hive 0.13
Spark 1.5 is going to be released against Hive
Hi -
I am running the Thrift JDBC/ODBC server (v1.4.1) and encountered a problem
when querying tables using fully qualified table names (schemaName.tableName).
The following query works fine from the beeline tool:
SELECT * from test;
However, the following query throws an exception, even
Thanks, Richard!
I basically have two RDDs, A and B, and I need to compute a value for
every pair (a, b) with a in A and b in B. My first thought is
cartesian, but it involves an expensive shuffle.
Any alternatives? How about I convert B to an array and broadcast it
to every node (assuming B is
My application takes Twitter4j tweets and publishes them to a topic in
Kafka. Spark Streaming subscribes to that topic for processing. In practice,
though, Spark Streaming is not able to receive tweet data from Kafka, so it
is running empty batch jobs with no input, and I am not able
to see
Yes, I rechecked and the label is correct. As you can see in the posted code,
I used the exact same features for all the classifiers, so unless RF somehow
switches the labels, it should be correct.
I have posted a sample dataset and sample code to reproduce what I'm
getting here:
Have you tried using the console consumer to see if anything is actually
getting published to that topic?
On Tue, Aug 4, 2015 at 11:45 AM, narendra narencs...@gmail.com wrote:
My application takes Twitter4j tweets and publishes those to a topic in
Kafka. Spark Streaming subscribes to that
Hi,
I think the combination of MongoDB and Spark is a little bit unlucky.
Why don't you simply use MongoDB?
If you want to process a lot of data you should use HDFS or Cassandra as
storage. MongoDB is not suitable for heterogeneous processing of large-scale
data.
Best regards,
Hi,
it is possible to control the number of partitions of the RDD without
calling repartition by setting the max split size for the Hadoop input
format used. Tracing through the code, XmlInputFormat extends
FileInputFormat, which determines the number of splits (which NewHadoopRdd
uses to
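A sketch of that approach (the path and the 16 MB cap are illustrative; XmlInputFormat is the Mahout-style input format referred to above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}

val hadoopConf = new Configuration()
// Cap the split size so FileInputFormat produces more splits, hence more partitions.
hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize",
  (16 * 1024 * 1024).toString)
val records = sc.newAPIHadoopFile(
  "/data/input.xml", classOf[XmlInputFormat],
  classOf[LongWritable], classOf[Text], hadoopConf)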
Ideally the two messages read from Kafka must differ in at least some
parameter; otherwise they are logically the same.
As a solution to your problem, if the message content is the same, you could
create a new UUID field, which might play the role of the partition key while
inserting the two messages into Cassandra.
Msg1
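The example is cut off above; a minimal sketch of the UUID idea (the message type and field names are made up):

import java.util.UUID

case class KafkaMsg(id: UUID, body: String)
// Give each (possibly identical) message its own id to use in the partition key.
val keyed = messages.map(body => KafkaMsg(UUID.randomUUID(), body))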
That is the only alternative I'm aware of. If either A or B is small
enough to broadcast, then you'd at least be doing the cartesian products all
locally, without needing to also transmit and shuffle A. Unless Spark
somehow optimizes the cartesian product and only transfers the smaller RDD
across the
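A sketch of that broadcast alternative (compute is a stand-in for whatever you evaluate per pair; it assumes B fits in memory):

val bLocal = rddB.collect()           // materialize the smaller RDD on the driver
val bBroadcast = sc.broadcast(bLocal) // ship it once to every node
val results = rddA.flatMap { a =>
  bBroadcast.value.map(b => compute(a, b))  // all pairing happens locally
}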
Spark 1.3.1 and 1.4 only support Hive 0.13.
Spark 1.5 is going to be released against Hive 1.2.1; it'll skip Hive 0.14
support entirely and go straight to the currently supported Hive release.
See SPARK-8064 for the gory details
On 3 Aug 2015, at 23:01, Ishwardeep Singh
Hi Connor,
Spark creates a cached thread pool in Executor
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala
for executing the tasks:
// Start worker thread pool
private val threadPool = ThreadUtils.newDaemonCachedThreadPool("Executor task
Oh good point, does the Windows integration need native libs for
POSIX-y file system access? I know there are some binaries shipped for
this purpose but wasn't sure if that's part of what's covered in the
native libs message.
On Tue, Aug 4, 2015 at 6:01 PM, Steve Loughran ste...@hortonworks.com
You'd need to provide information such as the executor configuration (#cores,
memory size). You might have less scheduler delay with smaller but more
numerous executors than with the opposite.
--
View this message in context:
I think you did a good job of summarizing the terminology and describing
Spark's operation. However, #7 is inaccurate, if I am interpreting it
correctly. The scheduler schedules X tasks from the current stage across all
executors, where X is the number of cores assigned to the application
(assuming
What is the best way to make saveAsTextFile save as only a single file?
*Symptom:*
Even a sample job fails:
$ MASTER=spark://xxx:7077 run-example org.apache.spark.examples.SparkPi 10
Pi is roughly 3.140636
ERROR ConnectionManager: Corresponding SendingConnection to
ConnectionManagerId(xxx,) not found
WARN ConnectionManager: All connections not cleaned up