What's the current status on adding slaves to a running cluster? I want to
leverage spark-ec2 and autoscaling groups. I want to launch slaves as spot
instances when I need to do some heavy lifting, but I don't want to bring
down my cluster in order to add nodes.
Can this be done by just running
Thanks, Cody, for the very useful information.
It's much clearer to me now. I had a lot of wrong assumptions.
On Nov 23, 2015 10:19 PM, "Cody Koeninger" wrote:
> Partitioner is an optional field when defining an rdd. KafkaRDD doesn't
> define one, so you can't really assume
Hi Abhi,
You should be able to register an
org.apache.spark.streaming.scheduler.StreamingListener.
There is an example here that may help:
https://gist.github.com/akhld/b10dc491aad1a2007183 and the spark api docs
here,
Hi Abhi,
Sorry, that was the wrong link; it should have been the StreamingListener,
http://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/scheduler/StreamingListener.html
The BatchInfo can be obtained from the event, for example:
public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
    BatchInfo batchInfo = batchCompleted.batchInfo();
    // ...
}
Hello,
I am using Spark 1.4.1 with Zeppelin. When using the Kryo serializer,
spark.serializer = org.apache.spark.serializer.KryoSerializer
instead of the default Java serializer, I am getting the following error. Is
this a known issue?
Thanks,
Piero
java.io.IOException: Failed to connect to
Hi,
Does the receiver-based approach lose any data in case of a leader/broker loss
in Spark Streaming? We currently use Kafka Direct for Spark Streaming, and it
seems to be failing when there is a leader loss, and we can't really
guarantee that there won't be any leader loss due to rebalancing.
If
The direct stream shouldn't silently lose data in the case of a leader
loss. Loss of a leader is handled like any other failure, retrying
up to spark.task.maxFailures
times.
But really if you're losing leaders and taking that long to rebalance you
should figure out what's wrong with your
Hi,
As a beginner, I have the below queries on Spork (Pig on Spark).
I have cloned git clone https://github.com/apache/pig -b spark .
1. On which versions of Pig and Spark is Spork being built?
2. I followed the steps mentioned in
https://issues.apache.org/jira/browse/PIG-4059 and tried to
Log files content :
Pig Stack Trace
---
ERROR 2998: Unhandled internal error. Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
at
I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster
with 16.73 TB of storage, using
distcp. The dataset is a collection of tar files of about 1.7 TB each.
Nothing else was stored in HDFS, but after completing the download, the
namenode page says that 11.59 TB are in use.
What is your HDFS replication set to?
On Wed, Nov 25, 2015 at 1:31 AM, AlexG wrote:
> I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2
> cluster
> with 16.73 Tb storage, using
> distcp. The dataset is a collection of tar files of about 1.7 Tb each.
>
Hi AlexG:
Files (blocks, more specifically) have 3 copies on HDFS by default. So 3.8 * 3 =
11.4 TB.
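A quick sanity check of that arithmetic in plain Python, assuming the HDFS default replication factor of 3 (dfs.replication):

```python
# Logical dataset size times the replication factor gives the raw HDFS usage.
dataset_tb = 3.8
replication = 3  # HDFS default (dfs.replication)
raw_tb = dataset_tb * replication
print(round(raw_tb, 1))  # 11.4 -- close to the 11.59 TB the namenode reports
```

The small remaining gap is plausibly filesystem metadata and block padding.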
--
Ye Xianjin
On Wednesday, November 25, 2015 at 2:31 PM, AlexG wrote:
> I downloaded a 3.8 T dataset from S3 to a freshly launched
Hi,
I am using Hive 1.1.0 and Spark 1.5.1 and creating hive context in
spark-shell.
Now, I am seeing worse performance from Spark SQL than from Hive.
By default, Hive gives results back in 27 seconds for a plain select * query on
a 1 GB dataset containing 3,623,203 records, while spark-sql gives back
Hi All,
Hi, just an update on this case.
I tried many different combinations of settings (and I just upgraded to the
latest EMR 4.2.0 with Spark 1.5.2).
I just found out that the problem is from:
spark-submit --deploy-mode client --executor-cores=24 --driver-memory=5G
Ref:https://issues.apache.org/jira/browse/SPARK-11953
In Spark 1.3.1 we have 2 methods, i.e. createJDBCTable and insertIntoJDBC.
They are replaced with write.jdbc() in Spark 1.4.1.
createJDBCTable allows performing CREATE TABLE ... i.e. DDL on the table,
followed by INSERT (DML).
insertIntoJDBC
So basically, writing them into a temporary directory named with the
batch time and then moving the files to their destination on success? I
wish there were a way to skip moving files around and be able to set
the output filenames.
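The move-on-success idea in miniature, with plain local files (a hypothetical helper; HDFS commit semantics differ in detail, but the shape is the same):

```python
import os
import tempfile

def write_atomically(path, data):
    # Write to a temp file in the destination directory, then rename on
    # success; readers never observe a partially written file.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp, path)  # atomic rename to the final filename
    except Exception:
        os.remove(tmp)
        raise
```

The rename step is why the output-committer pattern moves files around: the final name only ever points at complete output.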
Thanks Burak :)
-Michael
On Mon, Nov 23, 2015, at 09:19 PM,
I tried increasing spark.shuffle.io.maxRetries to 10 but it didn't help.
This is the exception that I am getting:
[MySparkApplication] WARN : Failed to execute SQL statement select *
from TableS s join TableC c on s.property = c.property from X YZ
org.apache.spark.SparkException: Job
First of all, select * is not a useful SQL to evaluate. Very rarely would a
user require all 362K records for visual analysis.
Second, collect() forces movement of all data from executors to the driver.
Instead write it out to some other table or to HDFS.
Also Spark is more beneficial when you
Hi guys,
This may be a stupid question, but I'm facing an issue here.
I found the class BinaryClassificationMetrics and I wanted to compute the
aucROC or aucPR of my model.
The thing is that the predict method of a LogisticRegressionModel only
returns the predicted class, and not the
Hi Prem,
Thank you for the details. I'm not able to build; I'm facing some issues.
Is there any repository link where I can download the (preview version of) 1.6
spark-core_2.11 and spark-sql_2.11 jar files?
Regards,
Rajesh
On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure
See:
http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview
On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi Prem,
>
> Thank you for the details. I'm not able to build. I'm facing some issues.
>
> Any
Your reasoning is correct; you need probabilities (or at least some
score) out of the model and not just a 0/1 label in order for a ROC /
PR curve to have meaning.
But you just need to call clearThreshold() on the model to make it
return a probability.
On Tue, Nov 24, 2015 at 5:19 PM, jmvllt
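To see why a 0/1 label is not enough, here is a toy AUC computation in plain Python (hypothetical data, not MLlib's implementation): AUC is the probability that a random positive example outranks a random negative one, which is only meaningful when the model emits a score.

```python
def auc(scores, labels):
    # Rank-statistic form of AUC: fraction of (positive, negative) pairs
    # where the positive example gets the higher score (ties count half).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Probabilities rank the examples, so the curve (and AUC) is informative:
print(auc([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0]))  # 0.75
```

With hard 0/1 predictions, every positive gets the same "score", so there is only one operating point and no curve to trace, which is exactly why clearThreshold() is needed.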
Hi Ted,
I'm not able to find the spark-core_2.11 and spark-sql_2.11 jar files in the
above link.
Regards,
Rajesh
On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu wrote:
> See:
>
> http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview
>
> On
Thanks for mentioning the build requirement.
But actually it is -Dscala-2.11 (i.e. -D for a Java property instead of
-P for a profile).
details:
We can see this in the pom.xml:
<profile>
  <id>scala-2.11</id>
  <activation>
    <property><name>scala-2.11</name></property>
  </activation>
  <properties>
    <scala.version>2.11.7</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
  </properties>
</profile>
So the scala-2.11
Hi Sabarish
Thanks for the suggestion. I did not know about wholeTextFiles().
By the way, your suggestion about repartitioning was spot on! My run
time for count() went from elapsed time 0:56:42.902407 to elapsed
time 0:00:03.215143 on a data set of about 34M of 4720 records.
Andy
From:
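For what it's worth, those two timings work out to roughly a thousand-fold speedup, computed with the stdlib:

```python
from datetime import timedelta

# The two elapsed times quoted above, before and after repartitioning.
before = timedelta(minutes=56, seconds=42.902407)
after = timedelta(seconds=3.215143)
print(round(before / after))  # roughly a 1058x speedup
```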
Hi Don
I went to a presentation given by Professor Ion Stoica. He mentioned that
Python was a little slower in general because of the type system. I do not
remember all of his comments. I think the context had to do with spark SQL
and data frames.
I wonder if the python issue is similar to the
Is it possible that the Kafka offset API is somehow returning the wrong
offsets? Each time, the job fails for different partitions with an
error similar to the one below.
Job aborted due to stage failure: Task 20 in stage 117.0 failed 4 times,
most recent failure: Lost task
Anything's possible, but that sounds pretty unlikely to me.
Are the partitions it's failing for all on the same leader?
Have there been any leader rebalances?
Do you have enough log retention?
If you log the offset for each message as it's processed, when do you see
the problem?
On Tue, Nov 24,
There is no code path in the script /root/spark-ec2/spark/init.sh that can
actually get to the version of Spark 1.5.2 pre-built with Hadoop 2.6. I
think the 2.4 version includes Hive as well... but setting the Hadoop major
version to 2 won't actually get you there.
Sigh. The documentation is the
Hi Madabhattula
Scala 2.11 requires building from source. Prebuilt binaries are
available only for Scala 2.10.
From the src folder:
dev/change-scala-version.sh 2.11
Then build as you would normally, either from mvn or sbt.
The above info *is* included in the Spark docs but a little hard
See also:
https://repository.apache.org/content/repositories/orgapachespark-1162/org/apache/spark/spark-core_2.11/v1.6.0-preview2/
w.r.t. building locally, please specify -Pscala-2.11
Cheers
On Tue, Nov 24, 2015 at 9:58 AM, Stephen Boesch wrote:
> HI Madabhattula
>
I see the assertion error when I compare the offset ranges as shown below.
How do I log the offset for each message?
kafkaStream.transform { rdd =>
  // Get the offset ranges in the RDD
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}
I think you could have a Python UDF to turn the properties into a JSON string:
import simplejson
def to_json(row):
    return simplejson.dumps(row.asDict(recursive=True))
to_json_udf = pyspark.sql.functions.udf(to_json)
df.select("col_1", "col_2",
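A plain-Python sanity check of the serialization step (no Spark needed; the stdlib json module stands in for simplejson, and a plain dict stands in for row.asDict(recursive=True)):

```python
import json

def to_json(record):
    # record plays the role of row.asDict(recursive=True)
    return json.dumps(record, sort_keys=True)

print(to_json({"col_1": 1, "properties": {"a": "x", "b": 2}}))
# {"col_1": 1, "properties": {"a": "x", "b": 2}}
```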
OK, yarn.scheduler.maximum-allocation-mb is 16384.
I have run it again; the command to run it is:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster --driver-memory 4g --executor-memory 8g
lib/spark-examples*.jar 200
>
>
> 15/11/24 16:15:56 INFO
If YARN has only 50 cores then it can support at most 49 executors plus 1
driver application master.
Regards
Sab
On 24-Nov-2015 1:58 pm, "谢廷稳" wrote:
> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>
> I have ran it again, the command to run it is:
> ./bin/spark-submit
Did you set the configuration "spark.dynamicAllocation.initialExecutors"?
You can set spark.dynamicAllocation.initialExecutors to 50 and try again.
I guess you might be hitting this issue since you're running 1.5.0:
https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot
explain
The relevant error lines are:
Caused by: parquet.io.ParquetDecodingException: Can't read value in
column [roll_key] BINARY at value 19600 out of 4814, 19600 out of
19600 in currentPage. repetition level: 0, definition level: 1
Caused by: org.apache.spark.SparkException: Job aborted due to stage
@Sab Thank you for your reply, but the cluster has 6 nodes which contain
300 cores, and the Spark application did not request resources from YARN.
@SaiSai I have run it successfully with
"spark.dynamicAllocation.initialExecutors" equal to 50, but in
The document is right. Because of a bug introduced in
https://issues.apache.org/jira/browse/SPARK-9092, this
configuration fails to work.
It is fixed in https://issues.apache.org/jira/browse/SPARK-10790; you could
change to a newer version of Spark.
On Tue, Nov 24, 2015 at 5:12 PM, 谢廷稳
Not sure who generally handles that, but I just made the edit.
On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote:
> Sorry to be a nag, I realize folks with edit rights on the Powered by Spark
> page are very busy people, but its been 10 days since my original request,
>
Hi,
When I create a stream with KafkaUtils.createDirectStream I can explicitly
define the position, "largest" or "smallest", from which to read the topic.
What if I have previous checkpoints (in HDFS, for example) with offsets, and I
want to start reading from the last checkpoint?
In the source code of
Thank you very much; after changing to a newer version, it worked well!
2015-11-24 17:15 GMT+08:00 Saisai Shao :
> The document is right. Because of a bug introduce in
> https://issues.apache.org/jira/browse/SPARK-9092 which makes this
> configuration fail to work.
>
> It
Thanks Christopher, I will try that.
Dan
On 20 November 2015 at 21:41, Bozeman, Christopher
wrote:
> Dan,
>
>
>
> Even though you may be adding more nodes to the cluster, the Spark
> application has to be requesting additional executors in order to thus use
> the added
Hello, I have a question about the radix tree (PART) implementation in Spark,
IndexedRDD.
I explored the source code and found out that the radix tree used in
IndexedRDD only returns exact matches. However, that seems to be a
restricted use;
for example, I want to find children nodes using a prefix.
This is what a radix tree returns
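What the poster wants is prefix search on top of exact lookup; a minimal trie sketch in plain Python (an illustration of the idea, not IndexedRDD's PART structure):

```python
class Trie:
    def __init__(self):
        self.children, self.value = {}, None

    def put(self, key, value):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.value = value

    def prefix_search(self, prefix):
        # Descend to the node for the prefix, then collect every key below it.
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out = []
        def collect(n, suffix):
            if n.value is not None:
                out.append(prefix + suffix)
            for ch, child in n.children.items():
                collect(child, suffix + ch)
        collect(node, "")
        return out

t = Trie()
for k in ["spark", "spar", "scala"]:
    t.put(k, True)
print(sorted(t.prefix_search("spa")))  # ['spar', 'spark']
```

An exact-match lookup is the special case where the descent ends on a node holding a value; prefix search just keeps walking the subtree below that node.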
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/indexedrdd-and-radix-tree-how-to-search-indexedRDD-using-all-prefixes-tp25459p25460.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I just updated the page to say "email dev" instead of "email user".
On Tue, Nov 24, 2015 at 1:16 AM, Sean Owen wrote:
> Not sure who generally handles that, but I just made the edit.
>
> On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote:
> > Sorry to
Hi,
I'm not able to build Spark 1.6 from source. Could you please share the
steps to build Spark 1.6?
Regards,
Rajesh
I see; then this is actually irrelevant to Parquet. I guess we can support
Joda DateTime in Spark SQL's reflective schema inference to have this,
provided that this is a frequent use case, and Spark SQL already has Joda
as a direct dependency.
On the other hand, if you are using Scala, you can
Cheng,
I am using Scala. I have an implicit conversion from Joda DateTime to
Timestamp. My tables are defined with Timestamp. However, explicit conversion
appears to be required. Do you have an example of an implicit conversion for
this case? Do you convert on insert or on RDD-to-DF conversion?
Hi,
If you wish to read from checkpoints, you need to use
StreamingContext.getOrCreate(checkpointDir, functionToCreateContext) to
create the streaming context that you pass in to
KafkaUtils.createDirectStream(...). You may refer to
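The getOrCreate idea in miniature, in plain Python (a hypothetical sketch; StreamingContext.getOrCreate does this with streaming checkpoint data, not pickle): restore state when a checkpoint exists, otherwise build it fresh and save it.

```python
import os
import pickle

def get_or_create(checkpoint_path, create_fn):
    # Same contract as StreamingContext.getOrCreate: restore the saved
    # state if a checkpoint exists, otherwise build it fresh and persist it.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            return pickle.load(f)
    state = create_fn()
    with open(checkpoint_path, "wb") as f:
        pickle.dump(state, f)
    return state
```

The key point for the original question: on restart, create_fn is skipped entirely, so the "largest"/"smallest" starting position only applies to the very first run.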
Cheng,
That’s exactly what I was hoping for – native support for writing DateTime
objects. As it stands Spark 1.5.2 seems to leave no option but to do manual
conversion (to nanos, Timestamp, etc) prior to writing records to hive.
Regards,
Bryan Jeffrey
Sent from Outlook Mail
From: Cheng
Great, thank you.
Sorry for being so inattentive) Need to read docs carefully.
24.11.2015, 15:15, "Deng Ching-Mallete" :
> Hi,
>
> If you wish to read from checkpoints, you need to use
>
You can refer to:
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:
> Hi,
>
> I'm not able to build Spark 1.6 from source. Could you please
You should consider using HBase as the NoSQL database.
w.r.t. 'The data in the DB should be indexed', you need to design the
schema in HBase carefully so that the retrieval is fast.
Disclaimer: I work on HBase.
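A hypothetical sketch of the kind of row-key design that makes HBase retrieval fast: HBase sorts rows lexicographically by key, so encode the access pattern into the key itself, e.g. an entity-id prefix plus a zero-padded reversed timestamp so a prefix scan returns the newest rows first.

```python
LONG_MAX = 2**63 - 1  # mirrors Java's Long.MAX_VALUE

def row_key(entity_id, ts):
    # Zero-padded reversed timestamp: later events get lexicographically
    # smaller suffixes, so a scan on the entity_id prefix sees newest first.
    return f"{entity_id}#{LONG_MAX - ts:019d}"

keys = sorted([row_key("user42", 100), row_key("user42", 200)])
print(keys[0] == row_key("user42", 200))  # True: newest event sorts first
```

The field names and separator here are made up for illustration; the real design depends on the query pattern ("the data should be indexed" becomes "the key order is the index").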
On Tue, Nov 24, 2015 at 4:46 AM, sparkuser2345
wrote:
>
Thank you Sean, much appreciated.
And yes, perhaps "email dev" is a better option since the traffic is
(probably) lighter and these sorts of requests are more likely to get
noticed. Although one would need to subscribe to the dev list to do that...
-sujit
On Tue, Nov 24, 2015 at 1:16 AM, Sean