When you are doing sc.textFile() you can actually pass a second
parameter, which is the number of partitions.
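For example (a minimal sketch; the path and the partition count are placeholder values):

// request at least 500 partitions when reading the input
val rdd = sc.textFile("hdfs:///data/input", 500)
println(rdd.partitions.length)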
Thanks
Best Regards
On Fri, Jun 26, 2015 at 12:40 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
How can I increase the number of tasks from 174 to 500 without running
repartition?
Nope, I have not, but I'm glad I'm not the only one :p
On Fri, Jun 26, 2015 at 07:54, Tao Li litao.bupt...@gmail.com wrote:
Hi Olivier, have you fixed this problem now? I still have this fasterxml
NoSuchMethodError.
2015-06-18 3:08 GMT+08:00 Olivier Girardot
o.girar...@lateral-thoughts.com:
Hi, all
For now, Spark is based on Hadoop. I want to use a database cluster instead of
Hadoop, so the data would be distributed across the databases in the cluster.
I want to know if Spark is suitable for this situation?
Any ideas will be appreciated!
hi,
now I'm doing something like this on a data frame to make use of table
partitioning
df.filter($"sex" === "male").write.parquet("path/to/table/sex=male")
df.filter($"sex" === "female").write.parquet("path/to/table/sex=female")
this will filter the dataset multiple times; is there a better way to do this?
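One possible alternative, as a sketch, assuming Spark 1.4+ where DataFrameWriter.partitionBy is available:

// write once; Spark lays out path/to/table/sex=male and path/to/table/sex=female itself
df.write.partitionBy("sex").parquet("path/to/table")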
Thanks for the replies, guys.
Is this a permanent change as of 1.3, or will it go away at some point? Also,
does it require an entire Hadoop installation, or just WinUtils.exe?
Thanks, Ashic.
Date: Fri, 26 Jun 2015 18:22:03 +1000
Subject: Re: Recent spark sc.textFile needs hadoop for folders?!?
Hi,
I'm having trouble reading Bzip2 compressed sequence files after I
enabled hadoop native libraries in spark.
Running
LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/ $SPARK_HOME/bin/spark-submit
--class gives the following error
15/06/26 00:48:02 INFO CodecPool: Got brand-new decompressor
Try adding them to the SPARK_CLASSPATH in your conf/spark-env.sh file.
Thanks
Best Regards
On Thu, Jun 25, 2015 at 9:31 PM, Bin Wang binwang...@gmail.com wrote:
I am trying to run the Spark example code HBaseTest from the command line
using spark-submit instead of run-example; in that case, I can
You just need to set your HADOOP_HOME, which appears to be null in the
stack trace. If you don't have winutils.exe, then you can download
https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip
and put it there.
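If setting the environment variable is awkward, a commonly used alternative is to point the hadoop.home.dir system property at the unpacked folder before creating the SparkContext. A minimal sketch, assuming the archive above was unpacked to C:\hadoop with winutils.exe under bin\ (the path is a placeholder):

// must be the folder whose bin\ contains winutils.exe
System.setProperty("hadoop.home.dir", "C:\\hadoop")
val sc = new org.apache.spark.SparkContext("local[*]", "winutils-check")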
Thanks
Best Regards
On Thu, Jun 25, 2015 at 11:30 PM, Ashic
Yes, Spark Core depends on Hadoop libs, and there is this unfortunate
twist on Windows. You'll still need HADOOP_HOME set appropriately
since Hadoop needs some special binaries to work on Windows.
On Fri, Jun 26, 2015 at 11:06 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You just need to set
Which distributed database are you referring to here? Spark can connect to
almost all of the databases out there (you just need to pass the
Input/Output Format classes, or there are also a bunch of connectors
available).
Thanks
Best Regards
On Fri, Jun 26, 2015 at 12:07 PM, louis.hust
It's been a problem since 1.3, I think
On 26 Jun 2015 04:00, Ashic Mahtab as...@live.com wrote:
Hello,
Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've
noticed the following:
* On 1.4, sc.textFile("D:\\folder\\").collect() fails from both
spark-shell.cmd and when running a
OK, but what does it mean? I did not change the core files of Spark, so is
it a bug there?
PS: on small datasets (500 MB) I have no problem.
Am 25.06.2015 18:02 schrieb Ted Yu yuzhih...@gmail.com:
The assertion failure from TriangleCount.scala corresponds with the
following lines:
On 26 Jun 2015, at 09:29, Ashic Mahtab as...@live.com
wrote:
Thanks for the replies, guys.
Is this a permanent change as of 1.3, or will it go away at some point?
Don't blame the spark team, complain to the hadoop team for being slow to
embrace the java 1.7 APIs for
Why do you want to do that?
Thanks
Best Regards
On Thu, Jun 25, 2015 at 10:16 PM, shahab shahab.mok...@gmail.com wrote:
Hi,
Apparently, the sc.parallelize(..) operation is performed in the driver
program, not in the workers! Is it possible to do this in the worker process
for the sake of
It's a Scala version conflict; can you paste your build.sbt file?
Thanks
Best Regards
On Fri, Jun 26, 2015 at 7:05 AM, stati srikanth...@gmail.com wrote:
Hello,
When I run a Spark job with spark-submit it fails with the below exception for
the code line
/*val webLogDF =
Hmm, not sure why, but when I run this code, it always keeps on consuming from
Kafka and proceeds, ignoring the previous failed batches.
Also, now that I get the attempt number from TaskContext and I have information
on max retries, I am supposed to handle it in the try/catch block, but does it
Hi,
How do we read files from multiple directories using newApiHadoopFile () ?
Thanks,
Baahu
--
Twitter:http://twitter.com/Baahu
See this related thread:
http://search-hadoop.com/m/q3RTtiYm8wgHego1
On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain bahub...@gmail.com wrote:
Hi,
How do we read files from multiple directories using newApiHadoopFile () ?
Thanks,
Baahu
--
Twitter:http://twitter.com/Baahu
Forgot to mention: rank of 100 usually works ok, 120 consistently cannot
finish.
On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody rmody...@gmail.com wrote:
1. These are my settings:
rank = 100
iterations = 12
users = ~20M
items = ~2M
training examples = ~500M-1B (I'm running into the issue even
My imports:
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.Schema
import org.apache.hadoop.io.NullWritable
import org.apache.avro.mapreduce.AvroKeyInputFormat
import
No, if you have a bad message that you are continually throwing exceptions
on, your stream will not progress to future batches.
On Fri, Jun 26, 2015 at 10:28 AM, Amit Assudani aassud...@impetus.com
wrote:
Also, what I understand is, max failures doesn’t stop the entire stream,
it fails the
Thank you, Jie! Very nice work!
--
Nan Zhu
http://codingcat.me
On Friday, June 26, 2015 at 8:17 AM, Huang, Jie wrote:
Correct. Your calculation is right!
We have also been aware of that k-means performance drop. According to our
observation, it is caused by some unbalanced
Hi ,
If I have the below data format, how can I use Kafka direct stream to
de-serialize it, as I am not able to understand all the parameters I need to
pass. Can someone explain what the arguments will be, as I am not clear
about this.
JavaPairInputDStream
How do I use dependency injection with Spark in Java? Could you please point me
to any articles/frameworks?
Thanks!
Good morning,
I am having a bit of trouble finalizing the installation and usage of the
newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using
RStudio to run on top of it.
Using these instructions (
http://spark.apache.org/docs/latest/ec2-scripts.html
Note that this problem is probably NOT caused directly by GraphX, but
GraphX reveals it because as you go further down the iterations, you get
further and further away from a shuffle you can rely on.
On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Hello,
We run
We don't have a documented way to use RStudio on EC2 right now. We have a
ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss
work-arounds and potential solutions for this.
Thanks
Shivaram
On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark m...@redoakstrategic.com
wrote:
Good
I understand that Kerberos support for accessing Hadoop resources in Spark only
works when running Spark on YARN. However, I'd really like to hack something
together for Spark on Mesos running alongside a secured Hadoop cluster. My
simplified application (gist:
OK, here’s how I did it, using just the built-in Avro libraries with Spark 1.3:
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import
Comma-separated paths work only with Spark 1.4 and up
2015-06-26 18:56 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com:
You can comma separate them or use globbing patterns
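For example (a minimal sketch with placeholder paths; shown with textFile, the same path strings work for the Hadoop-file APIs subject to the version note above):

// comma-separated list of directories
val rdd1 = sc.textFile("/data/2015-06-24,/data/2015-06-25,/data/2015-06-26")
// or a globbing pattern that matches several directories
val rdd2 = sc.textFile("/data/2015-06-*")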
2015-06-26 18:54 GMT+02:00 Ted Yu yuzhih...@gmail.com:
See this related thread:
Hi,
I am having this problem on Spark 1.4. Do you have any ideas on how to
solve it? I tried to use spark.executor.extraClassPath, but it did not help.
BR,
Patcharee
On 04. mai 2015 23:47, Imran Rashid wrote:
Oh, this seems like a real pain. You should file a jira, I didn't see
an open issue
What master are you using? If this is not a local master, you'll need to
set LD_LIBRARY_PATH on the executors also (using
spark.executor.extraLibraryPath).
If you are using local, then I don't know what's going on.
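A sketch of one way to set it (the native-library path is a placeholder for your own cluster layout):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraLibraryPath", "/usr/lib/hadoop/lib/native")
// equivalently on the command line:
// spark-submit --conf spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native ...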
On Fri, Jun 26, 2015 at 1:39 AM, Arunabha Ghosh arunabha...@gmail.com
wrote:
Hi Dave,
I don't understand Kerberos much, but if you know the exact steps that need to
happen, I can see how we can make that happen with the Spark framework.
Tim
On Jun 26, 2015, at 8:49 AM, Dave Ariens dari...@blackberry.com wrote:
I understand that Kerberos support for accessing
Yeah, I ask because you might notice that by default the column types for CSV
tables read in by read.df() are only strings (due to limitations in type
inferencing in the DataBricks package). There was a separate discussion about
schema inferencing, and Shivaram recently merged support for
My question is: why are there two similar parameters, String.class and
StringDecoder.class? What is the difference between each of them?
Ashish
On Fri, Jun 26, 2015 at 8:53 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
JavaPairInputDStream<String, String> messages =
KafkaUtils.createDirectStream(
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.7.7</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
I use MLlib, not ML. Does that make a difference?
Sent from my iPhone
On Jun 26, 2015, at 7:19 AM, Ravi Mody rmody...@gmail.com wrote:
Forgot to mention: rank of 100 usually works ok, 120 consistently cannot
finish.
On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody rmody...@gmail.com
If you're consistently throwing exceptions and thus failing tasks, once you
reach max failures the whole stream will stop.
It's up to you to either catch those exceptions, or restart your stream
appropriately once it stops.
Keep in mind that if you're relying on checkpoints, and fixing the error
Problem: how do we recover from user errors (connectivity issues / storage
service down / etc.)?
Environment: Spark streaming using Kafka Direct Streams
Code Snippet:
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(kafkaTopic1));
HashMap<String, String> kafkaParams = new HashMap<String,
Hi,
I just encountered the same problem, when I run a PageRank program which has
lots of stages (iterations)…
The master was lost after my program finished.
And the issue still remains even after I increased the driver memory.
Any idea? E.g. how to increase the master memory?
Thanks.
Best,
Yifan
There is one for the key of your Kafka message and one for its value.
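In other words, the decoder classes tell Kafka how to turn the raw key bytes and value bytes into objects. A minimal Scala sketch (broker list and topic name are placeholders, and ssc is assumed to be an existing StreamingContext):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("kafkaTopic1")
// type parameters: key type, value type, key decoder, value decoder
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)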
On 26 Jun 2015 4:21 pm, Ashish Soni asoni.le...@gmail.com wrote:
My question is: why are there two similar parameters, String.class and
StringDecoder.class? What is the difference between each of them?
Ashish
On Fri, Jun 26, 2015 at
I am trying the very same thing to configure the min split size with Spark
1.3.1 and I get a compilation error.
Code:
val hadoopConfiguration = new Configuration(sc.hadoopConfiguration)
hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
"67108864")
We've seen this issue as well in production. We also aren't sure what
causes it, but have just recently shaded some of the Spark code in
TaskSchedulerImpl that we use to effectively bubble up an exception from
Spark instead of hanging as a zombie in this situation. If you are interested I can go
into more
The same code of yours works for me as well.
On Fri, Jun 26, 2015 at 8:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Is it that it's not supported with Avro? Unlikely.
On Fri, Jun 26, 2015 at 8:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
wrote:
My imports:
import
Thanks for the quick response.
My question here is: how do I know that the max retries are done (because in my
code I never know whether it is the failure of the first try or the last try) and
I need to handle this message. Is there any callback?
Also, I know the limitation of checkpoints in upgrading
TaskContext has an attemptNumber method on it.
If you want to know which messages failed, you have access to the offsets,
and can do whatever you need to with them.
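A minimal sketch of checking the attempt number inside a task (rdd and handleRecord are hypothetical placeholders):

import org.apache.spark.TaskContext

rdd.foreachPartition { iter =>
  val ctx = TaskContext.get()
  // attemptNumber is 0 on the first try and increases on each retry of this task
  val isRetry = ctx != null && ctx.attemptNumber > 0
  iter.foreach(record => handleRecord(record, isRetry))
}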
On Fri, Jun 26, 2015 at 10:21 AM, Amit Assudani aassud...@impetus.com
wrote:
Thanks for quick response,
My question here is
1. These are my settings:
rank = 100
iterations = 12
users = ~20M
items = ~2M
training examples = ~500M-1B (I'm running into the issue even with 500M
training examples)
2. The memory storage never seems to go too high. The user blocks may go up
to ~10Gb, and each executor will have a few GB used
All of these throw a compilation error at newAPIHadoopFile
1)
val hadoopConfiguration = new Configuration()
hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
"67108864")
sc.newAPIHadoopFile[AvroKey, NullWritable, AvroKeyInputFormat](path +
"/*.avro", classOf[AvroKey],
Make sure you’re importing the right namespace for Hadoop v2.0. This is what I
tried:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
val hadoopConf = new org.apache.hadoop.conf.Configuration()
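For reference, a minimal sketch of how the pieces discussed above can fit together (the input path is a placeholder, and this assumes the Avro classes are on the classpath):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable

// carry the split-size setting on a Hadoop Configuration and pass it explicitly
val splitConf = new Configuration(sc.hadoopConfiguration)
splitConf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864")

val avroRdd = sc.newAPIHadoopFile(
  "/data/events/*.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  splitConf)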
Is it that it's not supported with Avro? Unlikely.
On Fri, Jun 26, 2015 at 8:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
My imports:
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import
The receiver-based kafka createStream in spark 1.2 uses zookeeper to store
offsets. If you want finer-grained control over offsets, you can update
the values in zookeeper yourself before starting the job.
createDirectStream in spark 1.3 is still marked as experimental, and
subject to change.
No, they use the same implementation.
On Fri, Jun 26, 2015 at 8:05 AM, Ayman Farahat ayman.fara...@yahoo.com wrote:
I use the mllib not the ML. Does that make a difference ?
Sent from my iPhone
On Jun 26, 2015, at 7:19 AM, Ravi Mody rmody...@gmail.com wrote:
Forgot to mention: rank of 100
Please see my comments inline. It would be helpful if you can attach
the full stack trace. -Xiangrui
On Fri, Jun 26, 2015 at 7:18 AM, Ravi Mody rmody...@gmail.com wrote:
1. These are my settings:
rank = 100
iterations = 12
users = ~20M
items = ~2M
training examples = ~500M-1B (I'm running
Also, what I understand is that max failures doesn't stop the entire stream; it
fails the job created for the specific batch, but the subsequent batches still
proceed, isn't that right? And the question still remains: how do I keep track of
those failed batches?
From: amit assudani
These are my YARN queue configurations
Queue State: RUNNING
Used Capacity: 206.7%
Absolute Used Capacity: 3.1%
Absolute Capacity: 1.5%
Absolute Max Capacity: 10.0%
Used Resources: memory: 5578496, vCores: 390
Num Schedulable Applications: 7
Num Non-Schedulable Applications: 0
Num Containers: 390
Max
Do we have any update on this thread? Has anyone encountered and solved similar
problems before?
Any pointers will be greatly appreciated!
Best,
XianXing
On Mon, Jun 15, 2015 at 11:48 PM, Jia Yu jia...@asu.edu wrote:
Hi Peng,
I got exactly the same error! My shuffle data is also very large. Have you
So for each directory you create one RDD and then union them all.
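A minimal sketch (the directory names are placeholders):

val dirs = Seq("/data/dir1", "/data/dir2", "/data/dir3")
val perDir = dirs.map(dir => sc.textFile(dir))
val combined = sc.union(perDir)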
On Fri, Jun 26, 2015 at 10:05 AM, Bahubali Jain bahub...@gmail.com wrote:
Oh... my use case is not very straightforward.
The input can have multiple directories...
On Fri, Jun 26, 2015 at 9:30 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)
Asked myself the same question today... it actually depends on what you are trying
to do.
If you want injection into worker code, I think it will be a bit hard...
If it's only in code that the driver executes, i.e. in main, it's straightforward
imho; just create your classes from the injector (e.g. Spring's application
So you created an EC2 instance with RStudio installed first, then installed
Spark under that same username? That makes sense; I just want to verify your
workflow.
Thank you again for your willingness to help!
On Fri, Jun 26, 2015 at 10:13 AM -0700, Shivaram Venkataraman
Silvio,
Thanks for your responses and patience. It worked after I reshuffled the
arguments and removed the Avro dependencies.
On Fri, Jun 26, 2015 at 9:55 AM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
OK, here’s how I did it, using just the built-in Avro libraries with
Spark 1.3:
Hello;
I checked on my partitions/storage and here is what I have:
I have 80 executors,
5 GB per executor.
Do I need to set additional params,
say cores?
spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
#
With Python Pandas, it is easy to do concatenation of dataframes
by combining pandas.concat
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html
and pandas.read_csv
pd.concat([pd.read_csv(os.path.join(Path_to_csv_files, f)) for f in
csvfiles])
where csvfiles is the list of
Thanks for the awesome response, Steve.
As you say, it's not ideal, but the clarification greatly helps.
Cheers, everyone :)
-Ashic.
Subject: Re: Recent spark sc.textFile needs hadoop for folders?!?
From: ste...@hortonworks.com
To: as...@live.com
CC: guha.a...@gmail.com; user@spark.apache.org
Thanks!
In your demo video, were you using RStudio to hit a separate EC2 Spark cluster?
I noticed from your browser that it appeared you were using EC2 at that time,
so I was just curious. It appears that might be one of the possible
workarounds - fire up a separate EC2 instance with RStudio
Hi,
I am running a Spark job over YARN; after 2-3 hours of execution, workers start
dying, and I found a lot of files at
/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1435184713615_0008/blockmgr-333f0ade-2474-43a6-9960-f08a15bcc7b7/3f
named temp_shuffle.
my job is
Hi,
The response to the below thread for making yarn-client mode work by adding
the JDBC driver JAR to spark.{driver,executor}.extraClassPath works fine.
http://mail-archives.us.apache.org/mod_mbox/spark-user/201504.mbox/%3CCAAOnQ7vHeBwDU2_EYeMuQLyVZ77+N_jDGuinxOB=sff2lkc...@mail.gmail.com%3E
No worries, glad to help! It also helped me as I had not worked directly with
the Hadoop APIs for controlling splits.
From: ÐΞ€ρ@Ҝ (๏̯͡๏)
Date: Friday, June 26, 2015 at 1:31 PM
To: Silvio Fiorito
Cc: user
Subject: Re:
Silvio,
Thanks for your responses and patience. It worked after i reshuffled
Hi,
I wanted to get some advice regarding tuning a Spark application.
I see many log entries like this for some of the tasks:
Executor task launch worker-38 ExternalAppendOnlyMap: Thread 239 spilling
in-memory map of 5.1 MB to disk (272 times so far)
(especially when inputs are considerable)
I
Hi,
we are on 1.3.1 right now so in case there are differences in the Spark
files I'll walk through the logic of what we did and post a couple gists at
the end. We haven't committed to forking Spark for our own deployments yet,
so right now we shadow some Spark classes in our application code
In rdd.mapPartitions(...), if I try to iterate through the items in the
partition, everything screws up. For example:
val rdd = sc.parallelize(1 to 1000, 3)
val count = rdd.mapPartitions(iter => {
println(iter.length)
iter
}).count()
The count is 0. This is incorrect. The count should be 1000. If
I bring up spark streaming job that uses Kafka as input source.
No data to process and then shut it down. And bring it back again.
This time job does not start because it complains that DStream is not
initialized.
15/06/26 01:10:44 ERROR yarn.ApplicationMaster: User class threw exception:
Make sure you're following the docs regarding setting up a streaming
checkpoint.
Post your code if you can't get it figured out.
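The pattern from the streaming guide, as a minimal sketch (the checkpoint directory and batch interval are placeholders; the Kafka stream setup is elided):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("KafkaConsumer")
  val ssc = new StreamingContext(conf, Seconds(10))
  // build the Kafka DStream and output operations here, then enable checkpointing
  ssc.checkpoint(checkpointDir)
  ssc
}

// recover from the checkpoint if one exists, otherwise build a fresh context
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()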
On Fri, Jun 26, 2015 at 3:45 PM, Ashish Nigam ashnigamt...@gmail.com
wrote:
I bring up spark streaming job that uses Kafka as input source.
No data to process and
Mesos does support running containers as specific users passed to it.
Thanks for chiming in; what else does YARN do with Kerberos besides the keytab
file and user?
Tim
On Fri, Jun 26, 2015 at 1:20 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen
On Fri, Jun 26, 2015 at 2:08 PM, Tim Chen t...@mesosphere.io wrote:
Mesos do support running containers as specific users passed to it.
Thanks for chiming in, what else does YARN do with Kerberos besides keytab
file and user?
The basic things I'd expect from a system to properly support
It used to work with 1.3.1; however, with 1.4.0 I get the following exception
export SPARK_HOME=/home/dvasthimal/spark1.4/spark-1.4.0-bin-hadoop2.4
export
SPARK_JAR=/home/dvasthimal/spark1.4/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar
export
I would pretty much need exactly this kind of feature too
On Fri, Jun 26, 2015 at 21:17, Dave Ariens dari...@blackberry.com wrote:
Hi Timothy,
Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I
need to ensure that my tasks running on the slaves perform a Kerberos
On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen t...@mesosphere.io wrote:
So correct me if I'm wrong, sounds like all you need is a principal user
name and also a keytab file downloaded right?
I'm not familiar with Mesos so don't know what kinds of features it has,
but at the very least it would
Also, I get TaskContext.get() returning null when used in the foreach function
below (I get it when I use it in map, but the whole point here is to handle
something that is breaking in an action). Please help. :(
From: amit assudani aassud...@impetus.com
Date: Friday, June 26, 2015
You can comma separate them or use globbing patterns
2015-06-26 18:54 GMT+02:00 Ted Yu yuzhih...@gmail.com:
See this related thread:
http://search-hadoop.com/m/q3RTtiYm8wgHego1
On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain bahub...@gmail.com wrote:
Hi,
How do we read files from multiple
There are a few security-related issues that I am postponing dealing with. Once
I get this working I'll look at the security side. Likely I'll be encouraging
users to submit their jobs via Docker containers. Regardless, getting the
user's keytab and principal name into the working environment
Hi,
- Spark 1.4 on a single node machine. Run spark-shell
- Reading from a Parquet file with a bunch of text columns and a couple of
amounts in decimal(14,4). The on-disk size of the file is 376M. It has ~100
million rows
- rdd1 = sqlcontext.read.parquet
- rdd1.cache
- group_by_df
I tried something similar and got oration error
I had 10 executors and 10 8 cores
ratings = newrdd.map(lambda l:
Rating(int(l[1]),int(l[2]),l[4])).partitionBy(50)
mypart = ratings.getNumPartitions()
mypart
50
numIterations =10
rank = 100
model = ALS.trainImplicit(ratings, rank,
Do you want to transform the RDD, or just produce some side effect with its
contents? If the latter, you want foreachPartition, not mapPartitions.
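A minimal sketch of both options, using the count example from earlier in the thread:

val rdd = sc.parallelize(1 to 1000, 3)

// side effect only: no iterator needs to be returned
rdd.foreachPartition(iter => println(iter.length))

// if the data is still needed afterwards, avoid returning an iterator you already consumed;
// materializing the partition first is one option (fine when partitions are small)
val count = rdd.mapPartitions { iter =>
  val items = iter.toList
  println(items.length)
  items.iterator
}.count()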
On Fri, Jun 26, 2015 at 11:52 AM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
In rdd.mapPartition(…) if I try to iterate through
Hi Timothy,
Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I need
to ensure that my tasks running on the slaves perform a Kerberos login before
accessing any HDFS resources. To login, they just need the name of the
principal (username) and a keytab file. Then they
I set the number of partitions on the input dataset at 50. The number of
CPU cores I'm using is 84 (7 executors, 12 cores).
I'll look into getting a full stack trace. Any idea what my errors mean,
and why increasing memory causes them to go away? Thanks.
On Fri, Jun 26, 2015 at 11:26 AM,
How do I set these partitions? Is this the call to ALS:
model = ALS.trainImplicit(ratings, rank, numIterations)?
On Jun 26, 2015, at 12:33 PM, Xiangrui Meng men...@gmail.com wrote:
So you have 100 partitions (blocks). This might be too many for your dataset.
Try setting a smaller number of
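For reference, the number of blocks can be passed explicitly to trainImplicit. A Scala sketch of the MLlib call (the values are placeholders, ratings is the existing RDD[Rating] from the thread, and the Python API exposes a similar blocks parameter):

import org.apache.spark.mllib.recommendation.ALS

// arguments: ratings RDD, rank, iterations, lambda, blocks, alpha
// (here: fewer blocks than the default number of input partitions)
val model = ALS.trainImplicit(ratings, 100, 12, 0.01, 50, 0.01)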
So correct me if I'm wrong, sounds like all you need is a principal user
name and also a keytab file downloaded right?
I'm adding support in the Spark framework to download additional files alongside
your executor and driver, and one workaround is to specify a user
principal and keytab file that
Are you using YARN?
If yes, increase the YARN memory overhead option. YARN is probably killing
your executors.
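A sketch of the relevant setting (the value, in MB, is a placeholder to tune; in Spark 1.x the key is spark.yarn.executor.memoryOverhead):

// can also go in spark-defaults.conf or be passed via --conf on spark-submit
val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "2048")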
On 26 Jun 2015 20:43, XianXing Zhang xianxing.zh...@gmail.com wrote:
Do we have any update on this thread? Has anyone met and solved similar
problems before?
Any pointers will be
Hi,
I found out that the instructions for OpenBLAS have been changed by the author
of netlib-java in:
https://github.com/apache/spark/pull/4448 since Spark 1.3.0
In that PR, I asked whether there’s still a need to compile OpenBLAS with
USE_THREAD=0, and also about Intel MKL.
Is it still
Read the Spark streaming guide and the Kafka integration guide for a better
understanding of how the receiver-based stream works.
Capacity planning is specific to your environment and what the job is
actually doing; you'll need to determine it empirically.
On Friday, June 26, 2015, Shushant Arora
Yeah, I noticed all columns are cast into strings. Thanks Alek for pointing
out the solution before I even encountered the problem.
2015-06-26 7:01 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com:
Yeah, I ask because you might notice that by default the column types
for CSV tables read
In 1.2, how do I handle offset management after the stream application starts, in
each job? Should I commit offsets manually after job completion?
And what is the recommended number of consumer threads? Say I have 300 partitions
in the Kafka cluster. Load is ~1 million events per second. Each event is
~500 bytes.
Yes we deployed Spark on top of Yarn.
What you suggested is very helpful, I increased the Yarn memory overhead
option and it helped in most cases. (Sometimes it still has some failures
when the amount of data to be shuffled is large, but I guess if I continue
increasing the Yarn memory overhead
Here's code -
def createStreamingContext(checkpointDirectory: String) : StreamingContext
= {
val conf = new SparkConf().setAppName("KafkaConsumer")
conf.set("spark.eventLog.enabled", "false")
logger.info("Going to init spark context")
conf.getOption("spark.master") match {
case
Not far at all. On large data sets everything simply fails with Spark.
The worst is I am not able to figure out the reason for the failure; the logs run
into millions of lines and I do not know the keywords to search for the failure
reason.
On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf nightwolf...@gmail.com
We went through a similar process, switching from Scalding (where
everything just works on large datasets) to Spark (where it does not).
Spark can be made to work on very large datasets; it just requires a little
more effort. Pay attention to your storage levels (should be
memory-and-disk or
On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com wrote:
Would there be any way to have the task instances in the slaves call the
UGI login with a principal/keytab provided to the driver?
That would only work with a very small number of executors. If you have
many login
Fair. I will look into an alternative with a generated delegation token.
However, the same issue exists. How can I have the executor run some arbitrary
code when it gets a task assignment and before it proceeds to process its
resources?
From: Marcelo Vanzin
Sent: Friday, June 26, 2015 6:20
The scheduler configurations are helpful as well, but not useful without
the information outlined above.
-Sandy
On Fri, Jun 26, 2015 at 10:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
These are my YARN queue configurations
Queue State:RUNNINGUsed Capacity:206.7%Absolute Used
On Fri, Jun 26, 2015 at 3:44 PM, Dave Ariens dari...@blackberry.com wrote:
Fair. I will look into an alternative with a generated delegation token.
However the same issue exists. How can I have the executor run some
arbitrary code when it gets a task assignment and before it proceeds to