Re:

2015-06-26 Thread Akhil Das
When you are doing a sc.textFile() you can actually pass the second parameter which is the number of partitions. Thanks Best Regards On Fri, Jun 26, 2015 at 12:40 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: How can i increase the number of tasks from 174 to 500 without running repartition.
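A minimal sketch of what this refers to, assuming a SparkContext named sc and a hypothetical input path: the second argument to textFile is the minimum number of partitions.

    // Ask for at least 500 partitions when reading; Spark will not create fewer
    // than this (it may create more if the input has more splits).
    val lines = sc.textFile("hdfs:///path/to/input", 500)
    println(lines.partitions.length)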

Re: Spark Shell Hive Context and Kerberos ticket

2015-06-26 Thread Olivier Girardot
Nope, I have not, but I'm glad I'm not the only one :p On Fri, 26 Jun 2015 at 07:54, Tao Li litao.bupt...@gmail.com wrote: Hi Olivier, have you fixed this problem now? I still have this fasterxml NoSuchMethodError. 2015-06-18 3:08 GMT+08:00 Olivier Girardot o.girar...@lateral-thoughts.com:

Spark for distributed dbms cluster

2015-06-26 Thread louis.hust
Hi all, for now Spark is based on Hadoop. I want to use a database cluster instead of Hadoop, so data is distributed across the databases in the cluster. I want to know if Spark is suitable for this situation? Any ideas will be appreciated!

how to do table partitioning efficiently?

2015-06-26 Thread 诺铁
Hi, right now I'm doing something like this on a data frame to make use of table partitioning: df.filter($"sex" === "male").write.parquet("path/to/table/sex=male") df.filter($"sex" === "female").write.parquet("path/to/table/sex=female") This filters the dataset multiple times; is there a better way to do this?
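One alternative, sketched under the assumption of Spark 1.4's DataFrameWriter: partitionBy writes one subdirectory per partition value in a single pass, so the dataset is scanned only once.

    // Produces path/to/table/sex=male and path/to/table/sex=female in one write
    df.write.partitionBy("sex").parquet("path/to/table")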

RE: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Ashic Mahtab
Thanks for the replies, guys. Is this a permanent change as of 1.3, or will it go away at some point? Also, does it require an entire Hadoop installation, or just WinUtils.exe? Thanks,Ashic. Date: Fri, 26 Jun 2015 18:22:03 +1000 Subject: Re: Recent spark sc.textFile needs hadoop for folders?!?

Problem after enabling Hadoop native libraries

2015-06-26 Thread Arunabha Ghosh
Hi, I'm having trouble reading Bzip2 compressed sequence files after I enabled Hadoop native libraries in Spark. Running LD_LIBRARY_PATH=$HADOOP_HOME/lib/native/ $SPARK_HOME/bin/spark-submit --class gives the following error: 15/06/26 00:48:02 INFO CodecPool: Got brand-new decompressor

Re: Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-26 Thread Akhil Das
Try to add them in the SPARK_CLASSPATH in your conf/spark-env.sh file Thanks Best Regards On Thu, Jun 25, 2015 at 9:31 PM, Bin Wang binwang...@gmail.com wrote: I am trying to run the Spark example code HBaseTest from command line using spark-submit instead run-example, in that case, I can

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Akhil Das
You just need to set your HADOOP_HOME which appears to be null in the stackstrace. If you are not having the winutils.exe, then you can download https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip and put it there. Thanks Best Regards On Thu, Jun 25, 2015 at 11:30 PM, Ashic

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Sean Owen
Yes, Spark Core depends on Hadoop libs, and there is this unfortunate twist on Windows. You'll still need HADOOP_HOME set appropriately since Hadoop needs some special binaries to work on Windows. On Fri, Jun 26, 2015 at 11:06 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You just need to set
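A sketch of the usual Windows workaround, assuming winutils.exe has been placed under C:\hadoop\bin (the path is an assumption; substitute your own):

    // Must run before the SparkContext is created; the directory must contain bin\winutils.exe.
    // Setting the HADOOP_HOME environment variable to the same directory has the same effect.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")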

Re: Spark for distributed dbms cluster

2015-06-26 Thread Akhil Das
Which distributed database are you referring here? Spark can connect with almost all those databases out there (You just need to pass the Input/Output Format classes or there are a bunch of connectors also available). Thanks Best Regards On Fri, Jun 26, 2015 at 12:07 PM, louis.hust

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread ayan guha
It's a problem since 1.3 I think On 26 Jun 2015 04:00, Ashic Mahtab as...@live.com wrote: Hello, Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've noticed the following: * On 1.4, sc.textFile(D:\\folder\\).collect() fails from both spark-shell.cmd and when running a

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Roman Sokolov
OK, but what does it mean? I did not change the core files of Spark, so is it a bug there? PS: on small datasets (500 MB) I have no problem. On 25.06.2015 at 18:02, Ted Yu yuzhih...@gmail.com wrote: The assertion failure from TriangleCount.scala corresponds with the following lines:

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Steve Loughran
On 26 Jun 2015, at 09:29, Ashic Mahtab as...@live.com wrote: Thanks for the replies, guys. Is this a permanent change as of 1.3, or will it go away at some point? Don't blame the spark team, complain to the hadoop team for being slow to embrace the java 1.7 APIs for

Re: Performing sc.parallelize (..) in workers not in the driver program

2015-06-26 Thread Akhil Das
Why do you want to do that? Thanks Best Regards On Thu, Jun 25, 2015 at 10:16 PM, shahab shahab.mok...@gmail.com wrote: Hi, Apparently, the sc.parallelize (..) operation is performed in the driver program, not in the workers! Is it possible to do this in a worker process for the sake of

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Akhil Das
It's a Scala version conflict, can you paste your build.sbt file? Thanks Best Regards On Fri, Jun 26, 2015 at 7:05 AM, stati srikanth...@gmail.com wrote: Hello, When I run a spark job with spark-submit it fails with below exception for code line /*val webLogDF =

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Hmm, not sure why, but when I run this code, it always keeps on consuming from Kafka and proceeds ignoring the previous failed batches, Also, Now that I get the attempt number from TaskContext and I have information of max retries, I am supposed to handle it in the try/catch block, but does it

Multiple dir support : newApiHadoopFile

2015-06-26 Thread Bahubali Jain
Hi, How do we read files from multiple directories using newApiHadoopFile () ? Thanks, Baahu -- Twitter:http://twitter.com/Baahu

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Ted Yu
See this related thread: http://search-hadoop.com/m/q3RTtiYm8wgHego1 On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain bahub...@gmail.com wrote: Hi, How do we read files from multiple directories using newApiHadoopFile () ? Thanks, Baahu -- Twitter:http://twitter.com/Baahu

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ravi Mody
Forgot to mention: rank of 100 usually works ok, 120 consistently cannot finish. On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody rmody...@gmail.com wrote: 1. These are my settings: rank = 100 iterations = 12 users = ~20M items = ~2M training examples = ~500M-1B (I'm running into the issue even

Re:

2015-06-26 Thread ๏̯͡๏
My imports: import org.apache.avro.generic.GenericData import org.apache.avro.generic.GenericRecord import org.apache.avro.mapred.AvroKey import org.apache.avro.Schema import org.apache.hadoop.io.NullWritable import org.apache.avro.mapreduce.AvroKeyInputFormat import

Re: How to recover in case user errors in streaming

2015-06-26 Thread Cody Koeninger
No, if you have a bad message that you are continually throwing exceptions on, your stream will not progress to future batches. On Fri, Jun 26, 2015 at 10:28 AM, Amit Assudani aassud...@impetus.com wrote: Also, what I understand is, max failures doesn’t stop the entire stream, it fails the

Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Thank you, Jie! Very nice work! -- Nan Zhu http://codingcat.me On Friday, June 26, 2015 at 8:17 AM, Huang, Jie wrote: Correct. Your calculation is right! We have been aware of that kmeans performance drop also. According to our observation, it is caused by some unbalanced

Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Ashish Soni
Hi, if I have the below data format, how can I use the Kafka direct stream to deserialize it? I am not able to understand all the parameters I need to pass. Can someone explain what the arguments should be, as I am not clear about this JavaPairInputDStream

Dependency Injection with Spark Java

2015-06-26 Thread Michal Čizmazia
How to use Dependency Injection with Spark Java? Please could you point me to any articles/frameworks? Thanks!

Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread RedOakMark
Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it. Using these instructions ( http://spark.apache.org/docs/latest/ec2-scripts.html

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-26 Thread Thomas Gerber
Note that this problem is probably NOT caused directly by GraphX, but GraphX reveals it because as you go further down the iterations, you get further and further away from a shuffle you can rely on. On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, We run

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Shivaram Venkataraman
We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this. Thanks Shivaram On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark m...@redoakstrategic.com wrote: Good

Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
I understand that Kerberos support for accessing Hadoop resources in Spark only works when running Spark on YARN. However, I'd really like to hack something together for Spark on Mesos running alongside a secured Hadoop cluster. My simplified application (gist:

Re:

2015-06-26 Thread Silvio Fiorito
OK, here’s how I did it, using just the built-in Avro libraries with Spark 1.3: import org.apache.avro.generic.{GenericData, GenericRecord} import org.apache.avro.mapred.AvroKey import org.apache.avro.mapreduce.AvroKeyInputFormat import org.apache.hadoop.io.NullWritable import
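A minimal sketch of that approach for Avro generic records, assuming a SparkContext named sc and a hypothetical input path; the field name accessed at the end is also an assumption.

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    // Each element is (AvroKey[GenericRecord], NullWritable); fields are read by name
    val avro = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]]("hdfs:///path/to/*.avro")
    val values = avro.map { case (key, _) => key.datum().get("someField") }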

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Eugen Cepoi
Comma-separated paths work only with Spark 1.4 and up. 2015-06-26 18:56 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com: You can comma-separate them or use globbing patterns 2015-06-26 18:54 GMT+02:00 Ted Yu yuzhih...@gmail.com: See this related thread:
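A sketch of both options mentioned in this thread, assuming plain text input and hypothetical paths; on releases older than 1.4, unioning one RDD per directory is the fallback.

    // Comma-separated paths and globbing patterns
    val combined = sc.textFile("/data/dir1,/data/dir2")
    val globbed  = sc.textFile("/data/2015-06-*/part-*")

    // Fallback for older versions: one RDD per directory, then union them
    val dirs = Seq("/data/dir1", "/data/dir2")
    val unioned = sc.union(dirs.map(d => sc.textFile(d)))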

Re: Kryo serialization of classes in additional jars

2015-06-26 Thread patcharee
Hi, I am having this problem on spark 1.4. Do you have any ideas how to solve it? I tried to use spark.executor.extraClassPath, but it did not help BR, Patcharee On 04. mai 2015 23:47, Imran Rashid wrote: Oh, this seems like a real pain. You should file a jira, I didn't see an open issue

Re: Problem after enabling Hadoop native libraries

2015-06-26 Thread Marcelo Vanzin
What master are you using? If this is not a local master, you'll need to set LD_LIBRARY_PATH on the executors also (using spark.executor.extraLibraryPath). If you are using local, then I don't know what's going on. On Fri, Jun 26, 2015 at 1:39 AM, Arunabha Ghosh arunabha...@gmail.com wrote:

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Timothy Chen
Hi Dave, I don't understand Kerberos much, but if you know the exact steps that need to happen I can see how we can make that happen with the Spark framework. Tim On Jun 26, 2015, at 8:49 AM, Dave Ariens dari...@blackberry.com wrote: I understand that Kerberos support for accessing

Re: sparkR could not find function textFile

2015-06-26 Thread Eskilson,Aleksander
Yeah, I ask because you might notice that by default the column types for CSV tables read in by read.df() are only strings (due to limitations in type inferencing in the DataBricks package). There was a separate discussion about schema inferencing, and Shivaram recently merged support for

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Ashish Soni
My question is why there are two similar parameters, String.class and StringDecoder.class; what is the difference between them? Ashish On Fri, Jun 26, 2015 at 8:53 AM, Akhil Das ak...@sigmoidanalytics.com wrote: JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(

Re:

2015-06-26 Thread ๏̯͡๏
<dependency> <groupId>org.apache.avro</groupId> <artifactId>avro</artifactId> <version>1.7.7</version> <scope>provided</scope> </dependency> <dependency> <groupId>com.databricks</groupId> <artifactId>spark-avro_2.10</artifactId> <version>1.0.0</version> </dependency> <dependency> <groupId>org.apache.avro</groupId>

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ayman Farahat
I use the mllib not the ML. Does that make a difference ? Sent from my iPhone On Jun 26, 2015, at 7:19 AM, Ravi Mody rmody...@gmail.com wrote: Forgot to mention: rank of 100 usually works ok, 120 consistently cannot finish. On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody rmody...@gmail.com

Re: How to recover in case user errors in streaming

2015-06-26 Thread Cody Koeninger
If you're consistently throwing exceptions and thus failing tasks, once you reach max failures the whole stream will stop. It's up to you to either catch those exceptions, or restart your stream appropriately once it stops. Keep in mind that if you're relying on checkpoints, and fixing the error

How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Problem: how do we recover from user errors (connectivity issues / storage service down / etc.)? Environment: Spark streaming using Kafka Direct Streams Code Snippet: HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(kafkaTopic1)); HashMap<String, String> kafkaParams = new HashMap<String,

Re: Master dies after program finishes normally

2015-06-26 Thread Yifan LI
Hi, I just encountered the same problem when I ran a PageRank program which has lots of stages (iterations)… The master was lost after my program was done. And the issue still remains even though I increased driver memory. Any idea, e.g. how to increase the master memory? Thanks. Best, Yifan

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Benjamin Fradet
There is one for the key of your Kafka message and one for its value. On 26 Jun 2015 4:21 pm, Ashish Soni asoni.le...@gmail.com wrote: my question is why there are similar two parameter String.Class and StringDecoder.class what is the difference each of them ? Ashish On Fri, Jun 26, 2015 at
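A sketch of how the type parameters line up in the Scala API, assuming String keys and values; ssc, kafkaParams and topicSet are placeholders for your own streaming context, Kafka configuration and topic set.

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // [key type, value type, key decoder, value decoder]
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicSet)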

Re: hadoop input/output format advanced control

2015-06-26 Thread ๏̯͡๏
I am trying the very same thing to configure the min split size with Spark 1.3.1 and I get a compilation error. Code: val hadoopConfiguration = new Configuration(sc.hadoopConfiguration) hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864")

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
We've seen this issue as well in production. We also aren't sure what causes it, but have just recently shaded some of the Spark code in TaskSchedulerImpl that we use to effectively bubble up an exception from Spark instead of zombie in this situation. If you are interested I can go into more

Re:

2015-06-26 Thread ๏̯͡๏
Same code of yours works for me as well On Fri, Jun 26, 2015 at 8:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Is that its not supported with Avro. Unlikely. On Fri, Jun 26, 2015 at 8:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: My imports: import

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Thanks for quick response, My question here is how do I know that the max retries are done ( because in my code I never know whether it is failure of first try or the last try ) and I need to handle this message, is there any callback ? Also, I know the limitation of checkpoint in upgrading

Re: How to recover in case user errors in streaming

2015-06-26 Thread Cody Koeninger
TaskContext has an attemptNumber method on it. If you want to know which messages failed, you have access to the offsets, and can do whatever you need to with them. On Fri, Jun 26, 2015 at 10:21 AM, Amit Assudani aassud...@impetus.com wrote: Thanks for quick response, My question here is
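A hedged sketch of that idea inside the per-partition processing; process, logBadRecord and the retry limit are assumptions standing in for your own logic and configuration.

    import org.apache.spark.TaskContext

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val attempt = Option(TaskContext.get()).map(_.attemptNumber()).getOrElse(0)
        val maxRetries = 4  // assumption: keep in sync with spark.task.maxFailures
        records.foreach { record =>
          try {
            process(record)                // placeholder for your own processing
          } catch {
            case e: Exception if attempt >= maxRetries - 1 =>
              logBadRecord(record, e)      // last attempt: park the bad record instead of failing again
            case e: Exception => throw e   // earlier attempts: let the task fail and retry
          }
        }
      }
    }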

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ravi Mody
1. These are my settings: rank = 100 iterations = 12 users = ~20M items = ~2M training examples = ~500M-1B (I'm running into the issue even with 500M training examples) 2. The memory storage never seems to go too high. The user blocks may go up to ~10Gb, and each executor will have a few GB used

Re:

2015-06-26 Thread ๏̯͡๏
All these throw a compilation error at newAPIHadoopFile 1) val hadoopConfiguration = new Configuration() hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") sc.newAPIHadoopFile[AvroKey, NullWritable, AvroKeyInputFormat](path + "/*.avro", classOf[AvroKey],

Re:

2015-06-26 Thread Silvio Fiorito
Make sure you’re importing the right namespace for Hadoop v2.0. This is what I tried: import org.apache.hadoop.io.{LongWritable, Text} import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat} val hadoopConf = new org.apache.hadoop.conf.Configuration()
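Continuing that sketch, the newAPIHadoopFile overload that takes an explicit Configuration is where the split size goes; the path and split size below are assumptions.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864")  // 64 MB splits

    // (path, InputFormat class, key class, value class, conf)
    val rdd = sc.newAPIHadoopFile("hdfs:///path/to/data", classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], hadoopConf)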

Re:

2015-06-26 Thread ๏̯͡๏
Is that its not supported with Avro. Unlikely. On Fri, Jun 26, 2015 at 8:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: My imports: import org.apache.avro.generic.GenericData import org.apache.avro.generic.GenericRecord import org.apache.avro.mapred.AvroKey import

Re: spark streaming with kafka reset offset

2015-06-26 Thread Cody Koeninger
The receiver-based kafka createStream in spark 1.2 uses zookeeper to store offsets. If you want finer-grained control over offsets, you can update the values in zookeeper yourself before starting the job. createDirectStream in spark 1.3 is still marked as experimental, and subject to change.

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
No, they use the same implementation. On Fri, Jun 26, 2015 at 8:05 AM, Ayman Farahat ayman.fara...@yahoo.com wrote: I use the mllib not the ML. Does that make a difference ? Sent from my iPhone On Jun 26, 2015, at 7:19 AM, Ravi Mody rmody...@gmail.com wrote: Forgot to mention: rank of 100

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
Please see my comments inline. It would be helpful if you can attach the full stack trace. -Xiangrui On Fri, Jun 26, 2015 at 7:18 AM, Ravi Mody rmody...@gmail.com wrote: 1. These are my settings: rank = 100 iterations = 12 users = ~20M items = ~2M training examples = ~500M-1B (I'm running

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Also, what I understand is, max failures doesn’t stop the entire stream, it fails the job created for the specific batch, but the subsequent batches still proceed, isn’t it right ? And question still remains, how to keep track of those failed batches ? From: amit assudani

Re: Executors requested are way less than what i actually got

2015-06-26 Thread ๏̯͡๏
These are my YARN queue configurations: Queue State: RUNNING, Used Capacity: 206.7%, Absolute Used Capacity: 3.1%, Absolute Capacity: 1.5%, Absolute Max Capacity: 10.0%, Used Resources: memory: 5578496, vCores: 390, Num Schedulable Applications: 7, Num Non-Schedulable Applications: 0, Num Containers: 390, Max

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread XianXing Zhang
Do we have any update on this thread? Has anyone met and solved similar problems before? Any pointers will be greatly appreciated! Best, XianXing On Mon, Jun 15, 2015 at 11:48 PM, Jia Yu jia...@asu.edu wrote: Hi Peng, I got exactly the same error! My shuffle data is also very large. Have you

Re: Unable to specify multiple directories as input

2015-06-26 Thread ๏̯͡๏
So for each directory you create one RDD and then union them all. On Fri, Jun 26, 2015 at 10:05 AM, Bahubali Jain bahub...@gmail.com wrote: oh.. my use case is not very straightforward. The input can have multiple directories... On Fri, Jun 26, 2015 at 9:30 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: Dependency Injection with Spark Java

2015-06-26 Thread Igor Berman
I asked myself the same question today... It actually depends on what you are trying to do: if you want injection into the workers' code, I think it will be a bit hard... If it is only in code that the driver executes, i.e. in main, it's straightforward imho; just create your classes from the injector (e.g. Spring's application
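A minimal sketch of the driver-side wiring being described, with plain constructor injection standing in for a real injector; all class names here are hypothetical.

    // Wiring happens once in the driver's main; only small values (or broadcasts)
    // should be captured by the closures that are shipped to the workers.
    class Settings(val threshold: Int)
    class Job(settings: Settings) {
      def run(sc: org.apache.spark.SparkContext): Long = {
        val threshold = settings.threshold  // copy to a local so the closure does not capture Job itself
        sc.parallelize(1 to 100).filter(_ > threshold).count()
      }
    }

    val job = new Job(new Settings(threshold = 50))  // or fetch both from Spring's ApplicationContext
    println(job.run(sc))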

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread mark
So you created an EC2 instance with RStudio installed first, then installed Spark under that same username?  That makes sense, I just want to verify your work flow. Thank you again for your willingness to help! On Fri, Jun 26, 2015 at 10:13 AM -0700, Shivaram Venkataraman

Re:

2015-06-26 Thread ๏̯͡๏
Silvio, Thanks for your responses and patience. It worked after i reshuffled the arguments and removed avro dependencies. On Fri, Jun 26, 2015 at 9:55 AM, Silvio Fiorito silvio.fior...@granturing.com wrote: OK, here’s how I did it, using just the built-in Avro libraries with Spark 1.3:

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ayman Farahat
Hello; I checked on my partitions/storage and here is what I have: 80 executors, 5 GB per executor. Do I need to set additional params, say cores? spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g #

How to concatenate two csv files into one RDD?

2015-06-26 Thread Rex X
With Python Pandas, it is easy to do concatenation of dataframes by combining pandas.concat http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html and pandas.read_csv pd.concat([pd.read_csv(os.path.join(Path_to_csv_files, f)) for f in csvfiles]) where csvfiles is the list of
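A sketch of the Spark-side equivalent, assuming line-oriented CSV and hypothetical paths: textFile accepts a comma-separated list of files, and the resulting RDD behaves like their concatenation.

    val csvFiles = Seq("/data/a.csv", "/data/b.csv")      // placeholder paths
    val allLines = sc.textFile(csvFiles.mkString(","))    // one RDD over all the files
    // Note: if each file carries a header row, those rows still need to be filtered out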

RE: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Ashic Mahtab
Thanks for the awesome response, Steve. As you say, it's not ideal, but the clarification greatly helps. Cheers, everyone :) -Ashic. Subject: Re: Recent spark sc.textFile needs hadoop for folders?!? From: ste...@hortonworks.com To: as...@live.com CC: guha.a...@gmail.com; user@spark.apache.org

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Mark Stephenson
Thanks! In your demo video, were you using RStudio to hit a separate EC2 Spark cluster? I noticed that it appeared your browser that you were using EC2 at that time, so I was just curious. It appears that might be one of the possible workarounds - fire up a separate EC2 instance with RStudio

YARN worker out of disk memory

2015-06-26 Thread Tarun Garg
Hi, I am running a Spark job over YARN; after 2-3 hours of execution the workers start dying, and I found a lot of files at /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1435184713615_0008/blockmgr-333f0ade-2474-43a6-9960-f08a15bcc7b7/3f named temp_shuffle. My job is

Spark SQL - Setting YARN Classpath for primordial class loader

2015-06-26 Thread Kumaran Mani
Hi, The response to the below thread for making yarn-client mode work by adding the JDBC driver JAR to spark.{driver,executor}.extraClassPath works fine. http://mail-archives.us.apache.org/mod_mbox/spark-user/201504.mbox/%3CCAAOnQ7vHeBwDU2_EYeMuQLyVZ77+N_jDGuinxOB=sff2lkc...@mail.gmail.com%3E

Re:

2015-06-26 Thread Silvio Fiorito
No worries, glad to help! It also helped me as I had not worked directly with the Hadoop APIs for controlling splits. From: ÐΞ€ρ@Ҝ (๏̯͡๏) Date: Friday, June 26, 2015 at 1:31 PM To: Silvio Fiorito Cc: user Subject: Re: Silvio, Thanks for your responses and patience. It worked after i reshuffled

spilling in-memory map of 5.1 MB to disk (272 times so far)

2015-06-26 Thread igor.berman
Hi, I wanted to get some advice regarding tuning a Spark application. For some of the tasks I see many log entries like this: Executor task launch worker-38 ExternalAppendOnlyMap: Thread 239 spilling in-memory map of 5.1 MB to disk (272 times so far) (especially when inputs are considerable) I

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
Hi, we are on 1.3.1 right now so in case there are differences in the Spark files I'll walk through the logic of what we did and post a couple gists at the end. We haven't committed to forking Spark for our own deployments yet, so right now we shadow some Spark classes in our application code

Cannot iterate items in rdd.mapPartition()

2015-06-26 Thread Wang, Ningjun (LNG-NPV)
In rdd.mapPartitions(...), if I try to iterate through the items in the partition, everything goes wrong. For example: val rdd = sc.parallelize(1 to 1000, 3) val count = rdd.mapPartitions(iter => { println(iter.length); iter }).count() The count is 0. This is incorrect. The count should be 1000. If

spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-26 Thread Ashish Nigam
I bring up spark streaming job that uses Kafka as input source. No data to process and then shut it down. And bring it back again. This time job does not start because it complains that DStream is not initialized. 15/06/26 01:10:44 ERROR yarn.ApplicationMaster: User class threw exception:

Re: spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-26 Thread Cody Koeninger
Make sure you're following the docs regarding setting up a streaming checkpoint. Post your code if you can't get it figured out. On Fri, Jun 26, 2015 at 3:45 PM, Ashish Nigam ashnigamt...@gmail.com wrote: I bring up spark streaming job that uses Kafka as input source. No data to process and

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Tim Chen
Mesos do support running containers as specific users passed to it. Thanks for chiming in, what else does YARN do with Kerberos besides keytab file and user? Tim On Fri, Jun 26, 2015 at 1:20 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 2:08 PM, Tim Chen t...@mesosphere.io wrote: Mesos do support running containers as specific users passed to it. Thanks for chiming in, what else does YARN do with Kerberos besides keytab file and user? The basic things I'd expect from a system to properly support

Unable to start Pi (hello world) application on Spark 1.4

2015-06-26 Thread ๏̯͡๏
It used to work with 1.3.1, however with 1.4.0 i get the following exception export SPARK_HOME=/home/dvasthimal/spark1.4/spark-1.4.0-bin-hadoop2.4 export SPARK_JAR=/home/dvasthimal/spark1.4/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar export

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Olivier Girardot
I would pretty much need exactly this kind of feature too. On Fri, 26 Jun 2015 at 21:17, Dave Ariens dari...@blackberry.com wrote: Hi Timothy, Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I need to ensure that my tasks running on the slaves perform a Kerberos

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen t...@mesosphere.io wrote: So correct me if I'm wrong, sounds like all you need is a principal user name and also a keytab file downloaded right? I'm not familiar with Mesos so don't know what kinds of features it has, but at the very least it would

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Also, I get TaskContext.get() null when used in foreach function below ( I get it when I use it in map, but the whole point here is to handle something that is breaking in action ). Please help. :( From: amit assudani aassud...@impetus.com Date: Friday, June 26, 2015

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Eugen Cepoi
You can comma separate them or use globbing patterns 2015-06-26 18:54 GMT+02:00 Ted Yu yuzhih...@gmail.com: See this related thread: http://search-hadoop.com/m/q3RTtiYm8wgHego1 On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain bahub...@gmail.com wrote: Hi, How do we read files from multiple

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
There are a few security related issues that I am postponing dealing with. Once I get this working I'll look at the security side. Likely I'll be encouraging users to submit their jobs via docker containers. Regardless, getting the user's keytab and principal name in the working environment

Spark 1.4 - memory bloat in group by/aggregate???

2015-06-26 Thread Manoj Samel
Hi, - Spark 1.4 on a single node machine. Run spark-shell - Reading from Parquet file with bunch of text columns and couple of amounts in decimal(14,4). On disk size of of the file is 376M. It has ~100 million rows - rdd1 = sqlcontext.read.parquet - rdd1.cache - group_by_df

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS : Too many values to unpack

2015-06-26 Thread Ayman Farahat
I tried something similar and got oration error I had 10 executors and 10 8 cores ratings = newrdd.map(lambda l: Rating(int(l[1]),int(l[2]),l[4])).partitionBy(50) mypart = ratings.getNumPartitions() mypart 50 numIterations =10 rank = 100 model = ALS.trainImplicit(ratings, rank,

Re: Cannot iterate items in rdd.mapPartition()

2015-06-26 Thread Mark Hamstra
Do you want to transform the RDD, or just produce some side effect with its contents? If the latter, you want foreachPartition, not mapPartitions. On Fri, Jun 26, 2015 at 11:52 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: In rdd.mapPartition(…) if I try to iterate through
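A sketch of both fixes for the example above: materialize the iterator before consuming it twice, or use foreachPartition when only a side effect is needed.

    // Materialize the partition so it can be both inspected and returned
    val count = rdd.mapPartitions { iter =>
      val items = iter.toList     // careful: pulls the whole partition into memory
      println(items.length)
      items.iterator
    }.count()                     // now yields 1000 as expected

    // If only the side effect matters, foreachPartition avoids the issue entirely
    rdd.foreachPartition(iter => println(iter.length))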

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
Hi Timothy, Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I need to ensure that my tasks running on the slaves perform a Kerberos login before accessing any HDFS resources. To login, they just need the name of the principal (username) and a keytab file. Then they

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ravi Mody
I set the number of partitions on the input dataset at 50. The number of CPU cores I'm using is 84 (7 executors, 12 cores). I'll look into getting a full stack trace. Any idea what my errors mean, and why increasing memory causes them to go away? Thanks. On Fri, Jun 26, 2015 at 11:26 AM,

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ayman Farahat
How do I set these partitions? Is it in the call to ALS, model = ALS.trainImplicit(ratings, rank, numIterations)? On Jun 26, 2015, at 12:33 PM, Xiangrui Meng men...@gmail.com wrote: So you have 100 partitions (blocks). This might be too many for your dataset. Try setting a smaller number of
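For reference, a sketch of where the block count goes in the MLlib call (Scala API shown; ratings is assumed to be an RDD[Rating], the numeric values are assumptions to be tuned, and -1 for blocks lets MLlib pick automatically):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // (ratings, rank, iterations, lambda, blocks, alpha)
    val model = ALS.trainImplicit(ratings, 100, 12, 0.01, 20, 40.0)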

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Tim Chen
So correct me if I'm wrong, sounds like all you need is a principal user name and also a keytab file downloaded right? I'm adding support from spark framework to download additional files along side your executor and driver, and one workaround is to specify a user principal and keytab file that

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread Eugen Cepoi
Are you using YARN? If yes, increase the YARN memory overhead option. YARN is probably killing your executors. On 26 Jun 2015 at 20:43, XianXing Zhang xianxing.zh...@gmail.com wrote: Do we have any update on this thread? Has anyone met and solved similar problems before? Any pointers will be
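A sketch of the setting in question, assuming yarn-client mode where it can be set on the SparkConf before the context is created (in cluster mode, pass the same key via --conf to spark-submit instead); the value is an assumption to be tuned for your job.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "1024")  // MB of off-heap headroom per executor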

The usage of OpenBLAS

2015-06-26 Thread Tsai Li Ming
Hi, I found out that the instructions for OpenBLAS has been changed by the author of netlib-java in: https://github.com/apache/spark/pull/4448 since Spark 1.3.0 In that PR, I asked whether there’s still a need to compile OpenBLAS with USE_THREAD=0, and also about Intel MKL. Is it still

Re: spark streaming with kafka reset offset

2015-06-26 Thread Cody Koeninger
Read the Spark streaming guide and the Kafka integration guide for a better understanding of how the receiver-based stream works. Capacity planning is specific to your environment and what the job is actually doing; you'll need to determine it empirically. On Friday, June 26, 2015, Shushant Arora

Re: sparkR could not find function textFile

2015-06-26 Thread Wei Zhou
Yeah, I noticed all columns are cast into strings. Thanks Alek for pointing out the solution before I even encountered the problem. 2015-06-26 7:01 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com: Yeah, I ask because you might notice that by default the column types for CSV tables read

Re: spark streaming with kafka reset offset

2015-06-26 Thread Shushant Arora
In 1.2, how do I handle offset management after the stream application starts, in each job? Should I commit the offset manually after job completion? And what is the recommended number of consumer threads? Say I have 300 partitions in the Kafka cluster. Load is ~1 million events per second. Each event is ~500 bytes.

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread XianXing Zhang
Yes we deployed Spark on top of Yarn. What you suggested is very helpful, I increased the Yarn memory overhead option and it helped in most cases. (Sometime it still has some failures when the amount of data to be shuffled is large, but I guess if I continue increasing the Yarn memory overhead

Re: spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-26 Thread Ashish Nigam
Here's code - def createStreamingContext(checkpointDirectory: String): StreamingContext = { val conf = new SparkConf().setAppName("KafkaConsumer") conf.set("spark.eventLog.enabled", "false") logger.info("Going to init spark context") conf.getOption("spark.master") match { case
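For reference, the documented recovery pattern wraps such a factory in StreamingContext.getOrCreate, so the same checkpoint directory is used both to create and to restore the context; this sketch assumes the createStreamingContext above also calls ssc.checkpoint(checkpointDirectory).

    import org.apache.spark.streaming.StreamingContext

    val checkpointDirectory = "hdfs:///path/to/checkpoints"   // placeholder
    val ssc = StreamingContext.getOrCreate(checkpointDirectory,
      () => createStreamingContext(checkpointDirectory))
    ssc.start()
    ssc.awaitTermination()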

Re: Join highly skewed datasets

2015-06-26 Thread ๏̯͡๏
Not far at all. On large data sets everything simply fails with Spark. Worst is I am not able to figure out the reason for the failure; the logs run into millions of lines and I do not know the keywords to search for the failure reason. On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf nightwolf...@gmail.com

Re: Join highly skewed datasets

2015-06-26 Thread Koert Kuipers
we went through a similar process, switching from scalding (where everything just works on large datasets) to spark (where it does not). spark can be made to work on very large datasets, it just requires a little more effort. pay attention to your storage levels (should be memory-and-disk or

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com wrote: Would there be any way to have the task instances in the slaves call the UGI login with a principal/keytab provided to the driver? That would only work with a very small number of executors. If you have many login

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
Fair. I will look into an alternative with a generated delegation token. However the same issue exists. How can I have the executor run some arbitrary code when it gets a task assignment and before it proceeds to process its resources? From: Marcelo Vanzin Sent: Friday, June 26, 2015 6:20

Re: Executors requested are way less than what i actually got

2015-06-26 Thread Sandy Ryza
The scheduler configurations are helpful as well, but not useful without the information outlined above. -Sandy On Fri, Jun 26, 2015 at 10:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: These are my YARN queue configurations: Queue State: RUNNING, Used Capacity: 206.7%, Absolute Used

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 3:44 PM, Dave Ariens dari...@blackberry.com wrote: Fair. I will look into an alternative with a generated delegation token. However the same issue exists. How can I have the executor run some arbitrary code when it gets a task assignment and before it proceeds to

  1   2   >