Re: Spark Streaming Job gets killed after running for about 1 hour

2016-04-24 Thread أنس الليثي
I am using the latest Spark version, 1.6. I have increased the maximum number of open files using this command: *sysctl -w fs.file-max=3275782*. I also increased the limit for the user who runs the Spark job by updating the /etc/security/limits.conf file. The soft limit is 1024 and the hard limit is 65536. T…
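A minimal sketch of the limits.conf entries being described, with a hypothetical username for the account that runs the job; "nofile" is the per-user open-file limit, and note that the soft limit of 1024 quoted above is far below the hard limit:

    # /etc/security/limits.conf -- hypothetical entries
    # <domain>   <type>  <item>   <value>
    sparkuser    soft    nofile   65536
    sparkuser    hard    nofile   65536

A process only gets the soft limit by default, so a soft limit of 1024 can still exhaust file handles even when the hard limit is 65536.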

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Praveen Devarao
Thanks Reynold for the reason as to why sortByKey invokes a job. When you say "DataFrame/Dataset does not have this issue", is it right to assume you are referring to Spark 2.0, or does the Spark 1.6 DF already have this built in? Thanking You --

Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Praveen Devarao
Hi, I have a streaming program with the block as below [ref: https://github.com/agsachin/streamingBenchmark/blob/master/spark-benchmarks/src/main/scala/TwitterStreaming.scala ]:

    1 val lines = messages.map(_._2)
    2 val hashTags = lines.flatMap(status => status.split(" ").filter(_.startsWit…

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Reynold Xin
Usually no - but sortByKey does, because it needs the range boundaries to be built in order to have the RDD. It is a long-standing problem that's unfortunately very difficult to solve without breaking the RDD API. In DataFrame/Dataset we don't have this issue, though. On Sun, Apr 24, 2016 at 10:54 P…
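A minimal Scala sketch of the behavior described, assuming a spark-shell session: merely defining the sorted RDD triggers a sampling job, because RangePartitioner must scan the keys to compute range boundaries, while the real sort only runs at the action.

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
    // This line alone launches a sampling job in the UI: RangePartitioner
    // collects a sample of the keys to choose range boundaries.
    val sorted = pairs.sortByKey()
    // Only now does the actual sort job run.
    sorted.collect().foreach(println)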

[Spark 1.5.2] All data being written to only one part file; the rest of the part files are empty

2016-04-24 Thread Divya Gehlot
Hi, after joining two DataFrames I am saving the result using Spark CSV, but all the result data is being written to only one part file: 200 part files are created, and the remaining 199 are empty. What is the cause of the uneven partitioning? How can I evenly distribute the data? Would…
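For context, the 200 part files come from the default spark.sql.shuffle.partitions = 200; all rows landing in one of them suggests the join keys hash to a single partition. A hedged sketch of a common workaround, assuming hypothetical df1/df2, join column, and output path: repartition before writing so rows are spread across the output files.

    // Redistribute rows across 8 partitions before writing (spark-csv package).
    val joined = df1.join(df2, df1("id") === df2("id")) // "id" is hypothetical
    joined.repartition(8)
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///tmp/joined-output")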

Re: Profiling memory use and access

2016-04-24 Thread Takeshi Yamamuro
Hi, you can use YourKit to profile workloads; please see: https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit // maropu On Mon, Apr 25, 2016 at 10:24 AM, Edmon Begoli wrote: > I am working on experimental research into memory use and profiling of…
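A hedged sketch of the kind of wiring the linked wiki describes: point the executor JVM options at the YourKit agent. The agent path is hypothetical and depends on where YourKit is installed.

    import org.apache.spark.SparkConf

    // Attach the YourKit profiling agent to every executor JVM.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so")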

Profiling memory use and access

2016-04-24 Thread Edmon Begoli
I am working on experimental research into memory use, profiling memory use and allocation by machine learning functions across a number of popular libraries. Is there a facility within Spark, and MLlib specifically, to track the allocation and use of data frames/memory by MLlib? Please adv…
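One rough building block Spark does expose is org.apache.spark.util.SizeEstimator, which walks an object graph and estimates its heap footprint. A minimal sketch, assuming a spark-shell session; the numbers are estimates, not exact allocations:

    import org.apache.spark.util.SizeEstimator
    import org.apache.spark.mllib.linalg.Vectors

    // Approximate the in-memory size of an MLlib dense vector.
    val v = Vectors.dense(Array.fill(1000)(1.0))
    println(s"Estimated size: ${SizeEstimator.estimate(v)} bytes")

The Storage tab of the web UI also reports the cached size of persisted RDDs/DataFrames.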

Re: executor delay in Spark

2016-04-24 Thread Jeff Zhang
Maybe this is due to the config spark.scheduler.minRegisteredResourcesRatio; you can try setting it to 1 to see the behavior.

    // Submit tasks only after (registered resources / total expected resources)
    // is equal to at least this value, that is a double between 0 and 1.
    var minRegisteredRatio = math.m…
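A minimal sketch of setting this, assuming you want the scheduler to wait for all expected executors before submitting any tasks:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
      // Upper bound on how long to wait before submitting tasks anyway:
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")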

Re: executor delay in Spark

2016-04-24 Thread Mike Hynes
Could you change numPartitions to {16, 32, 64} and run your program for each to see how many partitions are allocated to each worker? Let's see if you experience an all-or-nothing imbalance that way; if so, my guess is that something else is odd in your program logic or Spark runtime environment, but…
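One hedged way to inspect this, assuming some RDD named rdd in a spark-shell session: tag each partition with its index, host name, and record count, then collect the result.

    import java.net.InetAddress

    // For each partition, report where it ran and how many records it holds,
    // making an all-or-nothing imbalance across workers visible.
    val layout = rdd.mapPartitionsWithIndex { (idx, it) =>
      Iterator((idx, InetAddress.getLocalHost.getHostName, it.size))
    }.collect()
    layout.foreach { case (i, host, n) => println(s"partition $i on $host: $n records") }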

Convert DataFrame to Array of Arrays

2016-04-24 Thread Benjamin Kim
I have data in a DataFrame loaded from a CSV file. I need to load this data into HBase using an RDD formatted in a certain way:

    val rdd = sc.parallelize(
      Array(key1,
        (ColumnFamily, ColumnName1, Value1),
        (ColumnFamily, ColumnName2, Value2),
        (…
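One hedged way to get from a DataFrame to that shape, assuming the first column holds the row key, the remaining columns become (family, qualifier, value) triples, and the column family name "cf" is a hypothetical placeholder:

    // Capture the qualifier names on the driver; a DataFrame itself is not
    // serializable, so don't reference df inside the closure.
    val cols = df.columns.drop(1)
    val rdd = df.rdd.map { row =>
      val key = row.getString(0)
      val cells = cols.zipWithIndex.map { case (col, i) =>
        ("cf", col, row.get(i + 1).toString)
      }
      (key, cells)
    }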

Re: executor delay in Spark

2016-04-24 Thread Raghava Mutharaju
Mike, all, it turns out that the second time we encountered the uneven-partition issue, it was not due to spark-submit. It was resolved by changing the placement of count(). Case-1:

    val numPartitions = 8
    // read uAxioms from HDFS, use hash partitioner on it and persist it
    // read type1Axioms from…
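For context, a hedged sketch of the pattern being described, with hypothetical paths and keys: calling count() immediately after persist() forces the hash-partitioned RDD to be materialized and cached before anything downstream uses it, which can change how partitions are placed.

    import org.apache.spark.HashPartitioner

    val numPartitions = 8
    val uAxioms = sc.textFile("hdfs:///data/uAxioms") // hypothetical path
      .map(line => (line.hashCode, line))
      .partitionBy(new HashPartitioner(numPartitions))
      .persist()
    uAxioms.count() // materialize (and cache) the partitioned RDD up front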

Re: Using aggregate and group by on the Spark Dataset API

2016-04-24 Thread Ted Yu
Have you taken a look at sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala? On Sun, Apr 24, 2016 at 8:18 AM, coder wrote:

    > JavaRDD<Person> prdd = sc.textFile("c:\\fls\\people.txt").map(
    >     new Function<String, Person>() {
    >         public Person call(String line) throws Exception {
    >…

Using aggregate and group by on the Spark Dataset API

2016-04-24 Thread coder
    JavaRDD<Person> prdd = sc.textFile("c:\\fls\\people.txt").map(
        new Function<String, Person>() {
            public Person call(String line) throws Exception {
                String[] parts = line.split(",");
                Person person = new Person();
                person.setName(parts[0]);…
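For reference, a hedged Scala sketch of the kind of grouping the thread is after, on the 1.6-era Dataset API; the Person case class and rows are hypothetical stand-ins for people.txt, and sqlContext is assumed in scope as in spark-shell:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    case class Person(name: String, age: Long)

    val ds = Seq(Person("a", 30), Person("b", 30), Person("c", 40)).toDS()
    // Group by age and count names per group via the DataFrame-style agg.
    ds.toDF().groupBy("age").agg(count("name").as("n")).show()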

Re: Spark Streaming Job gets killed after running for about 1 hour

2016-04-24 Thread Rodrick Brown
We have similar jobs consuming from Kafka and writing to Elasticsearch, and the culprit is usually not enough memory for the executor or driver, or not enough executors in general to process the job. Try using dynamic allocation if you're not too sure about how many cores/executors you actually ne…
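A hedged sketch of the dynamic-allocation settings being suggested; the bounds are hypothetical, and the external shuffle service must be running on the workers:

    import org.apache.spark.SparkConf

    // Let Spark scale the executor count up and down with load.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")  // hypothetical bounds
      .set("spark.dynamicAllocation.maxExecutors", "20")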

Re: Save DataFrame to HBase

2016-04-24 Thread Benjamin Kim
Hi Daniel, How did you get the Phoenix plugin to work? I have CDH 5.5.2 installed, which comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work? Thanks, Ben > On Apr 24, 2016, at 1:43 AM, Daniel Haviv wrote: > > Hi, > I tried saving a DF to HBase using a Hive table with HBase s…

RE: How did this unit test pass on master trunk?

2016-04-24 Thread Yong Zhang
So in that case the result will be the following: [1,[1,1]] [3,[3,1]] [2,[2,1]]. Thanks for explaining what it means. But the question is: how can first() be [3,[1,1]]? In fact, if there were any ordering in the final result, it would be [1,[1,1]] instead of [3,[1,1]], correct? Yong S…

Re: Spark Streaming Job gets killed after running for about 1 hour

2016-04-24 Thread Ted Yu
Which version of Spark are you using? How did you increase the open file limit? Which operating system do you use? Please see Example 6, ulimit Settings on Ubuntu, under: http://hbase.apache.org/book.html#basic.prerequisites On Sun, Apr 24, 2016 at 2:34 AM, fanooos wrote: > I have a spark s…

Fwd: Saving large text file

2016-04-24 Thread Simon Hafner
2016-04-24 13:38 GMT+02:00 Stefan Falk:

    > sc.parallelize(cfile.toString()
    >   .split("\n"), 1)

Try `sc.textFile(pathToFile)` instead.

    > java.io.IOException: Broken pipe
    >   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    >   at sun.nio.ch.SocketDispatcher.write(SocketD…
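The suggested fix, sketched with hypothetical paths: let Spark read the file from distributed storage instead of materializing a multi-GB string on the driver and shipping it through parallelize.

    // Read and write the file in a distributed way; paths are hypothetical.
    val lines = sc.textFile("hdfs:///input/big-file.txt")
    lines.saveAsTextFile("hdfs:///output/big-file.cs/data")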

Saving large text file

2016-04-24 Thread Stefan Falk
I am trying to save a large text file of approx. 5 GB:

    sc.parallelize(cfile.toString()
      .split("\n"), 1)
      .saveAsTextFile(new Path(path + ".cs", "data").toUri.toString)

but I keep getting:

    java.io.IOException: Broken pipe
      at sun.nio.ch.FileDispatcherImpl.write0(Native Method)…

Spark Streaming Job gets killed after running for about 1 hour

2016-04-24 Thread fanooos
I have a Spark streaming job that reads a tweet stream from GNIP and writes it to Kafka. Spark and Kafka are running on the same cluster. My cluster consists of 5 nodes: Kafka-b01 ... Kafka-b05. The Spark master is running on Kafka-b05. Here is how we submit the Spark job: *nohup sh $SPZRK_HOME/bin/s…

Re: Save DataFrame to HBase

2016-04-24 Thread Daniel Haviv
Hi, I tried saving a DF to HBase using a Hive table with the HBase storage handler and HiveContext, but it failed due to a bug. I was able to persist the DF to HBase using Apache Phoenix, which was pretty simple. Thank you. Daniel > On 21 Apr 2016, at 16:52, Benjamin Kim wrote: > > Has anyone found…
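For reference, a hedged sketch of what a phoenix-spark save typically looks like; the table name and ZooKeeper URL are hypothetical, and the Phoenix table must already exist:

    // Write a DataFrame through the phoenix-spark connector.
    df.write
      .format("org.apache.phoenix.spark")
      .mode("overwrite")               // phoenix-spark requires overwrite mode
      .option("table", "OUTPUT_TABLE") // hypothetical table
      .option("zkUrl", "zkhost:2181")  // hypothetical quorum
      .save()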