Re: Spark Streaming Job gets killed after running for about 1 hour

2016-04-24 Thread أنس الليثي
I am using the latest Spark version, 1.6. I have increased the maximum number of open files using this command: *sysctl -w fs.file-max=3275782*. I also increased the limit for the user who runs the Spark job by updating the /etc/security/limits.conf file. The soft limit is 1024 and the hard limit is 65536. T…
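A minimal sketch of the limits.conf entries being described, with a hypothetical username for the account that runs the job; "nofile" is the per-user open-file limit, and note that the soft limit of 1024 quoted above is far below the hard limit:

    # /etc/security/limits.conf -- hypothetical entries
    # <domain>   <type>  <item>   <value>
    sparkuser    soft    nofile   65536
    sparkuser    hard    nofile   65536

A process only gets the soft limit by default, so a soft limit of 1024 can still exhaust file handles even when the hard limit is 65536.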

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Praveen Devarao
Thanks Reynold for the reason as to why sortByKey invokes a job. When you say "DataFrame/Dataset does not have this issue", is it right to assume you are referring to Spark 2.0, or does the Spark 1.6 DF already have this built in? Thanking You --

Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Praveen Devarao
Hi, I have a streaming program with the block as below [ref: https://github.com/agsachin/streamingBenchmark/blob/master/spark-benchmarks/src/main/scala/TwitterStreaming.scala ]:

    1 val lines = messages.map(_._2)
    2 val hashTags = lines.flatMap(status => status.split(" ").filter(_.startsWit…

Re: Do transformation functions on RDD invoke a Job [sc.runJob]?

2016-04-24 Thread Reynold Xin
Usually no - but sortByKey does, because it needs the range boundaries to be built in order to have the RDD. It is a long-standing problem that's unfortunately very difficult to solve without breaking the RDD API. In DataFrame/Dataset we don't have this issue, though. On Sun, Apr 24, 2016 at 10:54 P…
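A minimal Scala sketch of the behavior described, assuming a spark-shell session: merely defining the sorted RDD triggers a sampling job, because RangePartitioner must scan the keys to compute range boundaries, while the real sort only runs at the action.

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
    // This line alone launches a sampling job in the UI: RangePartitioner
    // collects a sample of the keys to choose range boundaries.
    val sorted = pairs.sortByKey()
    // Only now does the actual sort job run.
    sorted.collect().foreach(println)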

[Spark 1.5.2] All data being written to only one part file; the rest of the part files are empty

2016-04-24 Thread Divya Gehlot
Hi, after joining two DataFrames I am saving the result using Spark CSV, but all the result data is being written to only one part file: 200 part files are created, and the remaining 199 are empty. What is the cause of the uneven partitioning? How can I evenly distribute the data? Would…
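For context, the 200 part files come from the default spark.sql.shuffle.partitions = 200; all rows landing in one of them suggests the join keys hash to a single partition. A hedged sketch of a common workaround, assuming hypothetical df1/df2, join column, and output path: repartition before writing so rows are spread across the output files.

    // Redistribute rows across 8 partitions before writing (spark-csv package).
    val joined = df1.join(df2, df1("id") === df2("id")) // "id" is hypothetical
    joined.repartition(8)
      .write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///tmp/joined-output")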

Re: Profiling memory use and access

2016-04-24 Thread Takeshi Yamamuro
Hi, you can use YourKit to profile workloads; please see: https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit // maropu On Mon, Apr 25, 2016 at 10:24 AM, Edmon Begoli wrote: > I am working on experimental research into memory use and profiling of…
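A hedged sketch of the kind of wiring the linked wiki describes: point the executor JVM options at the YourKit agent. The agent path is hypothetical and depends on where YourKit is installed.

    import org.apache.spark.SparkConf

    // Attach the YourKit profiling agent to every executor JVM.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so")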

Profiling memory use and access

2016-04-24 Thread Edmon Begoli
I am working on experimental research into memory use, profiling memory use and allocation by machine learning functions across a number of popular libraries. Is there a facility within Spark, and MLlib specifically, to track the allocation and use of data frames/memory by MLlib? Please adv…
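One rough building block Spark does expose is org.apache.spark.util.SizeEstimator, which walks an object graph and estimates its heap footprint. A minimal sketch, assuming a spark-shell session; the numbers are estimates, not exact allocations:

    import org.apache.spark.util.SizeEstimator
    import org.apache.spark.mllib.linalg.Vectors

    // Approximate the in-memory size of an MLlib dense vector.
    val v = Vectors.dense(Array.fill(1000)(1.0))
    println(s"Estimated size: ${SizeEstimator.estimate(v)} bytes")

The Storage tab of the web UI also reports the cached size of persisted RDDs/DataFrames.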

Re: executor delay in Spark

2016-04-24 Thread Jeff Zhang
Maybe this is due to the config spark.scheduler.minRegisteredResourcesRatio; you can try setting it to 1 to see the behavior.

    // Submit tasks only after (registered resources / total expected resources)
    // is equal to at least this value, that is a double between 0 and 1.
    var minRegisteredRatio = math.m…
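A minimal sketch of setting this, assuming you want the scheduler to wait for all expected executors before submitting any tasks:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
      // Upper bound on how long to wait before submitting tasks anyway:
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")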

Re: executor delay in Spark

2016-04-24 Thread Mike Hynes
Could you change numPartitions to {16, 32, 64} and run your program for each to see how many partitions are allocated to each worker? Let's see if you experience an all-or-nothing imbalance that way; if so, my guess is that something else is odd in your program logic or Spark runtime environment, but…
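One hedged way to inspect this, assuming some RDD named rdd in a spark-shell session: tag each partition with its index, host name, and record count, then collect the result.

    import java.net.InetAddress

    // For each partition, report where it ran and how many records it holds,
    // making an all-or-nothing imbalance across workers visible.
    val layout = rdd.mapPartitionsWithIndex { (idx, it) =>
      Iterator((idx, InetAddress.getLocalHost.getHostName, it.size))
    }.collect()
    layout.foreach { case (i, host, n) => println(s"partition $i on $host: $n records") }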

Convert DataFrame to Array of Arrays

2016-04-24 Thread Benjamin Kim
I have data in a DataFrame loaded from a CSV file. I need to load this data into HBase using an RDD formatted in a certain way:

    val rdd = sc.parallelize(
      Array(key1,
        (ColumnFamily, ColumnName1, Value1),
        (ColumnFamily, ColumnName2, Value2),
        (…
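One hedged way to get from a DataFrame to that shape, assuming the first column holds the row key, the remaining columns become (family, qualifier, value) triples, and the column family name "cf" is a hypothetical placeholder:

    // Capture the qualifier names on the driver; a DataFrame itself is not
    // serializable, so don't reference df inside the closure.
    val cols = df.columns.drop(1)
    val rdd = df.rdd.map { row =>
      val key = row.getString(0)
      val cells = cols.zipWithIndex.map { case (col, i) =>
        ("cf", col, row.get(i + 1).toString)
      }
      (key, cells)
    }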

Re: executor delay in Spark

2016-04-24 Thread Raghava Mutharaju
Mike, all, it turns out that the second time we encountered the uneven-partition issue, it was not due to spark-submit. It was resolved by changing the placement of count(). Case-1:

    val numPartitions = 8
    // read uAxioms from HDFS, use hash partitioner on it and persist it
    // read type1Axioms from…
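For context, a hedged sketch of the pattern being described, with hypothetical paths and keys: calling count() immediately after persist() forces the hash-partitioned RDD to be materialized and cached before anything downstream uses it, which can change how partitions are placed.

    import org.apache.spark.HashPartitioner

    val numPartitions = 8
    val uAxioms = sc.textFile("hdfs:///data/uAxioms") // hypothetical path
      .map(line => (line.hashCode, line))
      .partitionBy(new HashPartitioner(numPartitions))
      .persist()
    uAxioms.count() // materialize (and cache) the partitioned RDD up front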

Re: Using aggregate and group by on the Spark Dataset API

2016-04-24 Thread Ted Yu
Have you taken a look at sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala? On Sun, Apr 24, 2016 at 8:18 AM, coder wrote:

    > JavaRDD<Person> prdd = sc.textFile("c:\\fls\\people.txt").map(
    >     new Function<String, Person>() {
    >         public Person call(String line) throws Exception {
    >…

Using aggregate and group by on the Spark Dataset API

2016-04-24 Thread coder
    JavaRDD<Person> prdd = sc.textFile("c:\\fls\\people.txt").map(
        new Function<String, Person>() {
            public Person call(String line) throws Exception {
                String[] parts = line.split(",");
                Person person = new Person();
                person.setName(parts[0]);…
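For reference, a hedged Scala sketch of the kind of grouping the thread is after, on the 1.6-era Dataset API; the Person case class and rows are hypothetical stand-ins for people.txt, and sqlContext is assumed in scope as in spark-shell:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    case class Person(name: String, age: Long)

    val ds = Seq(Person("a", 30), Person("b", 30), Person("c", 40)).toDS()
    // Group by age and count names per group via the DataFrame-style agg.
    ds.toDF().groupBy("age").agg(count("name").as("n")).show()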

Re: Spark Streaming Job gets killed after running for about 1 hour

2016-04-24 Thread Rodrick Brown
We have similar jobs consuming from Kafka and writing to Elasticsearch, and the culprit is usually not enough memory for the executor or driver, or not enough executors in general to process the job. Try using dynamic allocation if you're not too sure about how many cores/executors you actually ne…
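A hedged sketch of the dynamic-allocation settings being suggested; the bounds are hypothetical, and the external shuffle service must be running on the workers:

    import org.apache.spark.SparkConf

    // Let Spark scale the executor count up and down with load.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")  // hypothetical bounds
      .set("spark.dynamicAllocation.maxExecutors", "20")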

Re: Save DataFrame to HBase

2016-04-24 Thread Benjamin Kim
Hi Daniel, How did you get the Phoenix plugin to work? I have CDH 5.5.2 installed, which comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work? Thanks, Ben > On Apr 24, 2016, at 1:43 AM, Daniel Haviv wrote: > > Hi, > I tried saving a DF to HBase using a Hive table with HBase s…

RE: How did this unit test pass on master trunk?

2016-04-24 Thread Yong Zhang
So in that case the result will be the following: [1,[1,1]] [3,[3,1]] [2,[2,1]]. Thanks for explaining what it means. But the question is: how can first() be [3,[1,1]]? In fact, if there were any ordering in the final result, it would be [1,[1,1]] instead of [3,[1,1]], correct? Yong S…

Re: Spark Streaming Job gets killed after running for about 1 hour

2016-04-24 Thread Ted Yu
Which version of Spark are you using? How did you increase the open file limit? Which operating system do you use? Please see Example 6, ulimit Settings on Ubuntu, under: http://hbase.apache.org/book.html#basic.prerequisites On Sun, Apr 24, 2016 at 2:34 AM, fanooos wrote: > I have a spark s…

Fwd: Saving large text file

2016-04-24 Thread Simon Hafner
2016-04-24 13:38 GMT+02:00 Stefan Falk:

    > sc.parallelize(cfile.toString()
    >   .split("\n"), 1)

Try `sc.textFile(pathToFile)` instead.

    > java.io.IOException: Broken pipe
    >   at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    >   at sun.nio.ch.SocketDispatcher.write(SocketD…
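The suggested fix, sketched with hypothetical paths: let Spark read the file from distributed storage instead of materializing a multi-GB string on the driver and shipping it through parallelize.

    // Read and write the file in a distributed way; paths are hypothetical.
    val lines = sc.textFile("hdfs:///input/big-file.txt")
    lines.saveAsTextFile("hdfs:///output/big-file.cs/data")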

Saving large text file

2016-04-24 Thread Stefan Falk
I am trying to save a large text file of approx. 5 GB:

    sc.parallelize(cfile.toString()
      .split("\n"), 1)
      .saveAsTextFile(new Path(path + ".cs", "data").toUri.toString)

but I keep getting:

    java.io.IOException: Broken pipe
      at sun.nio.ch.FileDispatcherImpl.write0(Native Method)…

Spark Streaming Job gets killed after running for about 1 hour

2016-04-24 Thread fanooos
I have a Spark streaming job that reads a tweet stream from GNIP and writes it to Kafka. Spark and Kafka are running on the same cluster. My cluster consists of 5 nodes: Kafka-b01 ... Kafka-b05. The Spark master is running on Kafka-b05. Here is how we submit the Spark job: *nohup sh $SPZRK_HOME/bin/s…

Re: Save DataFrame to HBase

2016-04-24 Thread Daniel Haviv
Hi, I tried saving a DF to HBase using a Hive table with the HBase storage handler and HiveContext, but it failed due to a bug. I was able to persist the DF to HBase using Apache Phoenix, which was pretty simple. Thank you. Daniel > On 21 Apr 2016, at 16:52, Benjamin Kim wrote: > > Has anyone found…
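For reference, a hedged sketch of what a phoenix-spark save typically looks like; the table name and ZooKeeper URL are hypothetical, and the Phoenix table must already exist:

    // Write a DataFrame through the phoenix-spark connector.
    df.write
      .format("org.apache.phoenix.spark")
      .mode("overwrite")               // phoenix-spark requires overwrite mode
      .option("table", "OUTPUT_TABLE") // hypothetical table
      .option("zkUrl", "zkhost:2181")  // hypothetical quorum
      .save()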