Re: Dataframe fails for large resultsize

2016-04-29 Thread Buntu Dev
Thanks Krishna, but I believe the memory consumed on the executors is being exhausted in my case. I've allocated the max 10g that I can to both the driver and executors. Are there any alternative solutions for fetching the top 1M rows after ordering the dataset? Thanks! On Fri, Apr 29, 2016 at
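One alternative, sketched here as an assumption-laden example rather than anything confirmed in the thread, is to avoid bringing the ordered top-N to the driver at all and instead write it back to storage (the output path is a placeholder; sqlContext is assumed to exist as in spark-shell):

    // Write the top 1M ordered rows out instead of collecting them, so the
    // driver never has to hold the full result. Note the final limit may
    // still be evaluated on a single executor.
    val topRows = sqlContext.sql("select * from tbl order by c1 limit 1000000")
    topRows.write.parquet("/tmp/top_1m_rows")   // placeholder path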

Re: Dataframe fails for large resultsize

2016-04-29 Thread Krishna
I recently encountered similar network related errors and was able to fix them by applying the ethtool updates described here [ https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-5085] On Friday, April 29, 2016, Buntu Dev wrote: > Just to provide more details, I

Re: Timeout when submitting an application to a remote Spark Standalone master

2016-04-29 Thread Richard Han
So connecting to the cluster (port 7077) works. That is to say, the Spark Context is created. The timeout occurs on the worker side when I run any command with .collect(). The client (local machine) basically waits forever. I'm wondering if maybe I'm not understanding the architecture correctly
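One frequent cause of exactly this symptom is that the executors cannot open connections back to the driver on the local machine, since the results of collect() flow from executors to the driver. A minimal sketch of pinning the driver's advertised address and ports so they can be opened in the security group (the address and port numbers are placeholders, not values from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://ip:7077")
      .setAppName("remote-submit-test")
      .set("spark.driver.host", "my.public.address")   // address reachable from the workers
      .set("spark.driver.port", "7001")                // fixed so it can be opened in firewalls
      .set("spark.blockManager.port", "7003")
    val sc = new SparkContext(conf)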

Re: Timeout when submitting an application to a remote Spark Standalone master

2016-04-29 Thread Femi Anthony
Have you tried connecting to port 7077 on the cluster from your local machine to see if it works OK? Sent from my iPhone > On Apr 29, 2016, at 5:58 PM, Richard Han wrote: > > I have an EC2 installation of Spark Standalone Master/Worker set up. The two > can talk to

Timeout when submitting an application to a remote Spark Standalone master

2016-04-29 Thread Richard Han
I have an EC2 installation of Spark Standalone Master/Worker set up. The two can talk to one another, and all ports are open in the security group (just to make sure it isn't the port). When I run spark-shell on the master node (setting it to --master spark://ip:7077) it runs everything correctly.

DropDuplicates Behavior

2016-04-29 Thread Allen George
I'd like to echo a question that was asked earlier this year: if we do a global sort of a dataframe (with two columns: col_1, col_2) by (col_1, col_2 desc) and then dropDuplicates on col_1, will it retain the first row of each sorted group? i.e. will it return the row with the greatest value of
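dropDuplicates does not document which row in each group it keeps, so a more explicit pattern, sketched here assuming a DataFrame df with columns col_1 and col_2 (on Spark 1.x window functions require a HiveContext), is to rank rows per key and keep the first:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Deterministically keep the row with the greatest col_2 per col_1,
    // instead of relying on a prior global sort surviving dropDuplicates.
    val w = Window.partitionBy(col("col_1")).orderBy(col("col_2").desc)
    val firstPerGroup = df.withColumn("rn", row_number().over(w))
      .where(col("rn") === 1)
      .drop("rn")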

Re: Dataframe fails for large resultsize

2016-04-29 Thread Buntu Dev
Just to provide more details, I have 200 blocks (parquet files) with an average block size of 70M. Limiting the result set to 100k ("select * from tbl order by c1 limit 10") works, but when increasing it to, say, 1M I keep running into this error: Connection reset by peer: socket write error I

how to orderBy previous groupBy.count.orderBy

2016-04-29 Thread Brent S. Elmer Ph.D.
I have the following simple example that I can't get to work correctly.

In [1]:
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import asc, desc, sum, count
sqlContext = SQLContext(sc)
error_schema
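For the general pattern the subject asks about, ordering the output of a groupBy().count(), a minimal sketch follows; it is written in Scala (the pyspark calls share the same names) and assumes a DataFrame errorsDf with an error_code column, which are placeholders rather than names from the original code:

    import org.apache.spark.sql.functions.desc

    // groupBy().count() yields a column literally named "count";
    // order the aggregated result by that column, descending.
    val counts  = errorsDf.groupBy("error_code").count()
    val ordered = counts.orderBy(desc("count"))
    ordered.show()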

Re: Various Apache Spark's deployment problems

2016-04-29 Thread Robineast
Do you need 2 --num-executors? Sent from my iPhone > On 29 Apr 2016, at 20:25, Ashish Sharma [via Apache Spark User List] > wrote: > > Submit Command1: > > spark-submit --class working.path.to.Main \ > --master yarn \ >

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-29 Thread swetha kasireddy
OK. Thanks Cody! On Fri, Apr 29, 2016 at 12:41 PM, Cody Koeninger wrote: > If worker to broker communication breaks down, the worker will sleep > for refresh.leader.backoff.ms before throwing an error, at which point > normal spark task retry (spark.task.maxFailures) comes

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-29 Thread Cody Koeninger
If worker to broker communication breaks down, the worker will sleep for refresh.leader.backoff.ms before throwing an error, at which point normal spark task retry (spark.task.maxFailures) comes into play. If driver to broker communication breaks down, the driver will sleep for
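For reference, the settings mentioned above live in two different places: the Kafka options go into the kafkaParams map handed to the direct stream, while the retry count is a Spark property. A hedged sketch against the Spark 1.x direct API, with illustrative broker, topic, and values only:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf()
      .setAppName("kafka-direct-example")
      .set("spark.task.maxFailures", "8")            // normal Spark task retry
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map(
      "metadata.broker.list"      -> "broker1:9092", // placeholder broker list
      "refresh.leader.backoff.ms" -> "2000"          // sleep before erroring on leader loss
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))              // placeholder topic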

Re: Sanely updating parquet partitions.

2016-04-29 Thread Philip Weaver
Hello, I think I may have jumped to the wrong conclusion about symlinks, and I was able to get what I want working perfectly. I added these two settings in my importer application: sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Re: aggregateByKey - external combine function

2016-04-29 Thread Nirav Patel
Any thoughts? I can explain more on the problem, but basically the shuffle data doesn't seem to fit in reducer memory (32GB) and I am looking for ways to process it on disk+memory. Thanks On Thu, Apr 28, 2016 at 10:07 AM, Nirav Patel wrote: > Hi, > > I tried to convert a
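One thing that is cheap to try when shuffled aggregation data does not fit in reducer memory is to raise the number of partitions for the aggregation itself, so each reducer handles a smaller slice, and to use a disk-backed storage level for the result. A sketch with an illustrative sum-per-key aggregation standing in for the real seqOp/combOp (the partition count and data are placeholders; sc is assumed):

    import org.apache.spark.storage.StorageLevel

    val pairRdd = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))  // placeholder data
    // 2000 partitions is illustrative; more partitions means less shuffle data per reducer.
    val aggregated = pairRdd.aggregateByKey(0L, 2000)((acc: Long, v: Long) => acc + v, _ + _)
    aggregated.persist(StorageLevel.MEMORY_AND_DISK_SER)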

Sanely updating parquet partitions.

2016-04-29 Thread Philip Weaver
Hello, I have a parquet dataset, partitioned by a column 'a'. I want to take advantage of Spark SQL's ability to filter to the partition when you filter on 'a'. I also want to periodically update individual partitions without disrupting any jobs that are querying the data. The obvious solution
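For the partition-pruning half of this, a minimal sketch of writing partitioned by 'a' and reading back with a filter that only scans the matching partition directory (the paths and the value "foo" are placeholders; df and sqlContext are assumed):

    import org.apache.spark.sql.functions.col

    // Each distinct value of 'a' becomes a subdirectory like .../a=foo/
    df.write.partitionBy("a").parquet("/data/mytable")

    // Filtering on 'a' lets Spark SQL prune to that partition directory.
    val part = sqlContext.read.parquet("/data/mytable").where(col("a") === "foo")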

Re: (YARN CLUSTER MODE) Where to find logs within Spark RDD processing function ?

2016-04-29 Thread nguyen duc tuan
What does the WebUI show? What do you see when you click on the "stderr" and "stdout" links? These links must contain the stdout and stderr for each executor. About your custom logging in the executor, are you sure you checked "${spark.yarn.app.container.log.dir}/spark-app.log"? The actual location of this

Re: Spark support for Complex Event Processing (CEP)

2016-04-29 Thread Michael Segel
If you’re getting the logs, then it really isn’t CEP unless you consider the event to be the log from the bus. This doesn’t sound like there is a time constraint. Your bus schedule is fairly fixed and changes infrequently. Your bus stops are relatively fixed points. (Within a couple of

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-29 Thread swetha kasireddy
OK. So on the Kafka side they use rebalance.backoff.ms of 2000, which is the default for rebalancing, and they say that a refresh.leader.backoff.ms of 200 to refresh the leader is very aggressive, and suggested we increase it to 2000. Even after increasing it to 2500 I still get Leader Lost Errors. Is

Bit(N) on create Table with MSSQLServer

2016-04-29 Thread Andrés Ivaldi
Hello, Spark is executing a CREATE TABLE statement (using JDBC) against MSSQLServer with a column type mapping like ColName Bit(1) for boolean types. This CREATE TABLE cannot be executed on MSSQLServer. In class JdbcDialect the mapping for the Boolean type is Bit(1), so the question is, is this a problem
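Until the built-in mapping changes, one possible workaround is to register a custom dialect that overrides only the boolean mapping for SQL Server URLs; this is a sketch against the public JdbcDialects API, not a confirmed fix from the thread:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
    import org.apache.spark.sql.types._

    // Map BooleanType to plain BIT (SQL Server rejects BIT(1)); returning None
    // for everything else falls back to the default mapping.
    object MsSqlServerBitDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT))
        case _           => None
      }
    }
    JdbcDialects.registerDialect(MsSqlServerBitDialect)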

Re: (YARN CLUSTER MODE) Where to find logs within Spark RDD processing function ?

2016-04-29 Thread dev loper
Hi Ted & Nguyen, @Ted, I was under the belief that the log4j.properties file would be taken from the application classpath if a file path is not specified. Please correct me if I am wrong. I tried your approach as well, and still I couldn't find the logs. @nguyen I am running it on a Yarn cluster

Re: (YARN CLUSTER MODE) Where to find logs within Spark RDD processing function ?

2016-04-29 Thread nguyen duc tuan
These are the executor's logs, not the driver logs. To see these log files, you have to go to the executor machines where the tasks are running. To see what you print to stdout or stderr, you can either go to the executor machines directly (it is stored in "stdout" and "stderr" files somewhere in the executor
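Related to the original question: log statements executed inside an RDD function run in the executor JVMs, so they land in those per-executor stderr/stdout container logs rather than in the driver output. A small sketch, assuming log4j on the executor classpath (which Spark ships by default) and a placeholder rdd:

    import org.apache.log4j.Logger

    rdd.foreachPartition { iter =>
      // The logger is created inside the closure, i.e. on the executor, so these
      // messages end up in the executor/container logs, not the driver log.
      val log = Logger.getLogger("MyRddProcessing")
      iter.foreach(record => log.info(s"processing record: $record"))
    }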

(YARN CLUSTER MODE) Where to find logs within Spark RDD processing function ?

2016-04-29 Thread dev loper
Hi Spark Team, I have asked the same question on stack overflow , no luck yet. http://stackoverflow.com/questions/36923949/where-to-find-logs-within-spark-rdd-processing-function-yarn-cluster-mode?noredirect=1#comment61419406_36923949 I am running my Spark Application on Yarn Cluster. No

Re: Spark support for Complex Event Processing (CEP)

2016-04-29 Thread Mich Talebzadeh
OK, like any work you need to start this from a simple model. Take one bus only (identified by bus number, which is unique). For any bus N you have two logs, LOG A and LOG B, plus LOG C from the coordinator, the central computer that sends the estimated time of arrival (ETA) to the bus stops. Pretty simple.

Re: Spark on AWS

2016-04-29 Thread Steve Loughran
On 28 Apr 2016, at 22:59, Alexander Pivovarov wrote: Spark works well with S3 (read and write). However it's recommended to set spark.speculation to true (it's expected that some tasks fail if you read a large S3 folder, so speculation should
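The setting mentioned is an ordinary Spark configuration property; a minimal sketch of enabling it when building the context (everything except spark.speculation=true, including the bucket path, is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Speculatively re-run slow tasks so a straggling S3 read does not stall the job.
    val conf = new SparkConf()
      .setAppName("s3-read-example")
      .set("spark.speculation", "true")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("s3a://my-bucket/some/prefix/*")   // placeholder bucket/prefix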

Re: Spark support for Complex Event Processing (CEP)

2016-04-29 Thread Esa Heikkinen
Hi, I will try to explain my case. The situation in my logs and solution is not so simple. There are many types of logs and they come from many sources. They are in csv format and the header line includes the names of the columns. This is a simplified description of the input logs. LOG A's: bus coordinate logs

Re: Create multiple output files from Thriftserver

2016-04-29 Thread Mich Talebzadeh
Hi, Two points here. 1) Beeline uses a JDBC connection to connect to the Hive server. You can actually put your code in an hql file and run it as beeline -u jdbc:hive2://:10010/default org.apache.hive.jdbc.HiveDriver -n hduser -p x -f mycode.hql 2) How do you want to split your output file? Is

Re: [Spark 1.5.2]All data being written to only one part file rest part files are empty

2016-04-29 Thread Divya Gehlot
Hi, I observed that if I use a subset of the same dataset, or the data set is small, it writes to many part files. If the data set grows, it writes to only one part file and the rest of the part files are empty. Thanks, Divya On 25 April 2016 at 23:15, nguyen duc tuan wrote: > Maybe the problem is the
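When only one part file ends up with data, it usually reflects how the rows are distributed across partitions at write time; a hedged sketch of forcing an even spread before writing (the partition count and output path are placeholders, and df is assumed):

    // Redistribute rows across 50 partitions before writing so each partition
    // (and hence each part file) receives a share of the data. 50 is illustrative.
    val evenlySpread = df.repartition(50)
    evenlySpread.write.parquet("/output/path")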