Extremely slow shuffle writes and large job time fluctuations

2016-07-19 Thread Jon Chase
I'm running into an issue with a pyspark job where I'm sometimes seeing extremely variable job times (20 min to 2 hr) and very long shuffle times (e.g. ~2 minutes for 18KB/86 records). Cluster setup is Amazon EMR 4.4.0, Spark 1.6.0, an m4.2xlarge driver and a single m4.10xlarge (40 vCPU, 160GB)

Re: About memory leak in spark 1.4.1

2015-09-28 Thread Jon Chase
I'm seeing a similar (same?) problem on Spark 1.4.1 running on Yarn (Amazon EMR, Java 8). I'm running a Spark Streaming app 24/7 and system memory eventually gets exhausted after about 3 days and the JVM process dies with: # # There is insufficient memory for the Java Runtime Environment to

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I should note that the amount of data in each batch is very small, so I'm not concerned with performance implications of grouping into a single RDD. On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase jon.ch...@gmail.com wrote: I'm currently doing something like this in my Spark Streaming program (Java

Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
I'm currently doing something like this in my Spark Streaming program (Java): dStream.foreachRDD((rdd, batchTime) -> { log.info("processing RDD from batch {}", batchTime); // my rdd processing code }); Instead of having my
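A minimal, runnable version of the pattern quoted above, written against the Spark 1.x Java Streaming API; the socket source, batch interval, and per-batch body are illustrative placeholders, not from the original post:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ForeachRddSketch {
    private static final Logger log = LoggerFactory.getLogger(ForeachRddSketch.class);

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("foreachRDD sketch").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Placeholder source; any DStream behaves the same way here.
        JavaDStream<String> dStream = jssc.socketTextStream("localhost", 9999);

        // Spark hands this callback exactly one RDD per batch interval.
        // In the 1.x Java API the function must return Void, hence "return null".
        dStream.foreachRDD((rdd, batchTime) -> {
            log.info("processing RDD from batch {}", batchTime);
            log.info("batch {} contained {} records", batchTime, rdd.count());
            return null;
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```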

Re: RDD collect hangs on large input data

2015-04-07 Thread Jon Chase
Zsolt - what version of Java are you running? On Mon, Mar 30, 2015 at 7:12 AM, Zsolt Tóth toth.zsolt@gmail.com wrote: Thanks for your answer! I don't call .collect because I want to trigger the execution. I call it because I need the rdd on the driver. This is not a huge RDD and it's not

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
(), which always assumes the underlying SQL array is represented by a Scala Seq. Would you mind opening a JIRA ticket for this? Thanks! Cheng On 3/27/15 7:00 PM, Jon Chase wrote: Spark 1.3.0 Two issues: a) I'm unable to get a lateral view explode query to work on an array type b) I'm

Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
Spark 1.3.0. Two issues: a) I'm unable to get a lateral view explode query to work on an array type, and b) I'm unable to save an array type to a Parquet file. I keep running into this: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq. Here's a stack trace from the
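For reference, a small self-contained sketch (not the poster's code; the input file, column names, and output path are invented) of the two operations described: a LATERAL VIEW explode over an array column via a HiveContext, and saving that column with the Spark 1.3-era saveAsParquetFile API:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class ArrayExplodeSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("array explode sketch").setMaster("local[2]"));
        // LATERAL VIEW is HiveQL syntax, so a HiveContext is used (requires the spark-hive artifact).
        HiveContext sqlContext = new HiveContext(sc.sc());

        // Hypothetical input: JSON lines such as {"id": 1, "vals": [10, 20, 30]},
        // giving a DataFrame with an array column named "vals".
        DataFrame df = sqlContext.jsonFile("/tmp/events.json");
        df.registerTempTable("t");

        // (a) lateral view explode over the array column
        sqlContext.sql("SELECT id, v FROM t LATERAL VIEW explode(vals) x AS v").show();

        // (b) save the array column to Parquet via the 1.3-era API
        df.saveAsParquetFile("/tmp/events-copy.parquet");
    }
}
```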

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
you mind also providing the full stack trace of the exception thrown in the saveAsParquetFile call? Thanks! Cheng On 3/27/15 7:35 PM, Jon Chase wrote: https://issues.apache.org/jira/browse/SPARK-6570 I also left in the call to saveAsParquetFile(), as it produced a similar exception

Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
Spark 1.3.0, Parquet. I'm having trouble referencing partition columns in my queries. In the following example, 'probeTypeId' is a partition column, and the directory structure looks like this: /mydata /probeTypeId=1 ...files... /probeTypeId=2 ...files... I see
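A short sketch of loading a directory-partitioned Parquet table like the one described and filtering on the partition column (the path and 'probeTypeId' come from the post; the rest is illustrative):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PartitionedParquetSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("partitioned parquet sketch").setMaster("local[2]"));
        SQLContext sqlContext = new SQLContext(sc);

        // Spark 1.3 partition discovery: the probeTypeId=N subdirectories become
        // a 'probeTypeId' column in the resulting DataFrame.
        DataFrame df = sqlContext.parquetFile("/mydata");
        df.registerTempTable("mydata");

        sqlContext.sql("SELECT count(*) FROM mydata WHERE probeTypeId = 1").show();
    }
}
```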

Re: Column not found in schema when querying partitioned table

2015-03-26 Thread Jon Chase
I've filed this as https://issues.apache.org/jira/browse/SPARK-6554 On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase jon.ch...@gmail.com wrote: Spark 1.3.0, Parquet I'm having trouble referencing partition columns in my queries. In the following example, 'probeTypeId' is a partition column

Spark SQL queries hang forever

2015-03-26 Thread Jon Chase
Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8 vCPU, 30GB), executor memory 20GB, driver memory 10GB. I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread out over roughly 2,000 Parquet files, and my queries frequently hang. Simple queries like select count(*) from
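A sketch of the spark-shell invocation implied by those settings (memory values from the post; the yarn-client master form is an assumption for Spark 1.3 on EMR):

```bash
spark-shell \
  --master yarn-client \
  --driver-memory 10G \
  --executor-memory 20G
```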

Re: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-24 Thread Jon Chase
Shahab - This should do the trick until Hao's changes are out: sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'"); sqlContext.sql("select foobar(some_column) from some_table"); This works without requiring you to 'deploy' a JAR with the UDAF in it - just make sure the UDAF
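The same two statements as a runnable sketch against a HiveContext (the UDAF class, column, and table names are the placeholders from the post; the class just needs to be on the application classpath):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class RegisterUdafSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("register UDAF sketch").setMaster("local[2]"));
        HiveContext sqlContext = new HiveContext(sc.sc());

        // Register the Hive UDAF under a temporary function name for this session.
        sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'");

        // Use it like any built-in aggregate (assumes some_table is already registered).
        sqlContext.sql("select foobar(some_column) from some_table").show();
    }
}
```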

Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jon Chase
I've had a lot of difficulty using the s3:// prefix; s3n:// seems to work much better. Can't find the link at the moment, but I seem to recall that s3:// (Hadoop's original block format for S3) is no longer recommended for use. Amazon's EMR goes so far as to remap the s3:// to s3n:// behind the
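A tiny sketch of the scheme swap being suggested (the bucket and key prefixes are placeholders):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3SchemeSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("s3n scheme sketch"));

        // s3n:// (the S3 native filesystem) instead of the older block-based s3:// scheme.
        // Assumes AWS credentials are already configured for the cluster.
        JavaRDD<String> lines = sc.textFile("s3n://my-bucket/input/");
        lines.saveAsTextFile("s3n://my-bucket/output/");
    }
}
```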

Re: JavaRDD (Data Aggregation) based on key

2014-12-23 Thread Jon Chase
Have a look at RDD.groupBy(...) and reduceByKey(...). On Tue, Dec 23, 2014 at 4:47 AM, sachin Singh sachin.sha...@gmail.com wrote: Hi, I have a CSV file with fields a, b, c. I want to do aggregation (sum, average, ...) on any field (a, b, or c) as per user input, using Apache Spark Java
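A small sketch of the reduceByKey approach suggested above (the field layout and the sum aggregation are illustrative, not from the original thread):

```java
import java.util.Arrays;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CsvAggregateSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("csv aggregate sketch").setMaster("local[2]"));

        // Stand-in for the CSV: lines of the form "a,b,c"; here we key by field a and sum field c.
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("x,1,10", "y,2,20", "x,3,30"));

        JavaPairRDD<String, Double> sums = lines
            .mapToPair(line -> {
                String[] f = line.split(",");
                return new Tuple2<String, Double>(f[0], Double.parseDouble(f[2]));
            })
            .reduceByKey((a, b) -> a + b);

        sums.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
    }
}
```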

Re: S3 files , Spark job hungsup

2014-12-23 Thread Jon Chase
http://www.jets3t.org/toolkit/configuration.html Put the following properties in a file named jets3t.properties and make sure it is available while your Spark job runs (just place it in ~/ and pass a reference to it when calling spark-submit with --files ~/jets3t.properties)
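A sketch of shipping the file with the job (the jar and class are placeholders; the actual property names and values in jets3t.properties should be taken from the configuration page linked above):

```bash
# --files distributes jets3t.properties to the working directory of the driver and executors.
spark-submit \
  --files ~/jets3t.properties \
  --class com.example.MyS3Job \
  my-s3-job.jar
```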

Re: Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-23 Thread Jon Chase
/ On Thu, Dec 18, 2014 at 5:56 AM, Jon Chase jon.ch...@gmail.com wrote: I'm running a very simple Spark application that downloads files from S3, does a bit of mapping, then uploads new files. Each file is roughly 2MB and is gzip'd. I was running the same code on Amazon's EMR w/Spark

Re: Fetch Failure

2014-12-19 Thread Jon Chase
I'm getting the same error (ExecutorLostFailure) - input RDD is 100k small files (~2MB each). I do a simple map, then keyBy(), and then rdd.saveAsHadoopDataset(...). Depending on the memory settings given to spark-submit, the time before the first ExecutorLostFailure varies (more memory ==

Re: Fetch Failure

2014-12-19 Thread Jon Chase
take up some memory beyond their heap size. -Sandy On Dec 19, 2014, at 9:04 AM, Jon Chase jon.ch...@gmail.com wrote: I'm getting the same error (ExecutorLostFailure) - input RDD is 100k small files (~2MB each). I do a simple map, then keyBy(), and then rdd.saveAsHadoopDataset(...). Depending

Re: Fetch Failure

2014-12-19 Thread Jon Chase
7392K, reserved 1048576K On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote: I'm actually already running 1.1.1. I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no luck. Still getting ExecutorLostFailure (executor lost). On Fri, Dec 19, 2014
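For context, a sketch of how the overhead setting mentioned in this thread is passed (the heap size, class, and jar are placeholders):

```bash
# Increase the off-heap allowance YARN adds on top of the executor heap (value in MB).
spark-submit \
  --master yarn \
  --executor-memory 8G \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --class com.example.MyJob \
  my-job.jar
```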

Re: Fetch Failure

2014-12-19 Thread Jon Chase
Yes, same problem. On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Do you hit the same errors? Is it now saying your containers are exceeding ~10 GB? On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote: I'm actually already running 1.1.1. I

Yarn not running as many executors as I'd like

2014-12-19 Thread Jon Chase
Running on Amazon EMR w/Yarn and Spark 1.1.1, I have trouble getting Yarn to use the number of executors that I specify in spark-submit: --num-executors 2 in a cluster with two core nodes typically results in only one executor running at a time. I can play with the memory settings and
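A sketch of the submission being described (only --num-executors 2 is from the post; the other flags and the jar are illustrative). On YARN, the requested executors are only granted if each container's memory and vcores fit within what the core nodes offer:

```bash
spark-submit \
  --master yarn \
  --num-executors 2 \
  --executor-cores 4 \
  --executor-memory 4G \
  --class com.example.MyJob \
  my-job.jar
```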

Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-18 Thread Jon Chase
I'm running a very simple Spark application that downloads files from S3, does a bit of mapping, then uploads new files. Each file is roughly 2MB and is gzip'd. I was running the same code on Amazon's EMR w/Spark and not having any download speed issues (Amazon's EMR provides a custom

Spark cluster with Java 8 using ./spark-ec2

2014-11-25 Thread Jon Chase
I'm trying to use the spark-ec2 command to launch a Spark cluster that runs Java 8, but so far I haven't been able to get the Spark processes to use the right JVM at startup. Here's the command I use for launching the cluster. Note I'm using the user-data feature to install Java 8: ./spark-ec2
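An approximate sketch of such a launch (the flag values and the user-data script name are placeholders; flag names are from memory of the spark-ec2 script, and --user-data in particular should be verified against the version in use):

```bash
# The user-data script runs on each instance at boot, which is where Java 8 can be installed.
./spark-ec2 \
  --key-pair=my-keypair \
  --identity-file=my-keypair.pem \
  --slaves=2 \
  --user-data=install-java8.sh \
  launch java8-cluster
```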