Re: Writing files to s3 without temporary directory

2017-11-21 Thread Jim Carroll
I got it working. It's much faster. If someone else wants to try it: 1) I was already using the code from the Presto S3 Hadoop FileSystem implementation, modified to sever it from the rest of the Presto codebase. 2) I extended it and overrode the method "keyFromPath" so that anytime the Path
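
A rough sketch of the kind of override described above, assuming a subclass hook exists (DirectWriteS3FileSystem, PrestoDerivedS3FileSystem, and an overridable keyFromPath are placeholders for the poster's modified Presto code, not real class names):

    import org.apache.hadoop.fs.Path

    // Hypothetical subclass of a Presto-derived S3 FileSystem. The idea is to
    // strip the Hadoop/Spark staging segments so part files are written straight
    // to their final S3 keys, skipping the copy/"move" out of _temporary.
    class DirectWriteS3FileSystem extends PrestoDerivedS3FileSystem {
      override protected def keyFromPath(path: Path): String = {
        val key = super.keyFromPath(path)
        val segments = key.split("/")
        val tempIdx = segments.indexOf("_temporary")
        if (tempIdx < 0) key
        // drop the whole staging subtree, keep only the final part-file name
        else (segments.take(tempIdx) :+ segments.last).mkString("/")
      }
    }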

Re: Writing files to s3 without temporary directory

2017-11-21 Thread Jim Carroll
It's not actually that tough. We already use a custom Hadoop FileSystem for S3 because when we started using Spark with S3 the native FileSystem was very unreliable. Ours is based on the code from Presto. (see

Re: Writing files to s3 without temporary directory

2017-11-20 Thread Jim Carroll
Thanks. In the meantime I might just write a custom file system that maps writes to parquet file parts to their final locations and then skips the move.

Re: Writing files to s3 without temporary directory

2017-11-20 Thread Jim Carroll
I have this exact issue. I was going to intercept the call in the filesystem if I had to (since we're using the S3 filesystem from Presto anyway) but if there's simply a way to do this correctly I'd much prefer it. This basically doubles the time to write parquet files to s3.

Spark on mesos in docker not getting parameters

2016-08-09 Thread Jim Carroll
I'm running Spark 2.0.0 on Mesos using spark.mesos.executor.docker.image to point to a docker container that I built with the Spark installation. Everything is working except that the Spark client process started inside the container doesn't get any of the parameters I set in the Spark config in
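
For reference, the settings being discussed look roughly like this (master URI, image name, and values are placeholders); the point of the thread is that settings like these were not reaching the Spark process inside the container:

    import org.apache.spark.SparkConf

    // Sketch of a Mesos + Docker configuration on the driver side.
    val conf = new SparkConf()
      .setAppName("mesos-docker-example")
      .setMaster("mesos://zk://zk-host:2181/mesos")                   // placeholder Mesos master
      .set("spark.mesos.executor.docker.image", "my-registry/spark:2.0.0")
      .set("spark.mesos.executor.home", "/opt/spark")                 // where Spark lives inside the image
      .set("spark.executor.memory", "4g")                             // example of a setting expected to reach the executor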

Non-classification neural networks

2016-03-27 Thread Jim Carroll
Hello all, We were using the old "Artificial Neural Network" : https://github.com/apache/spark/pull/1290 This code appears to have been incorporated in 1.5.2 but it's only exposed publicly via the MultilayerPerceptronClassifier. Is there a way to use the old feedforward/backprop
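
For context, the public API referred to here looks roughly like this (layer sizes and data are placeholders); it only exposes the network as a classifier, not as a general feedforward regressor:

    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

    // The classifier front-end mentioned above; the last layer size is the
    // number of classes, which is why it cannot be used directly for regression.
    val mlp = new MultilayerPerceptronClassifier()
      .setLayers(Array(4, 8, 3))   // input size, one hidden layer, number of classes
      .setBlockSize(128)
      .setMaxIter(200)
      .setSeed(1234L)
    // val model = mlp.fit(trainingData)   // trainingData: DataFrame with "features"/"label" columns (assumed)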

Re: EOFException using KryoSerializer

2015-06-24 Thread Jim Carroll
I finally got back to this and I just wanted to let anyone that runs into this know that the problem is a Kryo version issue. Spark (at least 1.4.0) depends on Kryo 2.21 while my client had 2.24.0 on the classpath. Changing it to 2.21 fixed the problem.
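
If you hit the same conflict, one way to resolve it is a build-level pin; a minimal build.sbt sketch (coordinates assume the Kryo 2.x groupId):

    // build.sbt: force the Kryo version Spark 1.4.0 expects instead of the newer
    // one pulled in transitively by another dependency.
    dependencyOverrides += "com.esotericsoftware.kryo" % "kryo" % "2.21"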

EOFException using KryoSerializer

2015-05-19 Thread Jim Carroll
I'm seeing the following exception ONLY when I run on a Mesos cluster. If I run the exact same code with master set to local[N] I have no problem:
2015-05-19 16:45:43,484 [task-result-getter-0] WARN TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 10.253.1.101): java.io.EOFException

Re: 1.3 Hadoop File System problem

2015-03-25 Thread Jim Carroll
Thanks Patrick and Michael for your responses. For anyone else that runs across this problem prior to 1.3.1 being released, I've been pointed to this Jira ticket that's scheduled for 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 Thanks again.

1.3 Hadoop File System problem

2015-03-24 Thread Jim Carroll
I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to find the s3 hadoop file system. I get the java.lang.IllegalArgumentException: Wrong FS: s3://[path to my file], expected: file:/// when I try to save a parquet file. This worked in 1.2.1. Has anyone else seen this? I'm

Parquet divide by zero

2015-01-28 Thread Jim Carroll
Hello all, I've been hitting a divide by zero error in Parquet through Spark, detailed (and fixed) here: https://github.com/apache/incubator-parquet-mr/pull/102 Is anyone else hitting this error? I hit it frequently. It looks like the Parquet team is preparing to release 1.6.0 and, since they

Cluster Aware Custom RDD

2015-01-16 Thread Jim Carroll
Hello all, I have a custom RDD for fast loading of data from a non-partitioned source. The partitioning happens in the RDD implementation by pushing data from the source into queues picked up by the current active partitions in worker threads. This works great on a multi-threaded single host

Re: No disk single pass RDD aggregation

2014-12-18 Thread Jim Carroll
Hi, This was all my fault. It turned out I had a line of code buried in a library that did a repartition. I used this library to wrap an RDD to present it to legacy code as a different interface. That's what was causing the data to spill to disk. The really stupid thing is it took me the better

No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
Okay, I have an rdd that I want to run an aggregate over but it insists on spilling to disk even though I structured the processing to only require a single pass. In other words, I can do all of my processing one entry in the rdd at a time without persisting anything. I set
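
A single pass with aggregate looks roughly like this (rdd is assumed to be an RDD[String]; the fold logic is a placeholder). Nothing here requires persisting or spilling unless a shuffle is introduced upstream, which is what the accidental repartition in the follow-up turned out to do:

    // seqOp sees each element once per partition; combOp merges the small
    // per-partition results, so only the accumulators ever need to be held.
    val totalChars = rdd.aggregate(0L)(
      (acc, line) => acc + line.length,   // seqOp: one element at a time
      (a, b) => a + b                     // combOp: merge partition results
    )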

Re: No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
In case a little more information is helpful: the RDD is constructed using sc.textFile(fileUri) where the fileUri is to a .gz file (that's too big to fit on my disk). I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect. This rdd is what I'm calling aggregate on and I expect to

Re: No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
Nvm. I'm going to post another question since this has to do with the way Spark handles sc.textFile with a file://.gz

Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Jim Carroll
Is there a way to get Spark to NOT repartition/shuffle/expand a sc.textFile(fileUri) when the URI is a gzipped file? Expanding a gzipped file should be thought of as a transformation and not an action (if the analogy is apt). There is no need to fully create and fill out an intermediate RDD with
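
For what it's worth, a gzipped input is not splittable, so sc.textFile already yields a single partition that is streamed lazily; a minimal sketch (the path is a placeholder):

    val lines = sc.textFile("file:///data/huge.csv.gz")   // one partition for a .gz file
    // As long as nothing calls repartition/coalesce on it, the file is consumed
    // one line at a time and never materialized as an intermediate RDD.
    val total = lines.mapPartitions { it =>
      Iterator.single(it.foldLeft(0L)((acc, line) => acc + line.length))
    }.collect().sum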

How do I stop the automatic partitioning of my RDD?

2014-12-16 Thread Jim Carroll
I've been trying to figure out how to use Spark to do a simple aggregation without repartitioning and essentially creating fully instantiated intermediate RDDs, and it seems virtually impossible. I've now gone as far as writing my own single-partition RDD that wraps an Iterator[String] and calling
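
A minimal version of such a wrapper might look like the sketch below (local-master only, since a live Iterator cannot be shipped to remote executors; the class name and iterator factory are illustrative, not the poster's code):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Single-partition RDD over an iterator factory; compute simply hands the
    // iterator to Spark, so the data is consumed in one streaming pass.
    class SinglePartitionRDD(sc: SparkContext, data: () => Iterator[String])
        extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        Array(new Partition { override def index: Int = 0 })

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        data()
    }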

Re: How do I stop the automatic partitioning of my RDD?

2014-12-16 Thread Jim Carroll
Wow. I just realized what was happening and it's all my fault. I have a library method that I wrote that presents the RDD and I was actually repartitioning it myself. I feel pretty dumb. Sorry about that.

Standard SQL tool access to SchemaRDD

2014-12-02 Thread Jim Carroll
Hello all, Is there a way to load an RDD in a small driver app and connect with a JDBC client and issue SQL queries against it? It seems the thrift server only works with pre-existing Hive tables. Thanks Jim
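
One way to do this with the 1.x-era API is to start the Thrift server from the driver itself; a hedged sketch with placeholder paths and table names:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // Register the RDD as a temporary table and expose it over JDBC from the
    // same driver process, instead of requiring a pre-existing Hive table.
    val hiveContext = new HiveContext(sc)                      // sc: existing SparkContext (assumed)
    val data = hiveContext.jsonFile("/path/to/data.json")      // placeholder source
    data.registerTempTable("my_table")
    HiveThriftServer2.startWithContext(hiveContext)            // JDBC clients can now query my_table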

Re: Standard SQL tool access to SchemaRDD

2014-12-02 Thread Jim Carroll
Thanks! I'll give it a try.

Set worker log configuration when running local[n]

2014-11-14 Thread Jim Carroll
How do I set the log level when running local[n]? It ignores the log4j.properties file on my classpath. I also tried to set the Spark home dir on the SparkConf using setSparkHome and made sure an appropriate log4j.properties file was in a conf subdirectory and that didn't work either. I'm
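
One workaround is to set the log4j levels programmatically in the driver before anything logs (the package names are the usual ones for this era; adjust as needed):

    import org.apache.log4j.{Level, Logger}

    // Quieten Spark's log4j output when running with a local[n] master, without
    // relying on a log4j.properties file being picked up from the classpath.
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.ERROR)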

Re: Set worker log configuration when running local[n]

2014-11-14 Thread Jim Carroll
Actually, it looks like it's Parquet logging that I don't have control over. For some reason the Parquet project decided to use java.util.logging with its own logging configuration.

How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
I'm running a local spark master (local[n]). I cannot seem to turn off the parquet logging. I tried: 1) Setting a log4j.properties on the classpath. 2) Setting a log4j.properties file in a spark install conf directory and pointing to the install using setSparkHome 3) Editing the

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
This is a problem because of a bug in Spark in the current master (beyond the fact that Parquet uses java.util.logging). ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else
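
Until a fix lands, one workaround (a sketch; the logger name assumes the pre-org.apache "parquet" packages of that era) is to silence the JUL logger directly before touching any Parquet code:

    import java.util.logging.{Level, Logger}

    // Parquet 1.x logs through java.util.logging, not log4j, so it has to be
    // silenced at the JUL level. Keep a reference so the logger isn't GC'd.
    val parquetLogger = Logger.getLogger("parquet")
    parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
    parquetLogger.setLevel(Level.SEVERE)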

Re: Set worker log configuration when running local[n]

2014-11-14 Thread Jim Carroll
Just to be complete, this is a problem in Spark that I worked around and detailed here: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-turn-off-Parquet-logging-in-a-worker-td18955.html

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
Jira: https://issues.apache.org/jira/browse/SPARK-4412 PR: https://github.com/apache/spark/pull/3271

Wildly varying aggregate performance depending on code location

2014-11-12 Thread Jim Carroll
Hello all, I have a really strange thing going on. I have a test data set with 500K lines in a gzipped csv file. I have an array of column processors, one for each column in the dataset. A Processor tracks aggregate state and has a method process(v: String). I'm calling: val processors:
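
For readers following along, the call being built up probably resembled something like this sketch (Processor here is a stand-in trait for illustration, not the poster's actual class):

    import org.apache.spark.rdd.RDD

    // Stand-in: tracks per-column state and can merge with another instance so
    // that aggregate's combOp is well-defined.
    trait Processor extends Serializable {
      def process(v: String): Unit
      def merge(other: Processor): Processor
    }

    def aggregateColumns(rdd: RDD[String], zero: Array[Processor]): Array[Processor] =
      rdd.aggregate(zero)(
        (procs, line) => { line.split(',').zip(procs).foreach { case (v, p) => p.process(v) }; procs },
        (a, b) => a.zip(b).map { case (x, y) => x.merge(y) }
      )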

Re: Wildly varying aggregate performance depending on code location

2014-11-12 Thread Jim Carroll
Well it looks like this is a Scala problem after all. I loaded the file using pure Scala and ran the exact same Processors without Spark and I got 20 seconds (with the code in the same file as the 'main') vs 30 seconds (with the exact same code in a different file) on the 500K rows.

Ending a job early

2014-10-28 Thread Jim Carroll
We have some very large datasets where the calculations converge on a result. Our current implementation allows us to track how quickly the calculations are converging and end the processing early. This can significantly speed up some of our processing. Is there a way to do the same thing is
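
One pattern that can approximate this (a hedged sketch; the convergence flag and how it gets set are placeholders for the application's own logic) is to run the work under a job group and cancel it from a watcher thread:

    import java.util.concurrent.atomic.AtomicBoolean

    val converged = new AtomicBoolean(false)   // flipped to true by the application's convergence check
    sc.setJobGroup("convergence-run", "cancellable aggregation", interruptOnCancel = true)

    val watcher = new Thread(new Runnable {
      def run(): Unit = {
        while (!converged.get()) Thread.sleep(1000)
        sc.cancelJobGroup("convergence-run")   // stops the in-flight stages early
      }
    })
    watcher.setDaemon(true)
    watcher.start()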

Re: Network requirements between Driver, Master, and Slave

2014-09-12 Thread Jim Carroll
Hi Akhil, Thanks! I guess in short that means the master (or slaves?) connect back to the driver. This seems like a really odd way to work given the driver needs to already connect to the master on port 7077. I would have thought that if the driver could initiate a connection to the master, that
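
If it helps anyone else, the driver-side settings that control that callback connection are along these lines (values are placeholders):

    import org.apache.spark.SparkConf

    // In standalone mode the executors connect back to the driver, so the
    // driver's advertised host and port must be reachable from the cluster.
    val conf = new SparkConf()
      .setMaster("spark://ec2-master-host:7077")
      .set("spark.driver.host", "driver-hostname-reachable-from-cluster")
      .set("spark.driver.port", "51000")   // pin the port so it can be opened in a firewall/security group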

Network requirements between Driver, Master, and Slave

2014-09-11 Thread Jim Carroll
Hello all, I'm trying to run a Driver on my local network with a deployment on EC2 and it's not working. I was wondering if either the master or slave instances (in standalone) connect back to the driver program. I outlined the details of my observations in a previous post but here is what I'm

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
My apologies to the list. I replied to Manu's question and it went directly to him rather than the list. In case anyone else has this issue here is my reply and Manu's reply to me. This also answers Ian's question. --- Hi Manu, The dataset is 7.5 million

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
Why I think it's the number of files: I believe that all (or a large part) of those files are read when you run sqlContext.parquetFile(), and the time it would take in s3 for that to happen is long enough that something internally is timing out. I'll create the parquet files with Drill

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
Okay, this seems to be either a code version issue or a communication issue. It works if I execute the spark shell from the master node. It doesn't work if I run it from my laptop and connect to the master node. I had opened the ports for the WebUI (8080) and the cluster manager (7077) for the

Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Jim Carroll
Hello all, I've been wrestling with this problem all day and any suggestions would be greatly appreciated. I'm trying to test reading a parquet file that's stored in s3 using a spark cluster deployed on ec2. The following works in the spark shell when run completely locally on my own machine

Using Spark to add data to an existing Parquet file without a schema

2014-09-04 Thread Jim Carroll
Hello all, I've been trying to figure out how to add data to an existing Parquet file without having a schema. Spark has allowed me to load JSON and save it as a Parquet file but I was wondering if anyone knows how to ADD/INSERT more data. I tried using a SQL INSERT and that doesn't work. All of
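
The usual workaround with that era's API (a sketch; paths are placeholders and the schemas are assumed to match) was to union the new rows with the existing data and write the result out again, since there was no in-place INSERT into a Parquet file:

    // Read the existing Parquet data, union the newly loaded JSON, and write the
    // combined result to a new location; sqlContext is an existing SQLContext.
    val existing = sqlContext.parquetFile("/data/existing.parquet")
    val incoming = sqlContext.jsonFile("/data/new-records.json")
    existing.unionAll(incoming).saveAsParquetFile("/data/combined.parquet")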

Re: Using Spark to add data to an existing Parquet file without a schema

2014-09-04 Thread Jim Carroll
Okay, Obviously I don't care about adding more files to the system so is there a way to point to an existing parquet file (directory) and seed the individual part-r-***.parquet (the value of partition + offset) while preventing I mean, I can hack it by copying files into the same parquet

Continuously running non-streaming jobs

2014-04-17 Thread Jim Carroll
Is there a way to create continuously-running, or at least continuously-loaded, jobs that can be 'invoked' rather than 'sent' to avoid the job creation overhead of a couple seconds? I read through the following:

Re: Continuously running non-streaming jobs

2014-04-17 Thread Jim Carroll
Daniel, I'm new to Spark but I thought that thread hinted at the right answer. Thanks, Jim