Re: parquet vs orc files

2018-02-21 Thread Kane Kim
…ut in a test. This highly depends on the data and the analysis you want to do. > On 21. Feb 2018, at 21:54, Kane Kim <kane.ist...@gmail.com> wrote: > Hello, > Which format is better supported in spark, parquet or orc? > Will spark use i…

parquet vs orc files

2018-02-21 Thread Kane Kim
Hello, which format is better supported in Spark: Parquet or ORC? Will Spark use the internal sorting of Parquet/ORC files (and how can I test that)? Can Spark save sorted Parquet/ORC files? Thanks!

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
…it as: telnet s3.amazonaws.com 80 GET / HTTP/1.0 Thanks, Best Regards. On Wed, Feb 11, 2015 at 6:43 AM, Kane Kim kane.ist...@gmail.com wrote: I'm getting this warning when using s3 input: 15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in response…

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
it is skewed. cheers On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim kane.ist...@gmail.com wrote: The thing is that my time is perfectly valid... On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Its with the timezone actually, you can either use an NTP to maintain accurate

spark python exception

2015-02-10 Thread Kane Kim
sometimes I'm getting this exception:

Traceback (most recent call last):
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/daemon.py", line 162, in manager
    code = worker(sock)
  File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/daemon.py", line 64, in worker
    outfile.flush()
IOError:

spark, reading from s3

2015-02-10 Thread Kane Kim
I'm getting this warning when using s3 input: 15/02/11 00:58:37 WARN RestStorageService: Adjusted time offset in response to RequestTimeTooSkewed error. Local machine and S3 server disagree on the time by approximately 0 seconds. Retrying connection. After that there are tons of 403/forbidden
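For reference, RequestTimeTooSkewed is S3 rejecting requests because the client's clock disagrees with S3's by more than it tolerates (roughly 15 minutes), and the usual fix is syncing the machine's clock. A sketch, assuming a typical Linux host with ntpdate/ntpd installed (package and service names vary by distro):

```shell
# One-off clock sync; the RequestTimeTooSkewed 403s usually stop once the
# local time agrees with S3's.
sudo ntpdate pool.ntp.org
# Keep the clock synced going forward (service name varies by distro).
sudo service ntpd start
# Sanity-check the UTC time.
date -u
```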

Re: python api and gzip compression

2015-02-09 Thread Kane Kim
Found it - used saveAsHadoopFile. On Mon, Feb 9, 2015 at 9:11 AM, Kane Kim kane.ist...@gmail.com wrote: Hi, how do I compress output with gzip using the Python API? Thanks!

Re: pyspark - gzip output compression

2015-02-05 Thread Kane Kim
I'm getting "SequenceFile doesn't work with GzipCodec without native-hadoop code!" Where can I get those libs, and where do I put them in Spark? Also, can I save a plain text file (as with saveAsTextFile) as gzip? Thanks. On Wed, Feb 4, 2015 at 11:10 PM, Kane Kim kane.ist...@gmail.com wrote: How to save

Re: spark on ec2

2015-02-05 Thread Kane Kim
cluster and got odd results for stopping the workers (no workers found) but the start script... seemed to work. My integration cluster was running and functioning after executing both scripts, but I also didn't make any changes to spark-env either. On Thu Feb 05 2015 at 9:49:49 PM Kane Kim

spark driver behind firewall

2015-02-05 Thread Kane Kim
I submit Spark jobs from a machine behind a firewall and can't open any incoming connections to that box. Does the driver absolutely need to accept incoming connections? Is there any workaround for this case? Thanks.
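The driver does need inbound connectivity from the executors, so one common workaround is cluster deploy mode, which runs the driver on a node inside the cluster and leaves the firewalled machine with only outbound traffic. A hedged sketch (master URL, class, and jar names are placeholders; note that in the Spark 1.x standalone scheduler cluster mode only supported jar-based apps, not PySpark):

```shell
# --deploy-mode cluster launches the driver on a worker inside the cluster,
# so the submitting machine behind the firewall needs no inbound connections.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyJob \
  my-job.jar
```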

spark on ec2

2015-02-05 Thread Kane Kim
Hi, I'm trying to change a setting as described here: http://spark.apache.org/docs/1.2.0/ec2-scripts.html export SPARK_WORKER_CORES=6 Then I ran ~/spark-ec2/copy-dir /root/spark/conf to distribute it to the slaves, but without any effect. Do I have to restart the workers? How do I do that with spark-ec2?
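Yes: spark-env.sh settings such as SPARK_WORKER_CORES are only read when the standalone daemons start, so after copy-dir the cluster needs a restart. On a spark-ec2 cluster this is typically done from the master with the bundled scripts (paths below are the usual spark-ec2 defaults; verify them on your AMI):

```shell
# Run on the master after copy-dir has pushed the new conf to the slaves.
/root/spark/sbin/stop-all.sh    # stops the master and all registered workers
/root/spark/sbin/start-all.sh   # starts them again with the new spark-env.sh
```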

pyspark - gzip output compression

2015-02-04 Thread Kane Kim
How to save RDD with gzip compression? Thanks.

Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Kane Kim
I'm trying to process a large dataset; mapping/filtering works OK, but as soon as I try to reduceByKey, I get out-of-memory errors: http://pastebin.com/70M5d0Bn Any ideas how I can fix that? Thanks.

processing large dataset

2015-01-22 Thread Kane Kim
I'm trying to process 5TB of data, not doing anything fancy, just map/filter and reduceByKey. I spent the whole day today trying to get it processed but never succeeded. I've tried deploying to EC2 with the script provided with Spark, on pretty beefy machines (100 r3.2xlarge nodes). Really frustrated

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-20 Thread Kane Kim
Related question: is execution of different stages optimized? I.e., will a map followed by a filter require two passes, or will they be combined into a single one? On Tue, Jan 20, 2015 at 4:33 AM, Bob Tiernay btier...@hotmail.com wrote: I found the following to be a good discussion of the same topic:

spark java options

2015-01-16 Thread Kane Kim
I want to add some Java options when submitting the application: --conf spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder But it looks like they don't get set. Where can I add them to make this work? Thanks.
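The usual culprit here is shell quoting: unquoted, the space between the two -XX flags makes the shell hand -XX:+FlightRecorder to spark-submit as a separate argument. A sketch with the pair quoted (the application name is a placeholder):

```shell
# Quote the entire key=value pair so both JVM flags stay in one --conf value.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder" \
  my_job.py
```

The same property can also be set in conf/spark-defaults.conf, where no shell quoting is needed.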