Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Dawid Wysakowicz
…ask can read and process > unsplittable data - versus many tasks spread across the cluster. > > On Wed, Dec 30, 2015 at 6:45 AM, Dawid Wysakowicz < > wysakowicz.da...@gmail.com> wrote: > >> Hasn't anyone used Spark with ORC and snappy compression? >> >> 2015-12-

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Dawid Wysakowicz
Hasn't anyone used Spark with ORC and snappy compression? 2015-12-29 18:25 GMT+01:00 Dawid Wysakowicz <wysakowicz.da...@gmail.com>: > Hi, > > I have a table in Hive stored as ORC with compression = snappy. I try to > execute a query on that table that fails (previously I ran it
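The thread describes a Hive table stored as ORC with snappy compression. As a minimal sketch (table name, columns, and query are hypothetical, not from the thread), this is how such a table could be declared and queried through a Spark 1.x `HiveContext`:

```python
# Hypothetical DDL for the kind of table discussed in the thread: ORC
# storage with snappy compression. Table and column names are made up.
create_table = """
CREATE TABLE events (id BIGINT, name STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY')
"""

query = "SELECT count(*) FROM events"

# With a Spark 1.x HiveContext these would be run as:
# sqlContext.sql(create_table)
# sqlContext.sql(query).show()
```

The `sql` calls are left commented because they require a running Spark installation with Hive support.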

SparkSQL Hive orc snappy table

2015-12-29 Thread Dawid Wysakowicz
on that matter. Regards Dawid Wysakowicz

Re: submit_spark_job_to_YARN

2015-08-30 Thread Dawid Wysakowicz
Hi Ajay, in short: no, there is no easy way to do that. But if you'd like to explore this topic, a good starting point would be this blog post from SequenceIQ: http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/. I heard rumors that there is some work going on to
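The blog post linked above shows driving a Spark submission from Java. A rough illustration of the same idea, sketched here in Python, is to assemble and invoke a `spark-submit` command programmatically; the master, main class, and jar path below are placeholders, not values from the thread:

```python
import subprocess

# Build a spark-submit invocation programmatically. The master, main
# class, and jar path are hypothetical placeholders.
cmd = [
    "spark-submit",
    "--master", "yarn-cluster",
    "--class", "com.example.MyApp",
    "/path/to/my-app.jar",
]

# Uncomment to actually launch (requires Spark on the PATH and a YARN cluster):
# subprocess.call(cmd)
```

The actual call is commented out since it needs a configured cluster; the point is only that the submission can be scripted rather than typed by hand.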

Re: SparkSQL concerning materials

2015-08-21 Thread Dawid Wysakowicz
at 7:50 AM, Muhammad Atif muhammadatif...@gmail.com wrote: Hi Dawid The best place to get started is the Spark SQL Guide from Apache http://spark.apache.org/docs/latest/sql-programming-guide.html Regards Muhammad On Thu, Aug 20, 2015 at 5:46 AM, Dawid Wysakowicz wysakowicz.da

SparkSQL concerning materials

2015-08-20 Thread Dawid Wysakowicz
Hi, I would like to dip into SparkSQL and get to know the architecture, good practices, and some internals better. Could you recommend some materials on this matter? Regards Dawid

Re: Spark return key value pair

2015-08-19 Thread Dawid Wysakowicz
I am not 100% sure, but flatMap probably unwinds the tuples. Try map instead. 2015-08-19 13:10 GMT+02:00 Jerry OELoo oylje...@gmail.com: Hi. I want to parse a file and return a key-value pair with pySpark, but the result is strange to me. test.sql is a big file and each line is usename
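The difference the reply points at can be shown without a cluster: Spark's `flatMap` flattens each returned collection into individual elements, while `map` keeps one output element per input. A plain-Python analogy (the sample lines are made up):

```python
# Plain-Python analogy of Spark's map vs flatMap on "key,value" lines.
lines = ["alice,30", "bob,25"]

# map: one output element per input line -> a list of [key, value] pairs
mapped = [line.split(",") for line in lines]

# flatMap: each returned list is unwound into individual elements
flattened = [part for line in lines for part in line.split(",")]

print(mapped)     # [['alice', '30'], ['bob', '25']]
print(flattened)  # ['alice', '30', 'bob', '25']
```

So a pySpark `flatMap(lambda line: line.split(","))` yields a flat stream of fields, not pairs, which matches the "strange" result described in the question.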

Re: Regarding rdd.collect()

2015-08-18 Thread Dawid Wysakowicz
No, the data is not stored between two jobs, but it is stored for the lifetime of a job, and a job can have multiple actions run. For sharing an RDD between jobs you can have a look at Spark Job Server (spark-jobserver https://github.com/ooyala/spark-jobserver) or some in-memory storages:
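The point about multiple actions within one job can be illustrated with a plain-Python analogy (this is not the Spark API, just a sketch of why caching matters): an uncached lineage is recomputed for every action, while a cached result is computed once and reused.

```python
# Plain-Python analogy of RDD caching: count how often the "expensive
# transformation" is actually evaluated.
calls = {"n": 0}

def expensive_transform(data):
    calls["n"] += 1
    return [x * 2 for x in data]

data = [1, 2, 3]

# Without caching: each "action" re-runs the whole lineage.
total = sum(expensive_transform(data))   # first evaluation
top = max(expensive_transform(data))     # second evaluation

# With caching (the analogue of rdd.cache()): evaluate once, reuse.
cached = expensive_transform(data)       # third and last evaluation
total2 = sum(cached)
top2 = max(cached)

print(calls["n"])  # 3
```

In pySpark the cached variant would be `rdd.cache()` followed by several actions; the cache lives only for the duration of that job's SparkContext.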

Fwd: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
-- Forwarded message -- From: Dawid Wysakowicz wysakowicz.da...@gmail.com Date: 2015-08-14 9:32 GMT+02:00 Subject: Re: Using unserializable classes in tasks To: mark manwoodv...@googlemail.com I am not an expert, but first of all check whether there is a ready connector (you mentioned

Re: Using unserializable classes in tasks

2015-08-14 Thread Dawid Wysakowicz
, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: I am not an expert, but first of all check whether there is a ready connector (you mentioned Cassandra - check: spark-cassandra-connector https://github.com/datastax/spark-cassandra-connector ). If you really want to do something on your own, all
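The underlying problem in this thread is that objects captured in a task closure must be serializable to be shipped to executors. A small Python sketch of both the failure and the usual workaround (the `Client` class and the per-partition function are hypothetical):

```python
import pickle
import socket

# A driver-side object wrapping an unserializable handle (here a raw
# socket) cannot be pickled into a task closure.
class Client:
    def __init__(self):
        self.sock = socket.socket()

client = Client()
try:
    pickle.dumps(client)
    serializable = True
except Exception:
    serializable = False

print(serializable)  # False

# The usual workaround: construct the object on the executor side, e.g.
# once per partition (cf. mapPartitions/foreachPartition), instead of
# capturing it in the closure.
def save_partition(rows):
    local_client = Client()   # created where the task actually runs
    for row in rows:
        pass                  # e.g. local_client would write the row out
```

This is also why ready-made connectors such as spark-cassandra-connector are preferable: they manage per-executor connections internally.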