Re: Union of DStream and RDD

2017-02-11 Thread Amit Sela
Not specifically; I want to be able to union any form of DStream/RDD in general. I'm working on Apache Beam's Spark runner, so the abstraction there does not distinguish between streaming and batch (kinda like the Dataset API). Since I wrote my own InputDStream, I will simply stream any "batch source" instead,
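A minimal sketch of that idea, wrapping a batch RDD in a single-batch DStream via queueStream (the socket source and names are made up for illustration; this is not the actual Beam runner code):

    import scala.collection.mutable.Queue
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc   = new StreamingContext(sc, Seconds(10))
    val batch = sc.parallelize(Seq("a", "b"))       // any "batch source"
    val live  = ssc.socketTextStream("host", 9999)  // hypothetical live stream

    // Wrap the RDD in a one-shot DStream, then union it with the live stream.
    val unioned = live.union(ssc.queueStream(Queue(batch), oneAtATime = true))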

Re: Union of DStream and RDD

2017-02-11 Thread Egor Pahomov
Interestingly, I just faced the same problem. By any chance, do you want to process old files in the directory as well as new ones? That's my motivation, and checkpointing is my problem as well. 2017-02-08 22:02 GMT-08:00 Amit Sela : > Not with checkpointing. > > On Thu, Feb
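For the archive, a sketch of picking up pre-existing files along with new ones, assuming the plain fileStream API suffices (the path is made up):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // newFilesOnly = false also processes files already present in the
    // directory when the streaming context starts.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///data/in", (p: Path) => true, newFilesOnly = false
    ).map(_._2.toString)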

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Egor Pahomov
Got it, thanks! 2017-02-11 0:56 GMT-08:00 Sam Elamin : > Here's a link to the thread > http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html > On Sat, 11 Feb 2017 at 08:47, Sam Elamin

Remove dependence on HDFS

2017-02-11 Thread Benjamin Kim
Has anyone got advice on how to remove the reliance on HDFS for storing persistent data? We have an on-premise Spark cluster. It seems like a waste of resources to keep adding nodes only because of a lack of storage space. I would rather add more powerful nodes due to the lack of
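One direction that comes up for this (a sketch assuming an S3-compatible object store replaces HDFS; the bucket name is made up, and hadoop-aws plus a matching AWS SDK must be on the classpath):

    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Persistent data lives in the object store; cluster nodes stay compute-only.
    val df = spark.read.parquet("s3a://my-bucket/warehouse/events")
    df.write.mode("overwrite").parquet("s3a://my-bucket/warehouse/events_out")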

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-11 Thread Timur Shenkao
Hello, 1) Are you sure that your data is "clean"? No unexpected missing values? No strings in an unusual encoding? No additional or missing columns? 2) How long does your job run? What about garbage collector parameters? Have you checked what happens with jconsole / jvisualvm? Sincerely yours,
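A quick way to run the first of those checks (a sketch; df stands for the feature DataFrame) is a one-pass null count per column:

    import org.apache.spark.sql.functions.{col, sum, when}

    // Count missing values in every column in a single job.
    val nullCounts = df.select(df.columns.map(c =>
      sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*)
    nullCounts.show()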

Re: Strange behavior with 'not' and filter pushdown

2017-02-11 Thread Everett Anderson
On the plus side, looks like this may be fixed in 2.1.0:

    == Physical Plan ==
    *HashAggregate(keys=[], functions=[count(1)])
    +- Exchange SinglePartition
       +- *HashAggregate(keys=[], functions=[partial_count(1)])
          +- *Project
             +- *Filter NOT isnotnull(username#14)
                +-
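For anyone replaying this, the predicate in that plan corresponds to a query of roughly this shape (column name taken from the plan; df is assumed):

    import org.apache.spark.sql.functions.{col, not}

    // "Filter NOT isnotnull(username#14)" comes from a negated not-null test.
    val kept = df.filter(not(col("username").isNotNull))
    kept.explain()  // inspect whether the NOT predicate is pushed down
    println(kept.count())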

Case class with POJO - encoder issues

2017-02-11 Thread Jason White
I'd like to create a Dataset using some classes from Geotools to do some geospatial analysis. In particular, I'm trying to use Spark to distribute the work based on ID and label fields that I extract from the polygon data. My simplified case class looks like this: implicit val geometryEncoder:
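The workaround that usually comes up for a field Spark has no built-in encoder for is Kryo-encoding the record, sketched here with java.awt's Point2D standing in for the Geotools geometry:

    import org.apache.spark.sql.{Encoder, Encoders}

    // Point2D stands in for a POJO Spark has no built-in encoder for.
    case class Shape(id: Long, label: String, geom: java.awt.geom.Point2D.Double)

    // Kryo-encode the whole record: the Dataset then holds one binary column,
    // so id and label are no longer queryable columns; to keep them as
    // columns, extract them into a plain case class first.
    implicit val shapeEncoder: Encoder[Shape] = Encoders.kryo[Shape]
    val ds = spark.createDataset(Seq(
      Shape(1L, "park", new java.awt.geom.Point2D.Double(53.27, -9.05))))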

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Henry Tremblay
51,000 files at about 1/2 MB per file. I am wondering if I need this: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html Although, if I am understanding you correctly, even if I copy the S3 files to HDFS on EMR and use wholeTextFiles, I am still only going to be able to

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Jörn Franke
Can you post more information about the number of files, their size, and the executor logs? A gzipped file is not splittable, i.e. only one executor can gunzip it (the unzipped data can then be processed in parallel). wholeTextFiles was designed to be executed only on one executor (e.g. for
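To make the splittability point concrete (a sketch with a made-up path):

    // One gzipped file => one partition, because gzip is not splittable.
    val lines = sc.textFile("s3a://my-bucket/logs/big.json.gz")
    println(lines.getNumPartitions)  // 1

    // Fan out after the single-task decompression so downstream work
    // runs in parallel.
    val parallel = lines.repartition(64)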

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Paul Tremblay
I've been working on this problem for several days (I am doing more to increase my knowledge of Spark). The code you linked to hangs because, after reading in the file, I have to gunzip it. Another way that seems to be working is reading each file in using sc.textFile, and then writing it the

Re: Turning rows into columns

2017-02-11 Thread Paul Tremblay
Yes, that's what I need. Thanks. P. On 02/05/2017 12:17 PM, Koert Kuipers wrote: since there is no key to group by and assemble records, I would suggest writing this in RDD land and then converting to a data frame. You can use sc.wholeTextFiles to process text files and create a state machine
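A skeleton of that suggestion (assuming records start with a recognizable prefix; the prefix, path, and delimiter are made up):

    // Each file arrives whole as (path, content), so multi-line records can
    // be reassembled with a small fold acting as the state machine.
    val assembled = sc.wholeTextFiles("hdfs:///data/in")
      .flatMap { case (_, content) =>
        content.split("\n").foldLeft(List.empty[List[String]]) {
          case (acc, line) if line.startsWith("KEY") => List(line) :: acc  // new record
          case (rec :: rest, line)                   => (line :: rec) :: rest
          case (Nil, line)                           => List(List(line))
        }.map(_.reverse.mkString("|"))
      }
    import spark.implicits._
    val df = assembled.toDF("record")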

Re: Getting exit code of pipe()

2017-02-11 Thread Felix Cheung
Do you want the job to fail if there is an error exit code? You could set checkCode to True: spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe Otherwise, maybe you want
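The Scala side of the same API, for comparison (a sketch; in the Scala API a non-zero exit status appears to fail the task with an IllegalStateException when the output is consumed, so there is no checkCode flag to set):

    // Pipe each partition through an external command, line by line.
    val out = sc.parallelize(Seq("a", "b", "c")).pipe("cat")
    out.collect()  // Array(a, b, c)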

Disable Spark SQL Optimizations for unit tests

2017-02-11 Thread Stefan Ackermann
Hi, Can the Spark SQL optimizations be disabled somehow? In our project we started four weeks ago to write Scala / Spark / DataFrame code. We currently have only around 10% of the planned project scope, and we are already waiting 10 (Spark 2.1.0, everything cached) to 30 (Spark 1.6, nothing cached)
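There is no blanket off switch in 2.1 as far as the docs show; a workaround that gets suggested for pathologically deep plans (a sketch) is cutting the lineage so each optimizer run sees a short plan:

    import org.apache.spark.sql.DataFrame

    // Round-tripping through an RDD truncates the logical plan, so the
    // optimizer no longer re-analyzes the whole upstream query tree.
    def truncatePlan(df: DataFrame): DataFrame =
      df.sparkSession.createDataFrame(df.rdd, df.schema)

    // Spark 2.1 alternative (needs spark.sparkContext.setCheckpointDir):
    // val short = df.checkpoint()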

Re: From C* to DataFrames with JSON

2017-02-11 Thread Takeshi Yamamuro
If you upgrade to v2.1, you can use to_json/from_json in sql.functions. On Fri, Feb 10, 2017 at 3:12 PM, Jean-Francois Gosselin < jfgosse...@gmail.com> wrote: > > Hi all, > > I'm struggling (Spark / Scala newbie) to create a DataFrame from a C* > table but also create a DataFrame from column
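A sketch of those two functions against a hypothetical JSON column (the schema and column names are made up):

    import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

    // Hypothetical schema for a JSON string column coming out of C*.
    val schema = new StructType().add("name", StringType).add("age", IntegerType)

    val parsed = df.withColumn("j", from_json(col("json_col"), schema))
      .select(col("j.name"), col("j.age"))
    val back = parsed.withColumn("json", to_json(struct(col("name"), col("age"))))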

Re: EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Md. Rezaul Karim
Thanks for the great help. Appreciated! Regards, _ *Md. Rezaul Karim*, BSc, MSc PhD Researcher, INSIGHT Centre for Data Analytics National University of Ireland, Galway IDA Business Park, Dangan, Galway, Ireland Web: http://www.reza-analytics.eu/index.html

Re: EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Takeshi Yamamuro
Moved to https://github.com/amplab/spark-ec2. Yeah, I think the script was just moved there, so you can use it in the same way. On Sat, Feb 11, 2017 at 9:59 PM, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Hi Takeshi, > > Now I understand that spark-ec2 script was moved to

Re: EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Md. Rezaul Karim
Hi Takeshi, Now I understand that the spark-ec2 script was moved to AMPLab. How can I use it at its new location/URL, please? Alternatively, can I use the same script provided with prior Spark releases? Regards, _ *Md. Rezaul Karim*, BSc, MSc PhD Researcher,

Re: EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Takeshi Yamamuro
Hi, Have you checked this? https://issues.apache.org/jira/browse/SPARK-12735 // maropu On Sat, Feb 11, 2017 at 9:34 PM, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Dear Spark Users, > > I was wondering why the EC2 script is missing in Spark release > 2.0.0.~2.1.0? Is there any

EC2 script is missing in Spark 2.0.0~2.1.0

2017-02-11 Thread Md. Rezaul Karim
Dear Spark Users, I was wondering why the EC2 script is missing in Spark releases 2.0.0~2.1.0. Is there any specific reason for that? Please note that I have chosen the package type "Pre-built for Hadoop 2.7 and later" for Spark 2.1.0, for example. Am I doing something wrong? Regards,

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Sam Elamin
Here's a link to the thread http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html On Sat, 11 Feb 2017 at 08:47, Sam Elamin wrote: > Hey Egor > > You can use a foreach writer or you can write a custom sink. I

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Sam Elamin
Hey Egor, You can use a foreach writer or you can write a custom sink. I personally went with a custom sink, since I get a DataFrame per batch: https://github.com/samelamin/spark-bigquery/blob/master/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySink.scala You can have a look at
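For completeness, the foreach-writer route looks roughly like this (a sketch; the println stands in for a real per-row write, and sdf is an assumed streaming DataFrame):

    import org.apache.spark.sql.{ForeachWriter, Row}

    val writer = new ForeachWriter[Row] {
      def open(partitionId: Long, version: Long): Boolean = true // e.g. open a connection
      def process(row: Row): Unit = println(row)                 // write one row
      def close(errorCause: Throwable): Unit = ()                // clean up
    }
    val query = sdf.writeStream.foreach(writer).start()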