Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-10 Thread Jacek Laskowski
"Something like that" I've never tried it out myself so I'm only guessing having a brief look at the API. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat,

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-10 Thread Egor Pahomov
Jacek, so I create a cache in the ForeachWriter, write to it in every "process()", and flush it on close? Something like that? 2017-02-09 12:42 GMT-08:00 Jacek Laskowski : > Hi, > > Yes, that's ForeachWriter. > > Yes, it works element by element. You're looking for mapPartition >
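
A minimal sketch of the buffer-and-flush pattern being discussed, assuming string records and a hypothetical writeBatch helper for the actual storage write (ForeachWriter is Scala/Java-only in Spark 2.1); it would be attached with df.writeStream.foreach(new BufferedWriter).start():

    import org.apache.spark.sql.ForeachWriter
    import scala.collection.mutable.ArrayBuffer

    class BufferedWriter extends ForeachWriter[String] {
      private var buffer: ArrayBuffer[String] = _

      override def open(partitionId: Long, version: Long): Boolean = {
        buffer = ArrayBuffer.empty[String]
        true  // true means this partition/version should be processed
      }

      override def process(record: String): Unit = {
        buffer += record  // accumulate instead of writing row by row
      }

      override def close(errorOrNull: Throwable): Unit = {
        if (errorOrNull == null) writeBatch(buffer)  // flush once per partition
      }

      // Hypothetical helper: persist the buffered rows, e.g. append files
      // under the table's location.
      private def writeBatch(rows: Seq[String]): Unit = ()
    }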

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread nguyen duc Tuan
Hi Nick, Because we use *RandomSignProjectionLSH*, the only LSH parameter is the number of hashes. I tried a small number of hashes (2), but the error still happens, and it happens when I call the similarity join. After transformation, the size of the dataset is about 4 GB. 2017-02-11 3:07
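
For reference, RandomSignProjectionLSH is not part of stock Spark 2.1, so the following is only an analogous sketch using the built-in BucketedRandomProjectionLSH, assuming a DataFrame named dataset with a vector column "features":

    import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

    val lsh = new BucketedRandomProjectionLSH()
      .setNumHashTables(2)      // the "number of hashes" from the thread
      .setBucketLength(4.0)     // required by this variant; tune to the data
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = lsh.fit(dataset)
    // Self-join within Euclidean distance 1.5 -- the step reported to fail.
    val pairs = model.approxSimilarityJoin(dataset, dataset, 1.5)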

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Nick Pentreath
What other params are you using for the lsh transformer? Are the issues occurring during transform or during the similarity join? On Fri, 10 Feb 2017 at 05:46, nguyen duc Tuan wrote: > hi Das, > In general, I will apply them to larger datasets, so I want to use LSH, >

Re: Strange behavior with 'not' and filter pushdown

2017-02-10 Thread Everett Anderson
Bumping this thread. Translating "where not(username is not null)" into a filter of [IsNotNull(username), Not(IsNotNull(username))] seems like a rather severe bug. Spark 1.6.2: explain select count(*) from parquet_table where not( username is not null) == Physical Plan ==
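
A sketch to reproduce the reported plans from spark-shell, assuming a registered table named parquet_table; since NOT (x IS NOT NULL) is equivalent to x IS NULL, the second form sidesteps the contradictory [IsNotNull, Not(IsNotNull)] filters:

    // Shows the suspicious pushed-down filters from the thread.
    spark.sql(
      "select count(*) from parquet_table where not(username is not null)"
    ).explain()

    // Logically equivalent form that avoids the Not(IsNotNull(...)) translation.
    spark.sql(
      "select count(*) from parquet_table where username is null"
    ).explain()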

Getting exit code of pipe()

2017-02-10 Thread Xuchen Yao
Hello Community, I have the following Python code that calls an external command: rdd.pipe('run.sh', env=os.environ).collect() run.sh can exit with status 0 or 1; how could I get the exit code from Python? Thanks! Xuchen
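
As far as I can tell, pipe() does not return the exit code; PySpark's RDD.pipe only offers a checkCode flag that fails the job on a non-zero exit. One workaround is to have the shell itself append the status to the output stream, sketched here in Scala (the same trick works from Python):

    // Wrap the command so the shell reports the exit status as a marker line
    // (one per partition), then split it back out of the piped output.
    // The "./run.sh" path is illustrative.
    val out = rdd.pipe(Seq("/bin/sh", "-c", "./run.sh; echo EXIT_CODE:$?")).collect()
    val exitCodes = out.filter(_.startsWith("EXIT_CODE:"))
                       .map(_.stripPrefix("EXIT_CODE:").toInt)
    val results = out.filterNot(_.startsWith("EXIT_CODE:"))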

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-10 Thread Cosmin Posteuca
Thank you very much for your answers; now I understand better what I have to do. Thank you! On Wed, 8 Feb 2017 at 22:37, Gourav Sengupta wrote: > Hi, > > I am not quite sure of your use case here, but I would use spark-submit > and submit sequential jobs as steps to

Re: Driver hung and happend out of memory while writing to console progress bar

2017-02-10 Thread Ryan Blue
This isn't related to the progress bar; it just happened while in that section of code. Something else is taking memory in the driver, usually a broadcast table or another operation that needs a lot of memory on the driver. You should check your driver memory settings and the query
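
A hedged sketch of the kind of settings to check; the 8g value is only illustrative:

    // Driver memory must be set before the driver JVM starts, i.e. at submit
    // time, not from inside the application:
    //   spark-submit --driver-memory 8g ...
    // Broadcast joins build the broadcast table on the driver; lowering or
    // disabling the threshold is one way to test whether that is the culprit.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // -1 disables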

SQL warehouse dir

2017-02-10 Thread Joseph Naegele
Hi all, I've read the docs for Spark SQL 2.1.0 but I'm still having issues with the warehouse and related details. I'm not using Hive proper, so my hive-site.xml consists of a single property, javax.jdo.option.ConnectionURL, set to jdbc:derby:;databaseName=/mnt/data/spark/metastore_db;create=true. I've set
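
For what it's worth, in Spark 2.x the warehouse location is controlled by spark.sql.warehouse.dir rather than by hive-site.xml; a minimal sketch, with an illustrative path:

    import org.apache.spark.sql.SparkSession

    // The Derby metastore location comes from hive-site.xml above; the
    // warehouse directory for managed tables is a separate Spark setting.
    val spark = SparkSession.builder()
      .appName("warehouse-demo")
      .config("spark.sql.warehouse.dir", "/mnt/data/spark/warehouse")
      .getOrCreate()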

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread nguyen duc Tuan
Hi Das, In general, I will apply them to larger datasets, so I want to use LSH, which is more scalable than the approaches you suggested. Have you tried LSH in Spark 2.1.0 before? If yes, how do you set the parameters/configuration to make it work? Thanks. 2017-02-10 19:21 GMT+07:00

Re: Practical configuration to run LSH in Spark 2.1.0

2017-02-10 Thread Debasish Das
If it is 7M rows and 700K features (or say 1M features), brute-force row similarity will run fine as well. Check out SPARK-4823; you can compare quality with the approximate variant. On Feb 9, 2017 2:55 AM, "nguyen duc Tuan" wrote: > Hi everyone, > Since spark 2.1.0
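
SPARK-4823 appears to track row similarities in RowMatrix; the closely related all-pairs column similarity (DIMSUM) did ship and gives the exact-vs-approximate comparison Debasish mentions, e.g. applied to a transposed matrix. A sketch assuming an RDD[Vector] named vectors:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(vectors)
    val exact = mat.columnSimilarities()     // exact cosine similarities
    val approx = mat.columnSimilarities(0.1) // DIMSUM sampling; the threshold
                                             // trades accuracy for cost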

HDFS Shell tool

2017-02-10 Thread Vitásek, Ladislav
Hello Spark fans, I would like to tell you about a tool we want to share with the big data community. I think it can also be handy for Spark users. We created a new utility, HDFS Shell, to work with HDFS data more easily. https://github.com/avast/hdfs-shell *Feature highlights* - HDFS DFS command

Write JavaDStream to Kafka (how?)

2017-02-10 Thread Gutwein, Sebastian
Hi, I'm new to Spark Streaming and want to run some end-to-end tests with Spark and Kafka. My program is running, but nothing arrives at the Kafka topic. Can someone please help me? Where is my mistake? Does someone have a running example of writing a DStream to Kafka 0.10.1.0? The program looks
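
A common shape for this, sketched in Scala (the Java version is structurally the same), assuming a DStream[String] named stream and an illustrative broker and topic:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // One producer per partition per batch: KafkaProducer is not
    // serializable, so it cannot be created on the driver and shipped out.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach { value =>
          producer.send(new ProducerRecord[String, String]("output-topic", value))
        }
        producer.close()  // flushes buffered sends before the task ends
      }
    }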

Add hive-site.xml at runtime

2017-02-10 Thread Shivam Sharma
Hi, I have multiple Hive configurations (hive-site.xml), and because of that I am not able to put any one Hive configuration in the Spark *conf* directory. I want to add the configuration file at the start of any *spark-submit* or *spark-shell*. The conf file is huge, so *--conf* is not an option for me.
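
One commonly suggested approach (paths are illustrative) is to ship the file at submit time and put its directory on the driver classpath; alternatively, the key metastore properties can be set programmatically, which avoids needing hive-site.xml on the classpath at all:

    // Submit-time approach, shown as a comment:
    //   spark-submit --files /path/to/jobA/hive-site.xml \
    //                --driver-class-path /path/to/jobA ...
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("hive.metastore.uris", "thrift://metastore-a:9083")  // illustrative
      .enableHiveSupport()
      .getOrCreate()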

Re: Add hive-site.xml at runtime

2017-02-10 Thread Shivam Sharma
Did anybody get the above mail? Thanks On Fri, Feb 10, 2017 at 11:51 AM, Shivam Sharma <28shivamsha...@gmail.com> wrote: > Hi, > > I have multiple Hive configurations (hive-site.xml), and because of that > I am not able to put any one Hive configuration in the Spark *conf* directory. > I want to add