[PySpark 2.1.0] - SparkContext not properly initialized by SparkConf

2017-01-25 Thread Sidney Feiner
Hey, I'm pasting a question I asked on Stack Overflow without getting any answers :( I hope somebody here knows the answer, thanks in advance! Link to post. I'm migrating from Spark 1.6

Ingesting Large csv File to relational database

2017-01-25 Thread Eric Dain
Hi, I need to write a nightly job that ingests large csv files (~15GB each) and adds/updates/deletes the changed rows in a relational database. If a row is identical to what is in the database, I don't want to re-write the row to the database. Also, if the same item comes from multiple sources (files) I need

do I need to run spark standalone master with supervisord?

2017-01-25 Thread kant kodali
Do I need to run the spark standalone master with a process supervisor such as supervisord or systemd? Does the spark standalone master abort itself if ZooKeeper tells it that it is not the master anymore? Thanks!

Re: Dataframe fails to save to MySQL table in spark app, but succeeds in spark shell

2017-01-25 Thread Takeshi Yamamuro
Hi, I'm not familiar with MySQL though; did your MySQL receive the same query in the two patterns? Have you checked the MySQL logs? // maropu On Thu, Jan 26, 2017 at 12:42 PM, Xuan Dzung Doan < doanxuand...@yahoo.com.invalid> wrote: > Hi, > > Spark version 2.1.0 > MySQL community server version 5.7.17 >

Re: Issue returning Map from UDAF

2017-01-25 Thread Takeshi Yamamuro
Hi, Quickly looking around the attached, I found you wrongly passed the dataType of your aggregator output in line70. So, you need to at least return `MapType` instead of `StructType`. The stacktrace you showed explicitly points at this type mismatch. // maropu On Thu, Jan 26, 2017 at 12:07 PM, Ankur
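For readers following the thread, a minimal sketch of what such an aggregator could look like once the output type is declared as MapType. It assumes hourly Int keys and Long counts; the class and field names are illustrative, not the code attached to the original message:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class HourlyFrequency extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("hour", IntegerType) :: Nil)
      def bufferSchema: StructType =
        StructType(StructField("freq", MapType(IntegerType, LongType)) :: Nil)
      // The declared output type must be MapType, not StructType, otherwise Spark
      // reports exactly this kind of type mismatch at runtime.
      def dataType: DataType = MapType(IntegerType, LongType)
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Map.empty[Int, Long]
      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        val m = buffer.getMap[Int, Long](0)
        val h = input.getInt(0)
        buffer(0) = m + (h -> (m.getOrElse(h, 0L) + 1L))
      }
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        val m1 = buffer1.getMap[Int, Long](0)
        val m2 = buffer2.getMap[Int, Long](0)
        buffer1(0) = m2.foldLeft(m1) { case (acc, (k, v)) =>
          acc + (k -> (acc.getOrElse(k, 0L) + v))
        }
      }
      def evaluate(buffer: Row): Any = buffer.getMap[Int, Long](0)
    }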

Dataframe fails to save to MySQL table in spark app, but succeeds in spark shell

2017-01-25 Thread Xuan Dzung Doan
Hi, Spark version 2.1.0 MySQL community server version 5.7.17 MySQL Connector Java 5.1.40 I need to save a dataframe to a MySQL table. In spark shell, the following statement succeeds: scala> df.write.mode(SaveMode.Append).format("jdbc").option("url",
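For reference, a hedged sketch of what the full write (truncated above) typically looks like in Spark 2.1; the connection details are placeholders, not the poster's actual values, and df is the dataframe from the message:

    import org.apache.spark.sql.SaveMode

    df.write
      .mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydb")  // placeholder URL
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "mytable")                       // placeholder table name
      .option("user", "user")
      .option("password", "password")
      .save()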

Re: Catalyst Expression(s) - Cleanup

2017-01-25 Thread Takeshi Yamamuro
Hi, if you mean clean-up in executors, how about using TaskContext#addTaskCompletionListener? // maropu On Thu, Jan 26, 2017 at 3:13 AM, Bowden, Chris wrote: > Is there currently any way to receive a signal when an Expression will no > longer receive any rows so internal
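A minimal sketch of that suggestion, assuming the expression lazily opens some per-task resource; the Closeable here is a hypothetical stand-in:

    import java.io.Closeable
    import org.apache.spark.TaskContext

    // Hypothetical resource the expression holds open while processing a task.
    val expensiveResource: Closeable = new Closeable { def close(): Unit = () }

    // Inside the expression's executor-side code path, register a cleanup hook
    // that fires when the task finishes, whether it succeeded or failed.
    val ctx = TaskContext.get()
    if (ctx != null) {
      ctx.addTaskCompletionListener { (_: TaskContext) =>
        expensiveResource.close()
      }
    }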

Re: How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-01-25 Thread Raghavendra Pandey
When you start a slave you pass the address of the master as a parameter. That slave will then contact the master and register itself. On Jan 25, 2017 4:12 AM, "kant kodali" wrote: > Hi, > > How do I dynamically add nodes to spark standalone cluster and be able to > discover them? Does

Issue returning Map from UDAF

2017-01-25 Thread Ankur Srivastava
Hi, I have a dataset with tuples of ID and Timestamp. I want to do a group by on ID and then create a map with the frequency per hour for that ID. Input: 1| 20160106061005 1| 20160106061515 1| 20160106064010 1| 20160106050402 1| 20160106040101 2| 20160106040101 3| 20160106051451 Expected Output:

Re: freeing up memory occupied by processed Stream Blocks

2017-01-25 Thread Takeshi Yamamuro
AFAIK spark has no public APIs to clean up those RDDs. On Wed, Jan 25, 2017 at 11:30 PM, Andrew Milkowski wrote: > Hi Takeshi thanks for the answer, looks like spark would free up old RDD's > however using admin UI we see ie > > Block ID, it corresponds with each receiver

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Does the storage handler provide bulk load capability ? Cheers > On Jan 25, 2017, at 3:39 AM, Amrit Jangid wrote: > > Hi chetan, > > If you just need HBase Data into Hive, You can use Hive EXTERNAL TABLE with > STORED BY

Question about DStreamCheckpointData

2017-01-25 Thread Nikhil Goyal
Hi, I am using DStreamCheckpointData and it seems that spark checkpoints data even if the rdd processing fails. It seems to checkpoint at the moment it creates the rdd rather than waiting for its completion. Does anybody know how to make it wait till completion? Thanks Nikhil

SparkML 1.6 GBTRegressor crashes with high maxIter hyper-parameter?

2017-01-25 Thread Aris
When I train a GBTRegressor model from a DataFrame in the latest 1.6.4-SNAPSHOT, with a high number for the hyper-parameter maxIter, say 500, I get a java.lang.StackOverflowError; GBTRegressor does work with maxIter set to about 100. Does this make sense? Are there any known solutions? This is

Re: where is mapWithState executed?

2017-01-25 Thread shyla deshpande
After more reading, I know the state is distributed across the cluster. But if I need to look up a map in the update function, I need to broadcast it. Just want to make sure I am on the right path. Appreciate your help. Thanks On Wed, Jan 25, 2017 at 2:33 PM, shyla deshpande

where is mapWithState executed?

2017-01-25 Thread shyla deshpande
Is it executed on the driver or the executor? If I need to look up a map in the update function, I need to broadcast it, if mapWithState runs on the executor. Thanks
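A hedged sketch of the pattern being asked about: the lookup map is broadcast once from the driver, and the mapWithState update function, which runs on the executors, reads it through the broadcast handle. The source, batch interval, and lookup contents are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    val conf = new SparkConf().setAppName("mapWithState-broadcast-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoint") // mapWithState needs a checkpoint directory

    // Hypothetical lookup table, broadcast once from the driver.
    val lookup = ssc.sparkContext.broadcast(Map("key1" -> "meta1"))

    // Runs on the executors; lookup.value reads the broadcast value there.
    def updateFunc(key: String, value: Option[Long], state: State[Long]): (String, String, Long) = {
      val meta = lookup.value.getOrElse(key, "unknown")
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      state.update(newCount)
      (key, meta, newCount)
    }

    val events = ssc.socketTextStream("localhost", 9999).map(word => (word, 1L)) // placeholder source
    val stateful = events.mapWithState(StateSpec.function(updateFunc _))
    stateful.print()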

can we plz open up encoder on dataset

2017-01-25 Thread Koert Kuipers
I often run into problems like this: I need to write a Dataset[T] => Dataset[T], and inside I need to switch to DataFrame for a particular operation. But if I do: dataset.toDF.map(...).as[T] I get the error: Unable to find encoder for type stored in a Dataset. I know it has an encoder, because I
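A hedged sketch of one common workaround: require an implicit Encoder[T] in the generic code so the round trip through DataFrame can come back to Dataset[T]. The filter("true") call is just a placeholder for the real DataFrame-only operation:

    import org.apache.spark.sql.{Dataset, Encoder}

    def viaDataFrame[T](ds: Dataset[T])(implicit enc: Encoder[T]): Dataset[T] =
      ds.toDF().filter("true").as[T]

Called as viaDataFrame(myDs) with spark.implicits._ in scope, the encoder is supplied at the call site where the concrete type is known.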

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
Processing of the same data more than once can happen only when the app recovers after a failure or during an upgrade. So how do I apply your 2nd solution only for 1-2 hrs after a restart? On Wed, Jan 25, 2017 at 12:51 PM, shyla deshpande wrote: > Thanks Burak. I do want

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
Thanks Burak. I do want accuracy, that is why I want to make it idempotent. I will try out your 2nd solution. On Wed, Jan 25, 2017 at 12:27 PM, Burak Yavuz wrote: > Yes you may. Depends on if you want exact values or if you're okay with > approximations. With Big Data,

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
Yes you may. Depends on whether you want exact values or you're okay with approximations. With Big Data, generally you would be okay with approximations. Try both out, see what scales/works with your dataset. Maybe you can handle the second implementation. On Wed, Jan 25, 2017 at 12:23 PM, shyla

Drop Partition Fails

2017-01-25 Thread Subacini Balakrishnan
Hi, When we execute a drop partition command on a hive external table from spark-shell we get the error below. The same command works fine from the hive shell. It is a table with just two records. Spark Version: 1.5.2 scala> hiveCtx.sql("select * from spark_2_test").collect().foreach(println);

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
Thanks Burak. But with a BloomFilter, won't I be getting false positives? On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz wrote: > I noticed that 1 wouldn't be a problem, because you'll save the > BloomFilter in the state. > > For 2, you would keep a Map of UUID's to the

[ML - Beginner - How To] - GaussianMixtureModel and GaussianMixtureModel$

2017-01-25 Thread Saulo Ricci
Hi, I'm studying the Java implementation code of the ml library, and I'd like to know why there are 2 implementations of GaussianMixtureModel - #1 GaussianMixtureModel and #2 GaussianMixtureModel$. I appreciate the answers. Thank you, Saulo

Re: spark intermediate data fills up the disk

2017-01-25 Thread kant kodali
Oh sorry, it's actually in the documentation. I should just set spark.worker.cleanup.enabled = true On Wed, Jan 25, 2017 at 11:30 AM, kant kodali wrote: > I have a bunch of .index and .data files like that fill up my disk. I am > not sure what the fix is? I am running spark

Decompressing Spark Eventlogs with snappy (*.snappy)

2017-01-25 Thread satishl
Our spark job eventlogs are stored in compressed .snappy format. What do I need to do to decompress these files programmatically?
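A hedged sketch of reading such a file, assuming the event logs were written with Spark's snappy codec, which uses snappy-java's stream format; the path is a placeholder and each decoded line is one JSON event:

    import java.io.FileInputStream
    import org.xerial.snappy.SnappyInputStream
    import scala.io.Source

    val in = new SnappyInputStream(new FileInputStream("/path/to/eventlog.snappy"))
    Source.fromInputStream(in).getLines().foreach(println)
    in.close()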

spark intermediate data fills up the disk

2017-01-25 Thread kant kodali
I have a bunch of .index and .data files that fill up my disk. I am not sure what the fix is. I am running Spark 2.0.2 in standalone mode. Thanks!

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
I noticed that 1 wouldn't be a problem, because you'll save the BloomFilter in the state. For 2, you would keep a Map of UUID's to the timestamp of when you saw them. If the UUID exists in the map, then you wouldn't increase the count. If the timestamp of a UUID expires, you would remove it from

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
In the previous email you gave me 2 solutions: 1. Bloom filter --> problem in repopulating the bloom filter on restarts; 2. keeping the state of the unique ids. Please elaborate on 2. On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz wrote: > I don't have any sample code, but on a

Spark Summit East in Boston ‒ 20% off Code

2017-01-25 Thread Scott walent
There’s less than two weeks to go until Spark Summit East 2017, happening February 7-9 at the Hynes Convention Center in downtown Boston. It will be the largest Spark Summit conference ever held on the East Coast, and we hope to see you there. Sign up at https://spark-summit.org/east-2017

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
I don't have any sample code, but on a high level: My state would be: (Long, BloomFilter[UUID]) In the update function, my value will be the UUID of the record, since the word itself is the key. I'll ask my BloomFilter if I've seen this UUID before. If not increase count, also add to Filter.
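A hedged sketch of that state layout with mapWithState; the filter sizing is illustrative and org.apache.spark.util.sketch.BloomFilter stands in for whatever filter implementation is actually used:

    import org.apache.spark.streaming.State
    import org.apache.spark.util.sketch.BloomFilter

    // State per word: (count so far, filter of UUIDs already counted).
    def update(word: String, uuid: Option[String],
               state: State[(Long, BloomFilter)]): (String, Long) = {
      val (count, filter) = state.getOption.getOrElse(
        (0L, BloomFilter.create(1000000L, 0.01))) // expected items, false-positive rate
      val newCount = uuid match {
        case Some(id) if !filter.mightContain(id) =>
          filter.put(id) // remember the UUID so replays do not double count
          count + 1
        case _ => count
      }
      state.update((newCount, filter))
      (word, newCount)
    }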

Re: printSchema showing incorrect datatype?

2017-01-25 Thread Koert Kuipers
should we change "def schema" to show the materialized schema? On Wed, Jan 25, 2017 at 1:04 PM, Michael Armbrust wrote: > Encoders are just an object based view on a Dataset. Until you actually > materialize and object, they are not used and thus will not change the >

Catalyst Expression(s) - Cleanup

2017-01-25 Thread Bowden, Chris
Is there currently any way to receive a signal when an Expression will no longer receive any rows so internal resources can be cleaned up? I have seen Generators are allowed to terminate() but my Expression(s) do not need to emit 0..N rows.

Re: printSchema showing incorrect datatype?

2017-01-25 Thread Michael Armbrust
Encoders are just an object based view on a Dataset. Until you actually materialize an object, they are not used and thus will not change the schema of the dataframe. On Tue, Jan 24, 2017 at 8:28 AM, Koert Kuipers wrote: > scala> val x = Seq("a", "b").toDF("x") > x:
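A minimal spark-shell-style sketch of the point (not the exact snippet from the thread): an extra column survives .as[...] untouched, and only disappears once objects are actually materialized by a map:

    case class Rec(x: String)

    val df = Seq(("a", 1), ("b", 2)).toDF("x", "y")
    val ds = df.as[Rec]
    ds.printSchema()                // still shows x and y: the encoder is only a view
    ds.map(identity).printSchema()  // Rec objects were materialized, so only x remains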

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread shyla deshpande
Hi Burak, Thanks for the response. Can you please elaborate on your idea of storing the state of the unique ids. Do you have any sample code or links I can refer to. Thanks On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz wrote: > Off the top of my head... (Each may have it's own

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
Off the top of my head... (Each may have its own issues) If upstream you add a uniqueId to all your records, then you may use a BloomFilter to approximate whether you've seen a row before. The problem I can see with that approach is how to repopulate the bloom filter on restarts. If you are certain

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
The references are vendor specific. Suggest contacting the vendor's mailing list for your PR. My initial interpretation of the HBase repository is that of Apache. Cheers On Wed, Jan 25, 2017 at 7:38 AM, Chetan Khatri wrote: > @Ted Yu, Correct but HBase-Spark module

Re: Failure handling

2017-01-25 Thread Erwan ALLAIN
I agree. We try-catch streamingContext.awaitTermination and when an exception occurs we stop the streaming context and System.exit(50) (equal to SparkUnhandledCode). Sounds ok. On Tuesday, January 24, 2017, Cody Koeninger wrote: > Can you identify the error case and call
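A hedged sketch of that pattern, where ssc is the application's StreamingContext and 50 follows the exit-code convention mentioned above:

    try {
      ssc.start()
      ssc.awaitTermination()
    } catch {
      case e: Exception =>
        // Stop streaming and the underlying SparkContext, then exit non-zero so a
        // supervisor or cluster manager treats the job as failed and can restart it.
        ssc.stop(stopSparkContext = true, stopGracefully = false)
        System.exit(50)
    }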

Re: HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
@Ted Yu, Correct, but the HBase-Spark module available in the HBase repository seems too old and the code is not optimized yet; I have already submitted a PR for the same. I don't know if it is clearly mentioned that it is now part of HBase itself, so people are committing to the older repo where

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Though no hbase release has the hbase-spark module, you can find the backport patch on HBASE-14160 (for Spark 1.6). You can build the hbase-spark module yourself. Cheers On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri wrote: > Hello Spark Community Folks, > >

Re: freeing up memory occupied by processed Stream Blocks

2017-01-25 Thread Andrew Milkowski
Hi Takeshi, thanks for the answer. It looks like spark would free up old RDDs; however, in the admin UI we see a Block ID corresponding to each receiver and a timestamp. For example, block input-0-1485275695898 is from receiver 0 and it was created at 1485275695898 (1/24/2017, 11:34:55 AM

Java heap error during matrix multiplication

2017-01-25 Thread Petr Shestov
Hi all! I'm using Spark 2.0.1 with two workers (one executor each) with 20Gb each, and run the following code:

    JavaRDD<MatrixEntry> entries = ...; // filling the data
    CoordinateMatrix cmatrix = new CoordinateMatrix(entries.rdd());
    BlockMatrix matrix = cmatrix.toBlockMatrix(100, 1000);
    BlockMatrix cooc =

Re: HBaseContext with Spark

2017-01-25 Thread Amrit Jangid
Hi Chetan, If you just need HBase data in Hive, you can use a Hive EXTERNAL TABLE with STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. Try this and see if your problem can be solved: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration Regards, Amrit. On Wed, Jan 25,
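A hedged sketch of the kind of DDL that wiki page describes, to be run from the Hive shell; the table name, column family, and column mapping are placeholders, and the hive-hbase-handler jars must be on the classpath:

    CREATE EXTERNAL TABLE hbase_hive_view(rowkey STRING, val STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
    TBLPROPERTIES ("hbase.table.name" = "my_hbase_table");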

HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
Hello Spark Community Folks, Currently I am using HBase 1.2.4 and Hive 1.2.1, and I am looking for bulk load from HBase to Hive. I have seen a couple of good examples at the HBase GitHub repo: https://github.com/apache/hbase/tree/master/hbase-spark If I would like to use HBaseContext with HBase 1.2.4,

Re: Scala Developers

2017-01-25 Thread Sean Owen
Yes, job postings are strongly discouraged on ASF lists, if not outright disallowed. You will sometimes see posts prefixed with [JOBS] that are tolerated, but here I would assume they are not. This particular project and list is so big that there is no job posting I can imagine that is relevant

Re: Scala Developers

2017-01-25 Thread Hyukjin Kwon
Just as a subscriber to this mailing list, I don't want to receive job recruiting emails and even had to make some effort to set a filter for them. I don't know the policy in detail, but it feels inappropriate to send them here, where, in my experience, Spark users usually ask questions and discuss

is it possible to read .mdb file in spark

2017-01-25 Thread Selvam Raman
-- Selvam Raman "Shun bribery; hold your head high"

Scala Developers

2017-01-25 Thread marcos rebelo
Hi all, I’m looking for Scala Developers willing to work in Berlin. We are working with Spark and AWS (the latest products being prototyped are Step Functions and Batch; older services include Lambda Functions, DynamoDB, ...) and building Data Products (JSON REST endpoints). We are responsible for choosing