Spark Streaming stateful operation to HBase

2016-06-08 Thread soumick dasgupta
Hi, I am using mapWithState to keep the state and then output the result to HBase. The problem I am facing is that when no files arrive, the RDD still emits the previous state result due to the checkpoint. Is there a way I can avoid writing that result to HBase, i.e., when the
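A hypothetical workaround sketch, assuming the input is a keyed DStream[(String, Long)] named inputStream: carry an "updated in this batch" flag in the mapped value and filter on it before writing to HBase, so rows that come only from the checkpointed state are skipped.

    import org.apache.spark.streaming.{State, StateSpec}

    val spec = StateSpec.function {
      (key: String, value: Option[Long], state: State[Long]) =>
        val updatedInBatch = value.isDefined          // false when no new data arrived for this key
        val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
        if (updatedInBatch) state.update(total)
        (key, total, updatedInBatch)
    }

    inputStream.mapWithState(spec)
      .filter { case (_, _, updatedInBatch) => updatedInBatch }   // drop state-only rows
      .foreachRDD { rdd =>
        // write only the freshly updated rows to HBase here
      }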

Re: Spark Partition by Columns doesn't work properly

2016-06-08 Thread Jasleen Kaur
The github repo is https://github.com/datastax/spark-cassandra-connector The talk video and slides should be uploaded soon on the Spark Summit website On Wednesday, June 8, 2016, Chanh Le wrote: > Thanks, I'll look into it. Any luck to get link related to. > > On Thu, Jun 9,

Re: Spark Partition by Columns doesn't work properly

2016-06-08 Thread Chanh Le
Thanks, I'll look into it. Any luck getting a link related to it? On Thu, Jun 9, 2016, 12:43 PM Jasleen Kaur wrote: > Try using the datastax package. There was a great talk on spark summit > about it. It will take care of the boiler plate code and you can focus on > real

Re: Spark Partition by Columns doesn't work properly

2016-06-08 Thread Jasleen Kaur
Try using the datastax package. There was a great talk at Spark Summit about it. It will take care of the boilerplate code and you can focus on real business value On Wednesday, June 8, 2016, Chanh Le wrote: > Hi everyone, > I tested the partition by columns of data frame

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi, I've set these properties both in core-site.xml and hdfs-site.xml with no luck. Thank you. Daniel > On 9 Jun 2016, at 01:11, Steve Loughran wrote: > > >> On 8 Jun 2016, at 16:34, Daniel Haviv >> wrote: >> >> Hi, >> I'm trying to

Spark Partition by Columns doesn't work properly

2016-06-08 Thread Chanh Le
Hi everyone, I tested partitioning a DataFrame by columns but the result looks wrong to me. I am using Spark 1.6.1 and load the data from Cassandra. When I repartition by 2 fields (date, network_id) I get 200 partitions, and when I repartition by 1 field (date) I also get 200 partitions, but my data only covers 90 days -> I mean if we
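For reference, a sketch of the two APIs involved in Spark 1.6, assuming df is the DataFrame loaded from Cassandra; the 200 partitions come from the spark.sql.shuffle.partitions default, not from the number of distinct column values.

    import org.apache.spark.sql.functions.col

    val byDate   = df.repartition(col("date"))        // hash-partitioned into spark.sql.shuffle.partitions (200 by default)
    val byDate90 = df.repartition(90, col("date"))    // explicit partition count

    // One output directory per date/network_id on disk is a write-time concern:
    df.write.partitionBy("date", "network_id").parquet("/path/out")   // output path is a placeholder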

Re: UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Koert Kuipers
You can try passing in an explicit encoder: org.apache.spark.sql.Encoders.kryo[Set[com.wix.accord.Violation]] Although this might only be available in Spark 2, I don't remember off the top of my head... On Wed, Jun 8, 2016 at 11:57 PM, Koert Kuipers wrote: > Sets are not supported.
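A minimal sketch of that suggestion, assuming an RDD[Set[com.wix.accord.Violation]] named rdd built upstream and that Encoders.kryo is available in the Spark version in use.

    import org.apache.spark.sql.Encoders

    implicit val violationSetEncoder =
      Encoders.kryo[Set[com.wix.accord.Violation]]

    // The explicit Kryo encoder avoids the need for a built-in encoder for Set.
    val ds = sqlContext.createDataset(rdd)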

Re: UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Koert Kuipers
Sets are not supported. You basically need to stick to products (tuples, case classes), Seq and Map (and in Spark 2 also Option). Or you can resort to the Kryo-based encoder. On Wed, Jun 8, 2016 at 3:45 PM, Peter Halliday wrote: > I have some code that was producing
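A sketch of the first option under stated assumptions (ValidationResult and its fields are hypothetical names; rdd is the original RDD): restructure the data into a case class whose fields use the supported types.

    case class ValidationResult(id: Long, violations: Seq[String])

    import sqlContext.implicits._

    val ds = rdd
      .map(r => ValidationResult(r.id, r.violations.toSeq.map(_.toString)))   // Set -> Seq
      .toDS()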

Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Mich Talebzadeh
The other way is to log in to the individual nodes and run jps: 24819 Worker and you will see the processes identified as Worker. You can also use jmonitor to see what they are doing resource-wise. You can of course write a small shell script to check that the Worker(s) are up and running on every node and alert if

Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Rutuja Kulkarni
Thank you for the quick response. So the workers section would list all the running worker nodes in the standalone Spark cluster? I was also wondering if this is the only way to retrieve worker nodes or is there something like a Web API or CLI I could use? Thanks. Regards, Rutuja On Wed, Jun 8,
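One option worth checking, as a sketch: the standalone master also serves its status, including the list of workers, as JSON on the web UI port (the host name below is a placeholder and 8080 is the default port).

    import scala.io.Source

    val masterHost = "spark-master-host"
    val status = Source.fromURL(s"http://$masterHost:8080/json").mkString
    println(status)   // contains a "workers" array with state, cores and memory per worker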

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hi, Just to clarify: I use Hive with the Spark engine (default), i.e. Hive on Spark, as we discussed and observed. Now, with regard to Spark (as an application, NOT an execution engine) doing the create table in Hive and populating it, I don't think Spark itself does any transactional enforcement. This means

Spark 2.0 Streaming and Event Time

2016-06-08 Thread Chang Lim
Hi All, Does Spark 2.0 Streaming [sqlContext.read.format(...).stream(...)] support Event Time? In TD's Spark Summit talk yesterday, this is listed as a 2.0 feature. If so, where is the API or how do I set it? Thanks in advance, Chang

Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Mich Talebzadeh
Check port 8080 on the node where you started start-master.sh. HTH Dr Mich Talebzadeh

[ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Rutuja Kulkarni
Hello! I'm trying to set up a standalone Spark cluster and wondering how to track the status of all of its nodes. I wonder if something like the YARN REST API or HDFS CLI exists in the Spark world that can provide the status of nodes on such a cluster. Any pointers would be greatly appreciated. -- Regards,

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
OK, this seems to work: 1. Create the target table first 2. Populate it afterwards I first created the target table with hive> create table test.dummy as select * from oraclehadoop.dummy where 1 = 2; Then did the INSERT/SELECT and tried to drop the target table while the DML (INSERT/SELECT) was
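The same two-step pattern, sketched as it would be driven from Spark, assuming a HiveContext named hiveContext and that both databases already exist.

    hiveContext.sql(
      "CREATE TABLE test.dummy AS SELECT * FROM oraclehadoop.dummy WHERE 1 = 2")
    hiveContext.sql(
      "INSERT INTO TABLE test.dummy SELECT * FROM oraclehadoop.dummy")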

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel
> On Jun 8, 2016, at 3:35 PM, Eugene Koifman wrote: > > if you split “create table test.dummy as select * from oraclehadoop.dummy;” > into create table statement, followed by insert into test.dummy as select… > you should see the behavior you expect with Hive. > Drop

Re: Write Ahead Log

2016-06-08 Thread Mohit Anchlia
Is there any specific reason why this feature is only supported in streaming? On Wed, Jun 8, 2016 at 3:24 PM, Ted Yu wrote: > There was a minor typo in the name of the config: > > spark.streaming.receiver.writeAheadLog.enable > > Yes, it only applies to Streaming. > > On

Re: Write Ahead Log

2016-06-08 Thread Ted Yu
There was a minor typo in the name of the config: spark.streaming.receiver.writeAheadLog.enable Yes, it only applies to Streaming. On Wed, Jun 8, 2016 at 3:14 PM, Mohit Anchlia wrote: > Is something similar to park.streaming.receiver.writeAheadLog.enable > available on
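A minimal sketch of where that flag goes for a streaming application (the app name and checkpoint path are placeholders); the receiver WAL also requires a checkpoint directory.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("wal-example")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/wal-checkpoint")   // received data is logged under the checkpoint dir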

Write Ahead Log

2016-06-08 Thread Mohit Anchlia
Is something similar to park.streaming.receiver.writeAheadLog.enable available on SparkContext? It looks like it only works for spark streaming.

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hive version is 2. We can discuss all sorts of scenarios. However, Hive is pretty good at applying the locks at both the table and partition level. The idea of having the metadata is to enforce these rules. For example, above, inserting from source to target table

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Steve Loughran
On 8 Jun 2016, at 16:34, Daniel Haviv wrote: Hi, I'm trying to create a table on s3a but I keep hitting the following error: Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException:

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel
Doh! It would help if I used the email address to send to the list… Hi, Let's take a step back… Which version of Hive? Hive recently added transaction support, so you have to know your isolation level. Also, are you running Spark as your execution engine, or are you talking about a Spark

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hi, The idea of accessing the Hive metadata is to be aware of concurrency. In general, if I do the following in Hive hive> create table test.dummy as select * from oraclehadoop.dummy; we can see that Hive applies the locks. However, there seems to be an

Variable in UpdateStateByKey Not Updating After Restarting Application from Checkpoint

2016-06-08 Thread Joe Panciera
I've run into an issue where a global variable used within an UpdateStateByKey function isn't being assigned after the application restarts from a checkpoint. Using ForEachRDD I have a global variable 'A' that is propagated from a file every time a batch runs, and A is then used in an

UDTRegistration

2016-06-08 Thread pgrandjean
Hi, I discovered the following scala object on the master branch: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UDTRegistration.scala It is currently a private object. In which Apache Spark version is it planned to be released as public?

UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Peter Halliday
I have some code that was producing OOM during shuffle and was RDD-based. So, upon direction from a member of Databricks, I started converting to Datasets. However, when we did, we got an error that seems to be complaining about something within one of our case classes. Peter Halliday [2016-06-08

RE: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread David Newberger
Could you be looking at 2 jobs trying to use the same file and one getting to it before the other and finally removing it? David Newberger From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Wednesday, June 8, 2016 1:33 PM To: user; user @spark Subject: Creating a Hive table through

Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
Hi, I noticed an issue with Spark creating and populating a Hive table. The process, as I see it, is as follows: 1. Spark creates the Hive table, in this case an ORC table in a Hive database 2. Spark uses a JDBC connection to get data out of Oracle 3. I create a temp table in Spark

Re: When queried through hiveContext, does Hive execute these queries using its execution engine (default is map-reduce), or does Spark just read the data and perform those queries itself?

2016-06-08 Thread lalit sharma
To add to what Vikash said above, a bit more on the internals: 1. There are 2 components which work together to achieve Hive + Spark integration a. HiveContext, which extends SQLContext, adds Hive-specific logic, e.g. loading jars to talk to the underlying metastore DB, loading configs in

Re: Dealing with failures

2016-06-08 Thread Mohit Anchlia
On Wed, Jun 8, 2016 at 3:42 AM, Jacek Laskowski wrote: > On Wed, Jun 8, 2016 at 2:38 AM, Mohit Anchlia > wrote: > > I am looking to write an ETL job using spark that reads data from the > > source, perform transformation and insert it into the

Re: When queried through hiveContext, does Hive execute these queries using its execution engine (default is map-reduce), or does Spark just read the data and perform those queries itself?

2016-06-08 Thread Vikash Pareek
Himanshu, Spark doesn't use the Hive execution engine (MapReduce) to execute the query. Spark only reads the metadata from the Hive metastore DB and executes the query within the Spark execution engine. This metadata is used by Spark's own SQL execution engine (this includes components such as Catalyst,
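A minimal sketch of what that looks like in practice (the table name is hypothetical): only the metastore is consulted for metadata, and the query runs on Spark's engine.

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)   // sc: the existing SparkContext
    val df = hiveContext.sql(
      "SELECT network_id, count(*) AS cnt FROM logs GROUP BY network_id")
    df.show()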

ChiSqSelector Selected Features Indices

2016-06-08 Thread Sebastian Kuepers
Hi there, what is the best way to get, from pyspark.mllib.feature.ChiSqSelector(numTopFeatures), the indices of the selected features in the original input vector? Shouldn't the model contain this information? Thanks!
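A sketch using the Scala API, where the fitted model exposes the chosen indices directly; whether the Python wrapper surfaces this field depends on the Spark version.

    import org.apache.spark.mllib.feature.ChiSqSelector

    val selector = new ChiSqSelector(50)          // numTopFeatures = 50
    val model = selector.fit(labeledPoints)       // labeledPoints: RDD[LabeledPoint], built upstream
    val indices: Array[Int] = model.selectedFeatures
    println(indices.mkString(", "))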

Re: Training a spark ml linear regression model fails after migrating from 1.5.2 to 1.6.1

2016-06-08 Thread philippe v
Here is a gist with the minimal code and data: http://gist.github.com/anonymous/aca8ba5841404ea092f9efcc658c5d57

HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi, I'm trying to create a table on s3a but I keep hitting the following error: Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain) I
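One more place worth trying, sketched under assumptions (the environment variable names and bucket are placeholders; depending on where the metastore client runs, this may or may not reach it): setting the s3a keys on the SparkContext's Hadoop configuration.

    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql(
      "CREATE TABLE t (id INT) LOCATION 's3a://my-bucket/warehouse/t'")   // bucket is hypothetical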

Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-08 Thread Jacek Laskowski
Hi, I just noticed today, while toying with Spark 2.0.0 (today's build), that doing Seq(...).toDF does **not** submit a Spark job while sc.parallelize(Seq(...)).toDF does. I was nicely surprised and have been thinking about the reason for the behaviour. My explanation was that Datasets are just a
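The two cases side by side, as a sketch against a Spark 2.0 SparkSession named spark (column names are arbitrary); a local Seq is planned as a LocalRelation, which the planner can answer without a job.

    import spark.implicits._

    val localDF = Seq((1, "a"), (2, "b")).toDF("id", "name")            // no job submitted
    val rddDF = spark.sparkContext
      .parallelize(Seq((1, "a"), (2, "b")))
      .toDF("id", "name")                                               // submits a job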

Re: comparaing row in pyspark data frame

2016-06-08 Thread Jacek Laskowski
On Wed, Jun 8, 2016 at 2:05 PM, pseudo oduesp wrote: > how can we compare columns to get the max of a row (not of a column) and get the name of > the column where the max is present? First thought - a UDF. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache
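A Scala sketch of the UDF idea (the column names col1..col3 are assumptions; the same approach translates to PySpark): compute the row-wise max with greatest and recover the column name with a small UDF over an array of the columns.

    import org.apache.spark.sql.functions._

    val cols = Seq("col1", "col2", "col3")

    val maxColName = udf { (values: Seq[Double]) =>
      cols(values.zipWithIndex.maxBy(_._1)._2)    // name of the column holding the row max
    }

    val result = df
      .withColumn("max_value", greatest(cols.map(col): _*))
      .withColumn("max_column", maxColName(array(cols.map(col): _*)))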

When queried through hiveContext, does Hive execute these queries using its execution engine (default is map-reduce), or does Spark just read the data and perform those queries itself?

2016-06-08 Thread Himanshu Mehra
So what happens underneath when we query a Hive table using hiveContext? 1. Does Spark talk to the metastore to get the data location on HDFS and read the data from there to perform those queries? 2. Or does Spark pass those queries to Hive, and Hive executes those queries on the table and returns the

RE: GraphX Java API

2016-06-08 Thread Felix Cheung
You might want to check out GraphFrames graphframes.github.io On Sun, Jun 5, 2016 at 6:40 PM -0700, "Santoshakhilesh" wrote: Ok, thanks for letting me know. Yes, since Java and Scala programs ultimately run on the JVM, the APIs written in one language can

Re: SQL JSON array operations

2016-06-08 Thread amalik
Hi @jvuillermet, I am encountering a similar problem. Did you manage to figure out parsing of complicated unstructured JSON files?

Re: Analyzing twitter data

2016-06-08 Thread Jörn Franke
You can directly load it into Solr. But think about what you want to index, etc. > On 08 Jun 2016, at 15:51, Mich Talebzadeh wrote: > > yes. use that is reasonable. > > What is the format of twitter data. Is that primarily json.? > > If I do > > duser@rhes564:

Re: Analyzing twitter data

2016-06-08 Thread Mich Talebzadeh
Yes, using that is reasonable. What is the format of the twitter data? Is it primarily JSON? If I do duser@rhes564: /usr/lib/nifi-0.6.1/conf> hdfs dfs -cat /twitter_data/FlumeData.1464945101915|more 16/06/08 14:48:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
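If the files do turn out to be line-delimited JSON, a sketch of reading them directly (the path is the one from the command above; the selected fields assume the standard Twitter status JSON).

    val tweets = sqlContext.read.json("hdfs:///twitter_data/FlumeData.*")
    tweets.printSchema()
    tweets.select("user.screen_name", "text", "created_at").show(5)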

Re: comparaing row in pyspark data frame

2016-06-08 Thread Ted Yu
Do you mean returning col3 and 0.4 for the example row below? > On Jun 8, 2016, at 5:05 AM, pseudo oduesp wrote: > > Hi, > how can we compare multiple columns in a dataframe? I mean, > > if df is a dataframe like this: > > df.col1 | df.col2 |

Re: Analyzing twitter data

2016-06-08 Thread Jörn Franke
That is trivial to do; I did it once when they were in JSON format. > On 08 Jun 2016, at 13:15, Mich Talebzadeh wrote: > > Interesting. There is also apache nifi > > Also I note that one can store twitter data in Hive tables as well? > > > > Dr Mich Talebzadeh >

Re: Analyzing twitter data

2016-06-08 Thread Mich Talebzadeh
Interesting. There is also Apache NiFi. Also, I note that one can store twitter data in Hive tables as well? Dr Mich Talebzadeh

Re: oozie and spark on yarn

2016-06-08 Thread vaquar khan
Hi Karthi, Hope the following information will help you. Doc: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html Example: https://developer.ibm.com/hadoop/2015/11/05/run-spark-job-yarn-oozie/ Code:

Re: Dealing with failures

2016-06-08 Thread Jacek Laskowski
On Wed, Jun 8, 2016 at 2:38 AM, Mohit Anchlia wrote: > I am looking to write an ETL job using spark that reads data from the > source, perform transformation and insert it into the destination. Is this going to be one-time job or you want it to run every time interval? >

OneVsRest SVM - Very Low F-Measure compared to OneVsRest Logistic Regression

2016-06-08 Thread Hayri Volkan Agun
Hi, I built a transformer model for Spark SVM binary classification. I basically implemented the predictRaw method of the classifier and classification model from the Spark API. override def predictRaw(dataMatrix: Vector):Vector = { val m = weights.toBreeze.dot(dataMatrix.toBreeze) + intercept
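A completed sketch of the predictRaw fragment above, as it would sit inside the custom model class: return one raw score per class so OneVsRest can take the argmax. Note that margins of independently trained SVMs are not calibrated against each other, which can depress OneVsRest F-measure compared to logistic regression's probabilistic scores.

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    override def predictRaw(dataMatrix: Vector): Vector = {
      val margin = weights.toBreeze.dot(dataMatrix.toBreeze) + intercept
      Vectors.dense(-margin, margin)   // raw score for class 0, class 1
    }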

Re: Training a spark ml linear regression model fails after migrating from 1.5.2 to 1.6.1

2016-06-08 Thread Jacek Laskowski
Hi, Am I the only one who does *not* see the snippets? Could you please gist 'em => https://gist.github.com ? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Jun 8,

Training a spark ml linear regression model fails after migrating from 1.5.2 to 1.6.1

2016-06-08 Thread philippe v
I use spark-ml to train a linear regression model. It worked perfectly with Spark version 1.5.2, but now with 1.6.1 I get the following error: Here is a minimal code: And the input.csv data and the pom.xml. How can I fix it?

Re: Spark 2.0 Release Date

2016-06-08 Thread Jacek Laskowski
Whoohoo! What great news! Looks like an RC is coming... Thanks a lot, Reynold! Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Jun 8, 2016 at 7:55 AM, Reynold

oozie and spark on yarn

2016-06-08 Thread pseudo oduesp
Hi, I want to ask if someone has used Oozie with Spark? If you can, give me an example: how can we configure it on YARN? Thanks

Spark streaming micro batch failure handling

2016-06-08 Thread aviemzur
Hi, A question about Spark Streaming's handling of a failed micro batch. After a certain number of task failures, there are no more retries, and the entire batch fails. What seems to happen next is that this batch is ignored and the next micro batch begins, which means not all the data has been