Re: Filter on Grouped Data

2015-07-03 Thread Raghavendra Pandey
Why don't you apply the filter first and then group the data and run the aggregations? On Jul 3, 2015 1:29 PM, Megha Sridhar-Cynepia megha.sridh...@cynepia.com wrote: Hi, I have a Spark DataFrame object, which when trimmed, looks like, From To Subject
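A minimal sketch of that suggestion, assuming `df` is the DataFrame in question and using hypothetical column names from the snippet above (the predicate and aggregation are illustrative only):

    import org.apache.spark.sql.functions._

    // narrow the rows first, then group and aggregate what is left
    val result = df
      .filter(col("Subject") === "SEC Inquiry")      // hypothetical predicate
      .groupBy(col("From"))
      .agg(count(col("To")).as("messages"))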

ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

2015-07-03 Thread Kostas Kougios
I have this problem with a job. A random executor gets this ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM, almost always at the same point in the processing of the data. I am processing 1 million files with sc.wholeText. At around the 600,000th file, a container receives

Re: Streaming: updating broadcast variables

2015-07-03 Thread Raghavendra Pandey
You cannot update a broadcast variable; the change won't get reflected on the workers. On Jul 3, 2015 12:18 PM, James Cole ja...@binarism.net wrote: Hi all, I'm filtering a DStream using a function. I need to be able to change this function while the application is running (I'm polling a service to
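A common workaround is to keep a mutable reference on the driver and re-create the broadcast inside transform, which runs on the driver once per batch. A hedged sketch, where loadFilter() and filterChanged() are hypothetical stand-ins for the polled service and the filter it returns:

    // loadFilter(): assumed to return the current filter function T => Boolean
    var filterBc = ssc.sparkContext.broadcast(loadFilter())

    val filtered = stream.transform { rdd =>
      if (filterChanged()) {                                 // hypothetical poll result
        filterBc.unpersist()
        filterBc = rdd.sparkContext.broadcast(loadFilter())  // re-broadcast the new function
      }
      val bc = filterBc                                      // stable reference for this batch
      rdd.filter(x => bc.value(x))
    }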

Re: Optimizations

2015-07-03 Thread Raghavendra Pandey
It is part of Spark's basic design that it runs actions in different stages... Not sure you can achieve what you are looking for. On Jul 3, 2015 12:43 PM, Marius Danciu marius.dan...@gmail.com wrote: Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) It looks like

Re: Kryo fails to serialise output

2015-07-03 Thread Will Briggs
Kryo serialization is used internally by Spark for spilling or shuffling intermediate results, not for writing out an RDD as an action. Look at Sandy Ryza's examples for some hints on how to do this: https://github.com/sryza/simplesparkavroapp Regards, Will On July 3, 2015, at 2:45 AM,
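For reference, a rough sketch of the kind of approach that repo demonstrates, writing an RDD out through the Avro Hadoop output format rather than relying on Kryo (which Spark uses only for shuffle/spill); `schema`, `records` and the output path are assumptions:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance()
    AvroJob.setOutputKeySchema(job, schema)          // schema: the Avro Schema for a tweet

    records                                          // records: RDD[GenericRecord], an assumption
      .map(r => (new AvroKey[GenericRecord](r), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        "hdfs:///tmp/tweets.avro",                   // hypothetical path
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[GenericRecord]],
        job.getConfiguration)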

Spark-csv into labeled points with null values

2015-07-03 Thread Saif.A.Ellafi
Hello all, I am learning Scala Spark and going through some applications with data I have. Please allow me to ask a couple of questions: spark-csv: The data I have isn't malformed, but there are empty values in some rows, properly comma-separated and not caught by DROPMALFORMED mode. These
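A hedged sketch of one way to handle this, assuming an all-numeric CSV with the label in the first column (file name, column layout and the fill value are assumptions): read with spark-csv, replace the empty cells with df.na.fill, then build labeled points.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.functions.col

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("data.csv")                              // path is an assumption

    // fill the empty/null numeric cells, then force every column to double
    val numeric = df.na.fill(0.0)
      .select(df.columns.map(c => col(c).cast("double")): _*)

    val points = numeric.map { row =>
      val label = row.getDouble(0)                   // assumed: label in the first column
      val features = Vectors.dense((1 until row.length).map(row.getDouble).toArray)
      LabeledPoint(label, features)
    }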

Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
It’ll help to see the code or at least understand what transformations you’re using. Also, you have 15 nodes but are not using all of them, which means you may be losing data locality. You can see this in the Spark job UI if any tasks do not have node- or process-local locality. From: diplomatic Guru

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-03 Thread Ted Yu
Alternatively, setting spark.driver.extraClassPath should work. Cheers On Fri, Jul 3, 2015 at 2:59 AM, Steve Loughran ste...@hortonworks.com wrote: On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'm trying to start the thrift-server and passing it

Spark performance issue

2015-07-03 Thread diplomatic Guru
Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation and writes the results into HDFS. I've implemented the same logic in a Spark job. When I tried both jobs on the same datasets, I got different execution times, which is

Re: Spark Streaming broadcast to all keys

2015-07-03 Thread Silvio Fiorito
updateStateByKey will run for all keys, whether or not they have new data in a batch, so you should still be able to use it. On 7/3/15, 7:34 AM, micvog mich...@micvog.com wrote: UpdateStateByKey is useful but what if I want to perform an operation on all existing keys (not only the ones in
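A hedged, word-count-style sketch of what that looks like, assuming `pairs` is a DStream[(String, Long)]: the update function is invoked for every known key each batch, even when newValues is empty, so a "decay all keys" step can live there.

    val counts = pairs.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
      val updated = state.getOrElse(0L) + newValues.sum
      Some(math.max(updated - 1L, 0L))    // e.g. decrement every known word once per batch
    }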

Float type coercion on SparkR with hiveContext

2015-07-03 Thread Evgeny Sinelnikov
Hello, I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot

SparkSQL cache table with multiple replicas

2015-07-03 Thread David Sabater Dinter
Hi all, do you know if there is an option to specify how many replicas we want when caching a table in memory in the SparkSQL Thrift server? I have not seen any option so far, but I assumed there is one, since the Storage section of the UI shows that there is 1 x replica of your
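From the Scala side (not the Thrift server's CACHE TABLE statement, which is a separate question), a minimal sketch using a replicated storage level; the table name is an assumption:

    import org.apache.spark.storage.StorageLevel

    val df = sqlContext.table("my_table")       // hypothetical table name
    df.persist(StorageLevel.MEMORY_ONLY_2)      // keep 2 replicas of each cached partition
    df.count()                                  // materialize the cache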

Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-8817 On Fri, Jul 3, 2015 at 11:43 AM, Koert Kuipers ko...@tresata.com wrote: i see the relaxation to allow duplicate field names was done on purpose, since some data sources can have dupes due to case insensitive resolution. apparently the issue is

Experience with centralised logging for Spark?

2015-07-03 Thread Edward Sargisson
Hi all, I'm wondering if anybody has any experience with centralised logging for Spark - or has even felt that there was a need for it given the WebUI. At my organization we use Log4j2 and Flume as the front end of our centralised logging system. I was looking into modifying Spark to use that

Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Shivaram Venkataraman
You need to add -Psparkr to build the SparkR code. Shivaram On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you try: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package Thanks Best Regards On Fri, Jul 3, 2015 at 2:27 PM,

Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
I see the relaxation to allow duplicate field names was done on purpose, since some data sources can have dupes due to case-insensitive resolution. Apparently the issue is now dealt with during query analysis. Although this might work for SQL, it does not seem a good thing for DataFrame to me. It

Re: Spark SQL groupby timestamp

2015-07-03 Thread sim
@bastien, in those situations, I prefer to use Unix timestamps (millisecond or second granularity) because you can apply math operations to them easily. If you don't have a Unix timestamp, you can use unix_timestamp() from Hive SQL to get one with second granularity. Then grouping by hour
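A minimal sketch of that suggestion through a HiveContext (the table and column names are assumptions): bucket the second-granularity timestamp by hour and group on the bucket.

    val byHour = hiveContext.sql(
      """SELECT floor(unix_timestamp(event_time) / 3600) * 3600 AS hour_bucket,
        |       count(*) AS cnt
        |FROM logs
        |GROUP BY floor(unix_timestamp(event_time) / 3600) * 3600""".stripMargin)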

Re: Optimizations

2015-07-03 Thread Marius Danciu
Thanks for your feedback. Yes, I am aware of the stages design, and Silvio, what you are describing is essentially a map-side join, which is not applicable when both RDDs are quite large. It appears that in rdd.join(...).mapToPair(f), f is piggybacked inside the join stage (right in the reducers, I believe)
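If useful, a quick way to confirm where the stage boundary falls (a sketch in the Scala API; `rdd`, `other` and `f` are assumptions): narrow transformations chained after the join are pipelined into the join's reduce-side stage, which is the piggybacking described above.

    val joined = rdd.join(other)                 // wide dependency -> shuffle boundary
    val mapped = joined.mapPartitions(_.map(f))  // narrow -> runs in the same stage as the join's reduce side
    println(mapped.toDebugString)                // shuffle boundaries show up as indentation changes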

Continuous errors trying to start spark-shell

2015-07-03 Thread Mohamed Lrhazi
Hello, I am trying to just start spark-shell... it starts, the prompt appears, and then a never-ending (literally) stream of these log lines follows. What is it trying to do? Why is it failing? To start it I do: $ docker run -it ncssm/spark-base /spark/bin/spark-shell --master spark://

Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread khaledh
I'm writing a Spark Streaming application that uses RabbitMQ to consume events. One feature of RabbitMQ that I intend to make use of is bulk ack of messages, i.e. no need to ack one-by-one, but only ack the last event in a batch and that would ack the entire batch. Before I commit to doing so,

Re: SparkR and Spark MLlib

2015-07-03 Thread ayan guha
No. SparkR is the R language binding for Spark. MLlib is the machine learning project on top of Spark core. On 4 Jul 2015 12:23, praveen S mylogi...@gmail.com wrote: Hi, Are SparkR and Spark MLlib the same?

SparkR and Spark MLlib

2015-07-03 Thread praveen S
Hi, Are SparkR and Spark MLlib the same?

Re: How to timeout a task?

2015-07-03 Thread William Ferrell
Ted, Thanks very much for your reply. It took me almost a week but I have finally had a chance to implement what you noted and it appears to be working locally. However, when I launch this onto a cluster on EC2 -- this doesn't work reliably. To expand, I think the issue is that some of the code

Re: Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread Raghavendra Pandey
I don't think you can expect any ordering guarantee except for the records within one partition. On Jul 4, 2015 7:43 AM, khaledh khal...@gmail.com wrote: I'm writing a Spark Streaming application that uses RabbitMQ to consume events. One feature of RabbitMQ that I intend to make use of is bulk ack of

Re: Spark 1.4 MLLib Bug?: Multiclass Classification requirement failed: sizeInBytes was negative

2015-07-03 Thread Burak Yavuz
How many partitions do you have? It might be that one partition is too large and there is an integer overflow. Could you double your number of partitions? Burak On Fri, Jul 3, 2015 at 4:41 AM, Danny kont...@dannylinden.de wrote: Hi, I want to run a multiclass classification with 390 classes
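A quick way to try that suggestion (a sketch; `trainingData` is an assumed RDD[LabeledPoint] and the doubling factor is arbitrary): repartition so no single partition's serialized size approaches the 2 GB limit, then train as before.

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val repartitioned = trainingData.repartition(trainingData.partitions.length * 2)
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(390)
      .run(repartitioned)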

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Akhil Das
With the binary I think it might not be possible, but if you download the sources and build them yourself, then you can remove this function https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023 which initializes the SQLContext.

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
The main reason is Spark's startup time and the need to configure a component I don't really need (without configs the HiveContext takes more time to load). Thanks, Daniel On 3 July 2015, at 11:13, Robin East robin.e...@xense.co.uk wrote: As Akhil mentioned there isn't AFAIK any kind of

Optimizations

2015-07-03 Thread Marius Danciu
Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) It looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage? ... such that each result partition after the join is passed to the mapPartitionToPair

Re: duplicate names in sql allowed?

2015-07-03 Thread Akhil Das
I think you can open up a JIRA; not sure if this PR https://github.com/apache/spark/pull/2209/files (SPARK-2890 https://issues.apache.org/jira/browse/SPARK-2890) broke the validation piece. Thanks Best Regards On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote: i am

Filter on Grouped Data

2015-07-03 Thread Megha Sridhar- Cynepia
Hi, I have a Spark DataFrame object, which when trimmed, looks like (columns From, To, Subject, Message-ID): From: karen@xyz.com, To: ['vance.me...@enron.com', 'jeannie.mandel...@enron.com', Subject: SEC Inquiry, Message-ID: 19952575.1075858

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread ayan guha
HiveContext should be a superset of SQLContext, so you should be able to perform all your tasks. Are you facing any problem with HiveContext? On 3 Jul 2015 17:33, Daniel Haviv daniel.ha...@veracity-group.com wrote: Thanks I was looking for a less hack-ish way :) Daniel On Fri, Jul 3, 2015 at

build spark 1.4 source code for sparkR with maven

2015-07-03 Thread 1106944...@qq.com
Hi all, has anyone built the Spark 1.4 source code for SparkR with Maven/sbt? What's the command? To use SparkR you must build version 1.4 from source. Thank you 1106944...@qq.com

[spark1.4] sparkContext.stop causes exception on Mesos

2015-07-03 Thread Ayoub
Hello Spark developers, after upgrading to Spark 1.4 on Mesos 0.22.1, existing code started to throw this exception when calling sparkContext.stop: (SparkListenerBus) [ERROR - org.apache.spark.Logging$class.logError(Logging.scala:96)] Listener EventLoggingListener threw an exception

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
Thanks I was looking for a less hack-ish way :) Daniel On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com wrote: With binary i think it might not be possible, although if you can download the sources and then build it then you can remove this function

Spark 1.4 MLLib Bug?: Multiclass Classification requirement failed: sizeInBytes was negative

2015-07-03 Thread Danny Linden
Hi, I want to run a multiclass classification with 390 classes on 120k labeled points (tf-idf vectors), but I get the following exception. If I reduce the number of classes to ~20, everything works fine. How can I fix this? I use the LogisticRegressionWithLBFGS class for my classification on an 8

Kryo fails to serialise output

2015-07-03 Thread Dominik Hübner
I have a rather simple Avro schema to serialize tweets (message, username, timestamp). Kryo and Twitter chill are used to do so. For my dev environment the Spark context is configured as below: val conf: SparkConf = new SparkConf() conf.setAppName("kryo_test") conf.setMaster("local[4]")
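For context, a hedged completion of that configuration with Kryo enabled and the class registered; `Tweet` stands in for the Avro-generated class and is an assumption:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kryo_test")
      .setMaster("local[4]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Tweet]))   // Tweet: assumed Avro-generated class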

Re: Spark Dataframe 1.4 (GroupBy partial match)

2015-07-03 Thread Suraj Shetiya
Hi Salih, thanks for the links :) This seems very promising to me. When do you think this would be available in the Spark codebase? Thanks, Suraj On Fri, Jul 3, 2015 at 2:02 AM, Salih Oztop soz...@yahoo.com wrote: Hi Suraj, It seems your requirement is Record Linkage/Entity Resolution.

Streaming: updating broadcast variables

2015-07-03 Thread James Cole
Hi all, I'm filtering a DStream using a function. I need to be able to change this function while the application is running (I'm polling a service to see if a user has changed their filtering). The filter function is a transformation and runs on the workers, so that's where the updates need to

Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
In the driver when running spark-submit with --master yarn-client On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com wrote: Where does it returns null? Within the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-03 Thread Steve Loughran
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv daniel.ha...@veracity-group.commailto:daniel.ha...@veracity-group.com wrote: Hi, I'm trying to start the thrift-server and passing it azure's blob storage jars but I'm failing on : Caused by: java.io.IOException: No FileSystem for scheme: wasb

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Can you paste the code? Something is missing Thanks Best Regards On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote: In the driver when running spark-submit with --master yarn-client On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com wrote: Where does

Multiple Join Conditions in dataframe join

2015-07-03 Thread bipin
Hi, I need to join with multiple conditions. Can anyone tell me how to specify that? For example, this is what I am trying to do: val Lead_all = Leads. | join(Utm_Master, Leaddetails.columns(LeadSource, Utm_Source, Utm_Medium, Utm_Campaign) ==
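A sketch of the column-expression form that DataFrame.join accepts in Spark 1.4, combining the equalities with && (the column names follow the snippet above and are otherwise assumptions):

    val Lead_all = Leads.join(Utm_Master,
      Leads("LeadSource")   === Utm_Master("LeadSource")   &&
      Leads("Utm_Source")   === Utm_Master("Utm_Source")   &&
      Leads("Utm_Medium")   === Utm_Master("Utm_Medium")   &&
      Leads("Utm_Campaign") === Utm_Master("Utm_Campaign"))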

Accessing the console from spark

2015-07-03 Thread Jem Tucker
Hi, We have an application that requires a username/password to be entered from the command line. To mask a password in Java you need to use System.console.readPassword; however, when running with Spark, System.console returns null. Any ideas on how to get the console from Spark? Thanks, Jem

Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Akhil Das
Did you try: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package Thanks Best Regards On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com 1106944...@qq.com wrote: Hi all, Anyone build spark 1.4 source code for sparkR with maven/sbt, what's comand ? using

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Where does it return null? Within the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, We have an application that requires a username/password to

Re: Accessing the console from spark

2015-07-03 Thread Jem Tucker
I have shown two scenarios below: // setup spark context val user = readLine("username: ") val pass = System.console.readPassword("password: ") - null pointer exception here and // setup spark context val user = readLine("username: ") val console = System.console - null pointer exception here
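A hedged fallback sketch (not the original code): when System.console() is null because the launcher did not attach a console, read from stdin instead, accepting that the password will echo.

    val console = System.console()
    val pass: String =
      if (console != null) {
        new String(console.readPassword("password: "))
      } else {
        // spark-submit's launcher may not attach a console; fall back to stdin (echoes!)
        print("password: ")
        new java.util.Scanner(System.in).nextLine()
      }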

Spark Streaming broadcast to all keys

2015-07-03 Thread micvog
UpdateStateByKey is useful, but what if I want to perform an operation on all existing keys (not only the ones in this RDD)? Word count, for example - is there a way to decrease *all* words seen so far by 1? I was thinking of keeping a static class per node with the count information and issuing a

Spark 1.4 MLLib Bug?: Multiclass Classification requirement failed: sizeInBytes was negative

2015-07-03 Thread Danny
Hi, I want to run a multiclass classification with 390 classes on 120k labeled points (tf-idf vectors), but I get the following exception. If I reduce the number of classes to ~20, everything works fine. How can I fix this? I use the LogisticRegressionWithLBFGS class for my classification on an 8