Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-22 Thread scar scar
Sorry I was on vacation for a few days. Yes, it is on. This is what I have in the logs: 15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x14dd82e22f70ef1, likely server has closed socket, closing socket connection and attempting reconnect 15/06/22 10:44:00

Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Silvio Fiorito
Yes, just put the Cassandra connector on the Spark classpath and set the connector config properties in the interpreter settings. From: Mohammed Guller Date: Monday, June 22, 2015 at 11:56 AM To: Matthew Johnson, shahid ashraf Cc: user@spark.apache.org Subject: RE:

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Mohammed Guller
I haven’t tried using Zeppelin with Spark on Cassandra, so can’t say for sure, but it should not be difficult. Mohammed From: Matthew Johnson [mailto:matt.john...@algomi.com] Sent: Monday, June 22, 2015 2:15 AM To: Mohammed Guller; shahid ashraf Cc: user@spark.apache.org Subject: RE: Code

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Matthew Johnson
Hi Pawan, Looking at the changes for that git pull request, it looks like it just pulls in the dependency (and transitives) for “spark-cassandra-connector”. Since I am having to build Zeppelin myself anyway, would it be ok to just add this myself for the connector for 1.4.0 (as found here

Help optimising Spark SQL query

2015-06-22 Thread James Aley
Hello, A colleague of mine ran the following Spark SQL query: select count(*) as uses, count (distinct cast(id as string)) as users from usage_events where from_unixtime(cast(timestamp_millis/1000 as bigint)) between '2015-06-09' and '2015-06-16' The table contains billions of rows, but

RE: Help optimising Spark SQL query

2015-06-22 Thread Matthew Johnson
Hi James, What version of Spark are you using? In Spark 1.2.2 I had an issue where Spark would report a job as complete but I couldn’t find my results anywhere – I just assumed it was me doing something wrong as I am still quite new to Spark. However, since upgrading to 1.4.0 I have not seen

Support for Windowing and Analytics functions in Spark SQL

2015-06-22 Thread Sourav Mazumder
Hi, Though the documentation does not explicitly mention support for Windowing and Analytics functions in Spark SQL, it looks like they are not supported. I tried running a query like Select Lead(column name, 1) over (Partition By column name order by column name) from table name and I got an error saying

Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread pawan kumar
Hi, Zeppelin has a cassandra-spark-connector built into the build. I have not tried it yet may be you could let us know. https://github.com/apache/incubator-zeppelin/pull/79 To build a Zeppelin version with the *Datastax Spark/Cassandra connector

Re: Multiple executors writing file using java filewriter

2015-06-22 Thread anshu shukla
Running perfectly in local system but not writing to file in cluster mode. Any suggestions please? //msgid is a long counter JavaDStream<String> newinputStream = inputStream.map(new Function<String, String>() { @Override public String call(String v1) throws Exception { String

Re: Web UI vs History Server Bugs

2015-06-22 Thread Jonathon Cai
No, what I'm seeing is that while the cluster is running, I can't see the app info after the app is completed. That is to say, when I click on the application name on master:8080, no info is shown. However, when I examine the same file on the History Server, the application information opens fine.

Re: Registering custom metrics

2015-06-22 Thread Dmitry Goldenberg
Great, thank you, Silvio. In your experience, is there any way to instrument a callback into Coda Hale or the Spark consumers from the metrics sink? If the sink performs some steps once it has received the metrics, I'd like to be able to make the consumers aware of that via some sort of a

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-22 Thread Nitin kak
Any response to this guys? On Fri, Jun 19, 2015 at 2:34 PM, Nitin kak nitinkak...@gmail.com wrote: Any other suggestions guys? On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak nitinkak...@gmail.com wrote: With Sentry, only hive user has the permission for read/write/execute on the subdirectories

Re: Does HiveContext connect to HiveServer2?

2015-06-22 Thread nitinkak001
Hey, I have exactly this question. Did you get an answer to it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200p23431.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Multiple executors writing file using java filewriter

2015-06-22 Thread anshu shukla
Can not we write some data to a txt file in parallel with multiple executors running in parallel ?? -- Thanks Regards, Anshu Shukla

Re: Task Serialization Error on DataFrame.foreachPartition

2015-06-22 Thread Ted Yu
private HTable table; You should declare the table variable within the apply() method. BTW which HBase release are you using? I see you implement caching yourself. You can make use of the following HTable method: public void setWriteBufferSize(long writeBufferSize) throws
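
A hedged Scala sketch of the pattern suggested above: build the HTable inside the partition closure on the executor and let its write buffer do the batching. The DataFrame df, the table name, the column family and the HBase 0.98-era client calls are assumptions, not taken from the thread.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    df.foreachPartition { rows =>
      // HTable is not serializable, so create it here on the executor
      val conf = HBaseConfiguration.create()
      val table = new HTable(conf, "my_table")       // hypothetical table name
      table.setAutoFlush(false, false)               // let the client-side buffer batch the puts
      table.setWriteBufferSize(4L * 1024 * 1024)     // instead of hand-rolled caching
      rows.foreach { row =>
        val put = new Put(Bytes.toBytes(row.getString(0)))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(row.getString(1)))
        table.put(put)
      }
      table.flushCommits()
      table.close()
    }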

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Murthy Chelankuri
I am using spark streaming. what i am trying to do is sending few messages to some kafka topic. where its failing. java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at

Yarn application ID for Spark job on Yarn

2015-06-22 Thread roy
Hi, Is there a way to get Yarn application ID inside spark application, when running spark Job on YARN ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-application-ID-for-Spark-job-on-Yarn-tp23429.html Sent from the Apache Spark User List
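
A minimal sketch of one way to do this, assuming a Spark release whose SparkContext exposes applicationId (present in later 1.x versions); on YARN it returns the application id that YARN assigned.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("my-app"))
    // On YARN this is the YARN application id; on other cluster managers it is their app id.
    println(s"Application id: ${sc.applicationId}")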

Re: Help optimising Spark SQL query

2015-06-22 Thread Lior Chaga
Hi James, There are a few configurations that you can try: https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options From my experience, the codegen really boost things up. Just run sqlContext.sql(spark.sql.codegen=true) before you execute your query. But keep
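
For reference, a hedged sketch of the codegen hint above for Spark 1.3/1.4 (an existing sqlContext is assumed; codegen became the default in later releases):

    // either set it programmatically ...
    sqlContext.setConf("spark.sql.codegen", "true")
    // ... or as a SQL statement before running the query
    sqlContext.sql("SET spark.sql.codegen=true")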

Re: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-06-22 Thread Olivier Girardot
Hi, I can't get this to work using CDH 5.4, Spark 1.4.0 in yarn cluster mode. @andrew did you manage to get it to work with the latest version? On Tue, Apr 21, 2015 at 00:02, Andrew Lee alee...@hotmail.com wrote: Hi Marcelo, Exactly what I need to track, thanks for the JIRA pointer. Date:

Re: Help optimising Spark SQL query

2015-06-22 Thread James Aley
Thanks for the responses, guys! Sorry, I forgot to mention that I'm using Spark 1.3.0, but I'll test with 1.4.0 and try the codegen suggestion then report back. On 22 June 2015 at 12:37, Matthew Johnson matt.john...@algomi.com wrote: Hi James, What version of Spark are you using? In Spark

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
Thanks, I've updated my code to use updateStateByKey but am still getting these errors when I resume from a checkpoint. One thought of mine was that I used sc.parallelize to generate the RDDs for the queue, but perhaps on resume, it doesn't recreate the context needed? -- Shaanan Cohney PhD

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
Where does task_batches come from? On 22 Jun 2015 4:48 pm, Shaanan Cohney shaan...@gmail.com wrote: Thanks, I've updated my code to use updateStateByKey but am still getting these errors when I resume from a checkpoint. One thought of mine was that I used sc.parallelize to generate the RDDs

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
It's a generated set of shell commands to run (written in C, highly optimized numerical computation), which is created from a set of user-provided parameters. The snippet above is: task_outfiles_to_cmds = OrderedDict(run_sieving.leftover_tasks)

Re: Help optimising Spark SQL query

2015-06-22 Thread Ntale Lukama
Have you tested this on a smaller set to verify that the query is correct? On Mon, Jun 22, 2015 at 2:59 PM, ayan guha guha.a...@gmail.com wrote: You may also want to change count(*) to specific column. On 23 Jun 2015 01:29, James Aley james.a...@swiftkey.com wrote: Hello, A colleague of mine

jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Murthy Chelankuri
I have been using Spark for the last 6 months with version 1.2.0. I am trying to migrate to 1.3.0, but the same program I have written is not working. It's giving a class-not-found error when I try to load some dependent jars from the main program. This used to work in 1.2.0 when set

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Akhil Das
Yes. Thanks Best Regards On Mon, Jun 22, 2015 at 8:33 PM, Murthy Chelankuri kmurt...@gmail.com wrote: I have more than one jar. can we set sc.addJar multiple times with each dependent jar ? On Mon, Jun 22, 2015 at 8:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Try sc.addJar instead
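
A minimal Scala sketch of the two options being discussed; the jar paths are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .setJars(Seq("/path/to/dep-a.jar", "/path/to/dep-b.jar"))  // list every dependent jar
    val sc = new SparkContext(conf)
    sc.addJar("/path/to/kafka-encoder.jar")  // or call addJar once per extra jar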

Calling rdd() on a DataFrame causes stage boundary

2015-06-22 Thread Alex Nastetsky
When I call rdd() on a DataFrame, it ends the current stage and starts a new one that just maps the DataFrame to rdd and nothing else. It doesn't seem to do a shuffle (which is good and expected), but then why is there a separate stage? I also thought that stages only end when there's a

Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread pawan kumar
Hi Matthew, you could add the dependencies yourself by using the %dep command in zeppelin ( https://zeppelin.incubator.apache.org/docs/interpreter/spark.html). I have not tried with zeppelin but have used spark-notebook https://github.com/andypetrella/spark-notebook and got Cassandra connector
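
For reference, the %dep approach mentioned above looks roughly like this in a Zeppelin paragraph; the connector coordinates (Scala 2.10, Spark 1.4 era) are an assumption:

    %dep
    z.reset()
    z.load("com.datastax.spark:spark-cassandra-connector_2.10:1.4.0-M1")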

Re: SQL vs. DataFrame API

2015-06-22 Thread Ignacio Blasco
Sorry, thought it was scala/spark. On 22/6/2015 9:49 p.m., Bob Corsaro rcors...@gmail.com wrote: That's invalid syntax. I'm pretty sure pyspark is using a DSL to create a query here and not actually doing an equality operation. On Mon, Jun 22, 2015 at 3:43 PM Ignacio Blasco

Spark job fails silently

2015-06-22 Thread roy
Hi, Our spark job on yarn suddenly started failing silently without showing any error following is the trace. Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer Adding default property:

workaround for groupByKey

2015-06-22 Thread Jianguo Li
Hi, I am processing an RDD of key-value pairs. The key is a user_id, and the value is a website url the user has visited. Since I need to know all the urls each user has visited, I am tempted to call groupByKey on this RDD. However, since there could be millions of users and urls,

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
There is reduceByKey that works on K,V. You need to accumulate partial results and proceed. does your computation allow that ? On Mon, Jun 22, 2015 at 2:12 PM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, I am processing an RDD of key-value pairs. The key is an user_id, and the value is
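
A hedged sketch of the aggregateByKey route mentioned above: it still collects all urls per user, but it combines map-side so only partial sets are shuffled, unlike groupByKey which shuffles every raw pair. Toy data stands in for the real input.

    import org.apache.spark.{SparkConf, SparkContext}

    object UrlsPerUser {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("urls-per-user"))
        // (user_id, url) pairs, as in the question
        val visits = sc.parallelize(Seq(("u1", "a.com"), ("u1", "b.com"), ("u2", "a.com")))
        val urlsPerUser = visits.aggregateByKey(Set.empty[String])(
          (set, url) => set + url,   // fold one url into a partition-local set
          (a, b) => a ++ b)          // merge sets from different partitions
        urlsPerUser.collect().foreach(println)
        sc.stop()
      }
    }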

Re: SQL vs. DataFrame API

2015-06-22 Thread Davies Liu
Right now, we can not figure out which column you referenced in `select`, if there are multiple row with the same name in the joined DataFrame (for example, two `value`). A workaround could be: numbers2 = numbers.select(df.name, df.value.alias('other')) rows = numbers.join(numbers2,

spark on yarn failing silently

2015-06-22 Thread roy
Hi, suddenly our spark job on yarn started failing silently without showing any error, following is the trace in verbose mode Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer Adding default

Re: Velox Model Server

2015-06-22 Thread Debasish Das
Models that I am looking for are mostly factorization based models (which includes both recommendation and topic modeling use-cases). For recommendation models, I need a combination of Spark SQL and ml model prediction api...I think spark job server is what I am looking for and it has fast http

Re: Submitting Spark Applications using Spark Submit

2015-06-22 Thread Andrew Or
Did you restart your master / workers? On the master node, run `sbin/stop-all.sh` followed by `sbin/start-all.sh` 2015-06-20 17:59 GMT-07:00 Raghav Shankar raghav0110...@gmail.com: Hey Andrew, I tried the following approach: I modified my Spark build on my local machine. I did downloaded

Re: Why can't I allocate more than 4 executors with 2 machines on YARN?

2015-06-22 Thread ๏̯͡๏
1) Can you try with yarn-cluster? 2) Does your queue have enough capacity? On Mon, Jun 22, 2015 at 11:10 AM, Saiph Kappa saiph.ka...@gmail.com wrote: Hi, I am running a simple spark streaming application on hadoop 2.7.0/YARN (master: yarn-client) cluster with 2 different machines (12GB RAM

Re: s3 - Can't make directory for path

2015-06-22 Thread Danny
hi, have you tested s3://ww-sandbox/name_of_path/ instead of s3://ww-sandbox/name_of_path, or have you tested adding your file extension with a placeholder (*), like s3://ww-sandbox/name_of_path/*.gz or s3://ww-sandbox/name_of_path/*.csv, depending on your files? If it does not work pls test with

Re: Compute Median in Spark Dataframe

2015-06-22 Thread Deenar Toraskar
Many thanks, will look into this. I don't particularly want to reuse the custom Hive UDAF I have; I would prefer writing a new one if that is cleaner. I am just using the JVM. On 5 June 2015 at 00:03, Holden Karau hol...@pigscanfly.ca wrote: My current example doesn't use a Hive UDAF, but you

which mllib algorithm for large multi-class classification?

2015-06-22 Thread Danny
hi, I am unfortunately not very familiar with the whole MLlib stack, so I would appreciate a little help: which multi-class classification algorithm should I use if I want to train texts (100-1000 words each) into categories? The number of categories is between 100-500 and the number of training
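
Not from the thread, but one common baseline for this kind of problem is hashed bag-of-words features plus a multinomial Naive Bayes model, both available in MLlib at the time. A hedged sketch, assuming an existing SparkContext sc and toy training data:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    // (categoryId, text) pairs stand in for the real training set
    val docs = sc.parallelize(Seq((0.0, "spark streaming kafka"), (1.0, "hive sql query")))
    val tf = new HashingTF(1 << 18)                    // hashed bag-of-words features
    val training = docs.map { case (label, text) =>
      LabeledPoint(label, tf.transform(text.split(" ").toSeq))
    }
    val model = NaiveBayes.train(training)             // handles hundreds of classes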

Question about SPARK_WORKER_CORES and spark.task.cpus

2015-06-22 Thread Rui Li
Hi, I was running a WordCount application on Spark, and the machine I used has 4 physical cores. However, in spark-env.sh file, I set SPARK_WORKER_CORES = 32. The web UI says it launched one executor with 32 cores and the executor could execute 32 tasks simultaneously. Does spark create 32

Re: workaround for groupByKey

2015-06-22 Thread ๏̯͡๏
Silvio, suppose my RDD is (K-1, v1,v2,v3,v4). If I want to do simple addition I can use reduceByKey or aggregateByKey. What if my processing needs to check all the items in the value list each time? The above two operations do not get all the values; they just get two at a time (v1, v2), you do some

Re: Why can't I allocate more than 4 executors with 2 machines on YARN?

2015-06-22 Thread Saiph Kappa
OK, I figured out this. The maximum number of containers YARN can create per node is based on the total available RAM and the maximum allocation per container ( yarn.scheduler.maximum-allocation-mb ). The default is 8192; setting to a lower value allowed me to create more containers per node. On

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Murthy Chelankuri
Yes, I have the producer in the classpath. And I am using it in standalone mode. Sent from my iPhone On 23-Jun-2015, at 3:31 am, Tathagata Das t...@databricks.com wrote: Do you have the Kafka producer in your classpath? If so, how are you adding that library? Are you running on YARN, or Mesos or

Fwd: Storing an action result in HDFS

2015-06-22 Thread ravi tella
Hello All, I am new to Spark. I have a very basic question. How do I write the output of an action on an RDD to HDFS? Thanks in advance for the help. Cheers, Ravi

Re: Storing an action result in HDFS

2015-06-22 Thread ddpisfun
Hi Chris, Thanks for the quick reply and the welcome. I am trying to read a file from hdfs and then write back just the first line to hdfs. I am calling first() on the RDD to get the first line. Sent from my iPhone On Jun 22, 2015, at 7:42 PM, Chris Gore cdg...@cdgore.com wrote: Hi Ravi,

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Tathagata Das
Do you have the Kafka producer in your classpath? If so, how are you adding that library? Are you running on YARN, or Mesos, or Standalone, or local? These details will be very useful. On Mon, Jun 22, 2015 at 8:34 AM, Murthy Chelankuri kmurt...@gmail.com wrote: I am using spark streaming. what i am trying

New Spark Meetup group in Munich

2015-06-22 Thread Danny Linden
Hi everyone, I want to announce that we have created a Spark Meetup Group in Munich (Germany). We are currently planning our first event, which will take place in July. There we will show basics about Spark to catch a lot of people who are new to this framework. In the following events we will go deeper

Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi, Welcome, you probably want RDD.saveAsTextFile(“hdfs:///my_file”) Chris On Jun 22, 2015, at 5:28 PM, ravi tella ddpis...@gmail.com wrote: Hello All, I am new to Spark. I have a very basic question.How do I write the output of an action on a RDD to HDFS? Thanks in advance

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
You’re right of course, I’m sorry. I was typing before thinking about what you actually asked! On a second thought, what is the ultimate outcome for what you want the sequence of pages for? Do they need to actually all be grouped? Could you instead partition by user id then use a mapPartitions

Re: Confusion matrix for binary classification

2015-06-22 Thread Burak Yavuz
Hi, In Spark 1.4, you may use DataFrame.stat.crosstab to generate the confusion matrix. This would be very simple if you are using the ML Pipelines Api, and are working with DataFrames. Best, Burak On Mon, Jun 22, 2015 at 4:21 AM, CD Athuraliya cdathural...@gmail.com wrote: Hi, I am looking
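
A hedged sketch of the crosstab suggestion for Spark 1.4, assuming an existing SQLContext; the column names and toy data are illustrative:

    val predictions = sqlContext.createDataFrame(Seq(
      (1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
    )).toDF("prediction", "label")
    // rows = predicted class, columns = actual class
    predictions.stat.crosstab("prediction", "label").show()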

Programming with java on spark

2015-06-22 Thread 付雅丹
Hello, everyone! I'm new to Spark. I have already written programs on Hadoop 2.5.2, where I defined my own InputFormat and OutputFormat. Now I want to move my code to Spark using the Java language. The first problem I encountered is how to transform a big txt file in local storage into an RDD, which is
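
A hedged Scala sketch of reusing a Hadoop InputFormat from Spark (the Java API, JavaSparkContext.newAPIHadoopFile, takes the same arguments); TextInputFormat and the file path are placeholders for your own classes:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("custom-input-format"))
    val records = sc.newAPIHadoopFile(
      "file:///data/big.txt",        // hypothetical local path
      classOf[TextInputFormat],      // substitute your own InputFormat here
      classOf[LongWritable],
      classOf[Text])
    println(records.map(_._2.toString).first())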

JavaDStreamString read and write rdbms

2015-06-22 Thread Manohar753
Hi Team, How to split the read JavaDStream<String> and put it into MySQL in Java? Is there any existing API in Spark 1.3/1.4? Team, can you please share a code snippet if anybody has it? Thanks, Manohar -- View this message in context:

Re: JavaDStreamString read and write rdbms

2015-06-22 Thread ayan guha
The Spark docs address this pretty well. Look for usage patterns of foreachRDD. On 22 Jun 2015 17:09, Manohar753 manohar.re...@happiestminds.com wrote: Hi Team, How to split the read JavaDStream<String> and put it into MySQL in Java? Is there any existing API in Spark 1.3/1.4? Team, can you please share

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
Counts is a list (counts = []) in the driver, used to collect the results. It seems like it's also not the best way to be doing things, but I'm new to spark and editing someone else's code so still learning. Thanks! def update_state(out_files, counts, curr_rdd): try: for c in

Serializer not switching

2015-06-22 Thread Sean Barzilay
I am trying to run a function on every line of a parquet file. The function is in an object. When I run the program, I get an exception that the object is not serializable. I read around the internet and found that I should use Kryo Serializer. I changed the setting in the spark conf and

[Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Shaanan Cohney
I'm receiving the SPARK-5063 error (RDD transformations and actions can only be invoked by the driver, not inside of other transformations) whenever I try and restore from a checkpoint in spark streaming on my app. I'm using python3 and my RDDs are inside a queuestream DStream. This is the

Re: Using Accumulators in Streaming

2015-06-22 Thread anshu shukla
But I just want to update the RDD by appending a unique message ID to each element of the RDD, which should be automatically incremented (m++ ...) every time a new element comes into the RDD. On Mon, Jun 22, 2015 at 7:05 AM, Michal Čizmazia mici...@gmail.com wrote: StreamingContext.sparkContext() On 21

Spark 1.3 - Connect to to Cassandra - cassandraTable is not recognised by sc

2015-06-22 Thread Koen Vantomme
Hello, I'm writing an application in Scala to connect to Cassandra to read the data. My setup is IntelliJ with Maven. When I try to compile the application I get the following errors: error: object datastax is not a member of package com; error: value cassandraTable is not a member of

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
You can use fileStream for that; look at the XmlInputFormat https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java of Mahout. It should give you a full XML object as one record (as opposed to an XML

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
What does counts refer to? Could you also paste the code of your update_state function? On 22 Jun 2015 12:48 pm, Shaanan Cohney shaan...@gmail.com wrote: I'm receiving the SPARK-5063 error (RDD transformations and actions can only be invoked by the driver, not inside of other transformations)

Re: Spark 1.3 - Connect to to Cassandra - cassandraTable is not recognised by sc

2015-06-22 Thread Koen Vantomme
It's fixed now; adding the dependency in pom.xml fixed it: <dependency> <groupId>com.datastax.spark</groupId> <artifactId>spark-cassandra-connector-embedded_2.10</artifactId> <version>1.4.0-M1</version> </dependency> On Mon, Jun 22, 2015 at 10:46 AM, Koen Vantomme koen.vanto...@gmail.com wrote: Hello,

Re: s3 - Can't make directory for path

2015-06-22 Thread Akhil Das
Could you elaborate a bit more? What do you meant by set up a standalone server? and what is leading you to that exceptions? Thanks Best Regards On Mon, Jun 22, 2015 at 2:22 AM, nizang ni...@windward.eu wrote: hi, I'm trying to setup a standalone server, and in one of my tests, I got the

Confusion matrix for binary classification

2015-06-22 Thread CD Athuraliya
Hi, I am looking for a way to get confusion matrix for binary classification. I was able to get confusion matrix for multiclass classification using this [1]. But I could not find a proper way to get confusion matrix in similar class available for binary classification [2]. Later I found this

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
I would suggest you have a look at the updateStateByKey transformation in the Spark Streaming programming guide which should fit your needs better than your update_state function. On 22 Jun 2015 1:03 pm, Shaanan Cohney shaan...@gmail.com wrote: Counts is a list (counts = []) in the driver, used
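
A minimal Scala sketch of updateStateByKey (the thread itself is Python, where the method has the same shape); it assumes a DStream[(String, Long)] named pairs and that ssc.checkpoint(...) has been set, and keeps a running count per key:

    val runningCounts = pairs.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newValues.sum)   // fold this batch's values into the stored state
    }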

Re: Problem attaching to YARN

2015-06-22 Thread Steve Loughran
On 22 Jun 2015, at 04:08, Shawn Garbett shawn.garb...@gmail.commailto:shawn.garb...@gmail.com wrote: 2015-06-21 11:03:22,029 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=39288,containerID=container_1434751301309_0015_02_01]

RE: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Matthew Johnson
Thanks Mohammed, it’s good to know I’m not alone! How easy is it to integrate Zeppelin with Spark on Cassandra? It looks like it would only support Hadoop out of the box. Is it just a case of dropping the Cassandra Connector onto the Spark classpath? Cheers, Matthew *From:* Mohammed

Re: Using Accumulators in Streaming

2015-06-22 Thread Michal Čizmazia
If I am not mistaken, one way to see the accumulators is that they are just write-only for the workers and their value can be read by the driver. Therefore they cannot be used for ID generation as you wish. On 22 June 2015 at 04:30, anshu shukla anshushuk...@gmail.com wrote: But i just want to

Re: memory needed for each executor

2015-06-22 Thread Akhil Das
Totally depends on the use-case that you are solving with Spark, for instance there was some discussion around the same which you could read over here http://apache-spark-user-list.1001560.n3.nabble.com/How-does-one-decide-no-of-executors-cores-memory-allocation-td23326.html Thanks Best Regards

Re: JavaDStreamString read and write rdbms

2015-06-22 Thread Akhil Das
It's pretty straightforward; this would get you started: http://stackoverflow.com/questions/24896233/how-to-save-apache-spark-schema-output-in-mysql-database Thanks Best Regards On Mon, Jun 22, 2015 at 12:39 PM, Manohar753 manohar.re...@happiestminds.com wrote: Hi Team, How to split and
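
A hedged Scala sketch of the foreachRDD pattern referenced above (the Java API follows the same shape); the DStream[String] named stream, the JDBC URL, credentials, table layout and comma-separated record format are all assumptions, and the MySQL JDBC driver must be on the executor classpath:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { lines =>
        // open one connection per partition, on the executor
        val conn = java.sql.DriverManager.getConnection(
          "jdbc:mysql://dbhost:3306/mydb", "user", "password")
        val stmt = conn.prepareStatement("INSERT INTO events(col1, col2) VALUES (?, ?)")
        lines.foreach { line =>
          val parts = line.split(",")   // split each record before writing
          stmt.setString(1, parts(0))
          stmt.setString(2, parts(1))
          stmt.executeUpdate()
        }
        stmt.close()
        conn.close()
      }
    }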

Re: Support for Windowing and Analytics functions in Spark SQL

2015-06-22 Thread ayan guha
1.4 supports it On 23 Jun 2015 02:59, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, Though the documentation does not explicitly mention support for Windowing and Analytics function in Spark SQL, looks like it is not supported. I tried running a query like Select Lead(column name,

Re: Help optimising Spark SQL query

2015-06-22 Thread Yin Huai
Hi James, Maybe it's the DISTINCT causing the issue. I rewrote the query as follows. Maybe this one can finish faster. select sum(cnt) as uses, count(id) as users from ( select count(*) cnt, cast(id as string) as id from usage_events where

External Jar file with SparkR

2015-06-22 Thread mtn111
I have been unsuccessful with incorporating an external Jar into a SparkR program. Does anyone know how to do this successfully? JarTest.java:

    package com.myco;

    public class JarTest {
        public static double myStaticMethod() {
            return 5.515;
        }
    }

Re: PySpark on YARN port out of range

2015-06-22 Thread Andrew Or
Unfortunately there is not a great way to do it without modifying Spark to print more things it reads from the stream. 2015-06-20 23:10 GMT-07:00 John Meehan meeh...@dls.net: Yes, it seems to be consistently "port out of range: 1315905645". Is there any way to see what the python process is

Re: Help optimising Spark SQL query

2015-06-22 Thread ayan guha
You may also want to change count(*) to specific column. On 23 Jun 2015 01:29, James Aley james.a...@swiftkey.com wrote: Hello, A colleague of mine ran the following Spark SQL query: select count(*) as uses, count (distinct cast(id as string)) as users from usage_events where

Re: SQL vs. DataFrame API

2015-06-22 Thread Ignacio Blasco
Probably you should use === instead of == and !== instead of != Can anyone explain why the dataframe API doesn't work as I expect it to here? It seems like the column identifiers are getting confused. https://gist.github.com/dokipen/4b324a7365ae87b7b0e5
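
For the Scala API that this reply had in mind (later messages note the original gist was pyspark, where == is the DSL operator), a small sketch assuming two DataFrames df1 and df2 with id and value columns:

    // === builds a Column equality expression; plain == would compare object references
    val joined   = df1.join(df2, df1("id") === df2("id"))
    val filtered = df1.filter(df1("value") !== 0)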

Why can't I allocate more than 4 executors with 2 machines on YARN?

2015-06-22 Thread Saiph Kappa
Hi, I am running a simple spark streaming application on hadoop 2.7.0/YARN (master: yarn-client) cluster with 2 different machines (12GB RAM with 8 CPU cores each). I am launching my application like this: ~/myapp$ ~/my-spark/bin/spark-submit --class App --master yarn-client --driver-memory 4g

Re: Help optimising Spark SQL query

2015-06-22 Thread Jörn Franke
Generally (not only Spark SQL specific) you should not cast in the where part of a SQL query. It is also not necessary in your case. Getting rid of casts in the whole query will also be beneficial. On Mon, Jun 22, 2015 at 17:29, James Aley james.a...@swiftkey.com wrote: Hello, A colleague

Re: Multiple executors writing file using java filewriter

2015-06-22 Thread Richard Marscher
Is spoutLog just a non-Spark file writer? If you run that in the map call on a cluster, it's going to be writing to the filesystem of the executor it's being run on. I'm not sure if that's what you intended. On Mon, Jun 22, 2015 at 1:35 PM, anshu shukla anshushuk...@gmail.com wrote: Running

Re: Multiple executors writing file using java filewriter

2015-06-22 Thread anshu shukla
Thanks for the reply!! YES, either it should write on any machine of the cluster, or can you please help me with how to do this? Previously I was writing using collect(), so some of my tuples were missing while writing. //previous logic that was just creating the file on master -

SQL vs. DataFrame API

2015-06-22 Thread Bob Corsaro
Can anyone explain why the dataframe API doesn't work as I expect it to here? It seems like the column identifiers are getting confused. https://gist.github.com/dokipen/4b324a7365ae87b7b0e5

Re: SQL vs. DataFrame API

2015-06-22 Thread Bob Corsaro
That's invalid syntax. I'm pretty sure pyspark is using a DSL to create a query here and not actually doing an equality operation. On Mon, Jun 22, 2015 at 3:43 PM Ignacio Blasco elnopin...@gmail.com wrote: Probably you should use === instead of == and !== instead of != Can anyone explain why

Re: Storing an action result in HDFS

2015-06-22 Thread Chris Gore
Hi Ravi, For this case, you could simply do sc.parallelize([rdd.first()]).saveAsTextFile(“hdfs:///my_file”) using pyspark or sc.parallelize(Array(rdd.first())).saveAsTextFile(“hdfs:///my_file”) using Scala Chris On Jun 22, 2015, at 5:53 PM, ddpis...@gmail.com wrote: Hi Chris, Thanks for
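
Putting the suggestion together as a self-contained Scala sketch; the input and output paths are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object FirstLineToHdfs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("first-line"))
        val rdd = sc.textFile("hdfs:///input/my_file")
        // first() returns a local value, so wrap it back into an RDD before writing
        sc.parallelize(Seq(rdd.first())).saveAsTextFile("hdfs:///output/first_line")
        sc.stop()
      }
    }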

Re: Serializer not switching

2015-06-22 Thread Josh Rosen
My hunch is that you changed spark.serializer to Kryo but left spark.closureSerializer unmodified, so it's still using Java for closure serialization. Kryo doesn't really work as a closure serializer but there's an open pull request to fix this: https://github.com/apache/spark/pull/6361 On Mon,

Re: Confusion matrix for binary classification

2015-06-22 Thread CD Athuraliya
Hi Burak, Thanks for the response. I am using Spark version 1.3.0 through Java API. Regards, CD On Tue, Jun 23, 2015 at 5:11 AM, Burak Yavuz brk...@gmail.com wrote: Hi, In Spark 1.4, you may use DataFrame.stat.crosstab to generate the confusion matrix. This would be very simple if you are

Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-22 Thread dgoldenberg
Is there any way to retrieve the time of each message's arrival into a Kafka topic, when streaming in Spark, whether with receiver-based or direct streaming? Thanks. -- View this message in context:

RE: Support for Windowing and Analytics functions in Spark SQL

2015-06-22 Thread Cheng, Hao
Yes, it should be with HiveContext, not SQLContext. From: ayan guha [mailto:guha.a...@gmail.com] Sent: Tuesday, June 23, 2015 2:51 AM To: smazumder Cc: user Subject: Re: Support for Windowing and Analytics functions in Spark SQL 1.4 supports it On 23 Jun 2015 02:59, Sourav Mazumder
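
A hedged Scala sketch of the point above for Spark 1.4: window functions such as LEAD need a HiveContext. The table and column names are placeholders, and sc is an existing SparkContext.

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql(
      "SELECT LEAD(col1, 1) OVER (PARTITION BY col2 ORDER BY col3) AS next_val FROM my_table")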

Re: Velox Model Server

2015-06-22 Thread Nick Pentreath
How large are your models? Spark job server does allow synchronous job execution and with a warm long-lived context it will be quite fast - but still in the order of a second or a few seconds usually (depending on model size - for very large models possibly quite a lot more than that).

RE: Question about SPARK_WORKER_CORES and spark.task.cpus

2015-06-22 Thread Cheng, Hao
It's actually not that tricky. SPARK_WORKER_CORES is the max size of the executor's task thread pool, hence the earlier statement that "one executor with 32 cores and the executor could execute 32 tasks simultaneously". Spark doesn't care how many real physical CPUs/cores you have (the OS does), so
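
For reference, the two settings being discussed, in a minimal sketch (the values are the ones from the question):

    # spark-env.sh -- advertise 32 task slots per worker, regardless of physical cores
    SPARK_WORKER_CORES=32

    # spark-defaults.conf -- each task claims one of those slots (this is the default)
    spark.task.cpus 1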

Re: Serializer not switching

2015-06-22 Thread Akhil Das
How are you submitting the application? Could you paste the code that you are running? Thanks Best Regards On Mon, Jun 22, 2015 at 5:37 PM, Sean Barzilay sesnbarzi...@gmail.com wrote: I am trying to run a function on every line of a parquet file. The function is in an object. When I run the

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
Like this? val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text]) Thanks Best Regards On Mon, Jun 22, 2015 at 5:45 PM, Yong Feng fengyong...@gmail.com wrote: Thanks a lot, Akhil I saw this mail thread before, but still do not understand how
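
A hedged, typed version of the call above for the Scala API, which takes the key/value/format types as type parameters rather than classOf arguments; it assumes ssc and path already exist and that Mahout's XmlInputFormat is on the classpath:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.mahout.classifier.bayes.XmlInputFormat   // the class linked earlier in the thread

    val rawXmls = ssc.fileStream[LongWritable, Text, XmlInputFormat](path)
    val xmlDocs = rawXmls.map(_._2.toString)   // one whole XML document per record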

Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Ashish Soni
Hi All, What is the best way to install a Spark cluster alongside a Hadoop cluster? Any recommendation for the below deployment topology will be a great help. Also, is it necessary to put the Spark Worker on the DataNodes, so that when it reads a block from HDFS it will be local to the server / worker, or

Re: Serializer not switching

2015-06-22 Thread Sean Barzilay
My program is written in Scala. I am creating a jar and submitting it using spark-submit. My code is on a computer in an internal network with no internet so I can't send it. On Mon, Jun 22, 2015, 3:19 PM Akhil Das ak...@sigmoidanalytics.com wrote: How are you submitting the application? Could

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Yong Feng
Thanks a lot, Akhil. I saw this mail thread before, but still do not understand how to use the XmlInputFormat of Mahout in Spark Streaming (I am not a Spark Streaming expert yet ;-)). Can you show me some sample code for explanation? Thanks in advance, Yong On Mon, Jun 22, 2015 at 6:44 AM, Akhil Das

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Akhil Das
Option 1 should be fine, Option 2 would bound a lot on network as the data increase in time. Thanks Best Regards On Mon, Jun 22, 2015 at 5:59 PM, Ashish Soni asoni.le...@gmail.com wrote: Hi All , What is the Best Way to install and Spark Cluster along side with Hadoop Cluster , Any

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Yong Feng
Thanks Akhil, I will have a try and then get back to you. Yong On Mon, Jun 22, 2015 at 8:25 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Like this? val rawXmls = ssc.fileStream(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text]) Thanks Best Regards On Mon, Jun

Re: Using Accumulators in Streaming

2015-06-22 Thread Michal Čizmazia
I stumbled upon zipWithUniqueId/zipWithIndex. Is this what you are looking for? https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDDLike.html#zipWithUniqueId() On 22 June 2015 at 06:16, Michal Čizmazia mici...@gmail.com wrote: If I am not mistaken, one way to see
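
A hedged Scala sketch of using zipWithUniqueId from inside a stream, assuming a DStream[String] named inputStream; note the ids are only unique within each batch RDD, so combine them with the batch time if global uniqueness is needed:

    val withIds = inputStream.transform { rdd =>
      // zipWithUniqueId assigns a Long per element without any accumulator
      rdd.zipWithUniqueId().map { case (msg, id) => s"$id,$msg" }
    }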

Re: understanding on the waiting batches and scheduling delay in Streaming UI

2015-06-22 Thread Fang, Mike
Hi Das, Thanks for your reply. Somehow I missed it. I am using Spark 1.3. The data source is from kafka. Yeah, not sure why the delay is 0. I'll run against 1.4 and give a screenshot. Thanks, Mike From: Akhil Das ak...@sigmoidanalytics.com Date: Thursday, June

Re: Registering custom metrics

2015-06-22 Thread dgoldenberg
Hi Gerard, Have there been any responses? Any insights as to what you ended up doing to enable custom metrics? I'm thinking of implementing a custom metrics sink, not sure how doable that is yet... Thanks. -- View this message in context:

Re: Custom Metrics Sink

2015-06-22 Thread dgoldenberg
Hi, I was wondering if there've been any responses to this? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Custom-Metrics-Sink-tp10068p23425.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
