RE: Why is my spark executor terminated?

2015-10-14 Thread Wang, Ningjun (LNG-NPV)
I checked the master log before and did not find anything wrong. Unfortunately I have lost the master log now. So you think the master log will tell why the executor is down? Regards, Ningjun Wang -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Tuesday,

Dynamic partitioning pruning

2015-10-14 Thread Younes Naguib
Hi, This feature was added in Hive 1.3. https://issues.apache.org/jira/browse/HIVE-9152 Any idea when this would be in Spark? Or is it already? Any work around in spark 1.5.1? Thanks, Younes

RE: Node affinity for Kafka-Direct Stream

2015-10-14 Thread prajod.vettiyattil
Hi, Another point is that in the receiver-based approach, all the data from Kafka first goes to the Worker where the receiver runs https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md Also if you create one stream (which is the normal case), and you have many worker

Get *document*-topic distribution from PySpark LDA model?

2015-10-14 Thread moustachio
Hi! I already have a StackOverflow question on this (see here ), but haven't received any responses, so I thought I'd try here! Long story short, I'm working in PySpark and have successfully

Re: Node affinity for Kafka-Direct Stream

2015-10-14 Thread Cody Koeninger
Assumptions about locality in spark are not very reliable, regardless of what consumer you use. Even if you have locality preferences, and locality wait turned up really high, you still have to account for losing executors. On Wed, Oct 14, 2015 at 8:23 AM, Gerard Maas

NullPointerException when adding to accumulator

2015-10-14 Thread Sela, Amit
I'm running a simple streaming application that reads from Kafka, maps the events and prints them and I'm trying to use accumulators to count the number of mapped records. While this works in standalone(IDE), when submitting to YARN I get NullPointerException on accumulator.add(1) or

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-14 Thread Jonathan Kelly
Ah, yes, it will use private IPs, so you may need to update your FoxyProxy settings to include the private IPs in the regex as well as the public IPs. Also, yes, for completed applications you may use the Spark History Server on port 18080. The YARN ProxyServer will automatically redirect to the

Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
My current version of Spark is 1.3.0 and my question is this: I have large data frames where the main field is a user id. I need to do many group by's and joins using that field. Will the performance increase if, before doing any group by or join operation, I first convert to rdd to

Re: How to calculate percentile of a column of DataFrame?

2015-10-14 Thread Umesh Kacha
Hi Ted, thanks much for your help. So the fix is in JIRA 10671 and it is supposed to be released in Spark 1.6.0, right? Until 1.6.0 is released I won't be able to invoke callUdf using a string and percentile_approx with lit as an argument, right? On Oct 14, 2015 03:26, "Ted Yu" wrote: > I
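Until that fix lands, a workaround sketch (assuming a HiveContext; the table and column names below are made up for illustration) is to call the Hive UDAF through SQL instead of callUDF:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)            // sc: an existing SparkContext
    import hiveContext.implicits._

    val df = sc.parallelize(Seq(1.0, 2.0, 3.0, 100.0)).map(Tuple1.apply).toDF("latency")
    df.registerTempTable("logs")

    // percentile_approx is a Hive UDAF, so it is available through HiveQL
    val p95 = hiveContext.sql(
      "SELECT percentile_approx(latency, 0.95) FROM logs").first().getDouble(0)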

Re: Why is my spark executor terminated?

2015-10-14 Thread Jean-Baptiste Onofré
Hi Ningjun I just wanted to check that the master didn't "kick out" the worker, as the "Disassociated" can come from the master. Here it looks like the worker killed the executor before shutting down itself. What's the Spark version ? Regards JB On 10/14/2015 04:42 PM, Wang, Ningjun

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
Thanks Michael for your input. By 1) do you mean: - Caching the partitioned_rdd - Caching the partitioned_df - *Or* just caching unpartitioned_df without the need of creating the partitioned_rdd variable? Can you expand a little bit more on 2)? Thanks! On Wed, Oct 14, 2015 at 12:11

Re: Programmatically connect to remote YARN in yarn-client mode

2015-10-14 Thread Marcelo Vanzin
On Wed, Oct 14, 2015 at 10:01 AM, Florian Kaspar wrote: > we are working on a project running on Spark. Currently we connect to a > remote Spark-Cluster in Standalone mode to obtain the SparkContext using > > new JavaSparkContext(new >

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-14 Thread Spark Newbie
Is it slowing things down or blocking progress. >> I didn't see slowing of processing, but I do see jobs aborted consecutively for a period of 18 batches (5 minute batch intervals). So I am worried about what happened to the records that these jobs were processing. Also, one more thing to mention

Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-14 Thread Reynold Xin
+dev list On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo wrote: > All, > > Does anyone meet memory leak issue with spark streaming and spark sql in > spark 1.5.1? I can see the memory is increasing all the time when running > this simple sample: > > val sc = new

Strange spark problems among different versions

2015-10-14 Thread xia zhao
Hi. I am trying to run the Spark Pi example on the cluster; some strange errors happen and I do not know what causes them. When I am using hadoop2.6 and spark-1.5.1-bin-hadoop2.6 the error log is below: 118 10/01/01 11:59:14 ERROR yarn.ApplicationMaster: User class threw exception:

Re: Building with SBT and Scala 2.11

2015-10-14 Thread Jakob Odersky
[Repost to mailing list] Hey, Sorry about the typo, I of course meant hadoop-2.6, not 2.11. I suspect something bad happened with my Ivy cache, since when reverting back to scala 2.10, I got a very strange IllegalStateException (something something IvyNode, I can't remember the details). Kilking

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
This won't help, for two reasons: 1) It's all still just creating lineage since you aren't caching the partitioned data. It will still fetch the shuffled blocks for each query. 2) The query optimizer is not aware of RDD-level partitioning since it's mostly a black box. 1) could be fixed by

Re: Programmatically connect to remote YARN in yarn-client mode

2015-10-14 Thread Florian Kaspar
Thank you, Marcelo, so it is possible to simply copy the YARN configuration from the remote cluster to the local machine (assuming, the local machine can resolve the YARN host etc.) and just letting Spark do the rest? This would actually be great! Our "local" machine is just another virtual

Re: Programmatically connect to remote YARN in yarn-client mode

2015-10-14 Thread Marcelo Vanzin
On Wed, Oct 14, 2015 at 10:29 AM, Florian Kaspar wrote: > so it is possible to simply copy the YARN configuration from the remote > cluster to the local machine (assuming, the local machine can resolve the > YARN host etc.) and just letting Spark do the rest? > Yes,
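A minimal sketch of what that looks like, assuming HADOOP_CONF_DIR (or YARN_CONF_DIR) on the driver machine points at the configuration copied from the remote cluster:

    import org.apache.spark.{SparkConf, SparkContext}

    // The YARN ResourceManager is located through the Hadoop config on the classpath,
    // so only the master string changes compared to standalone mode.
    val conf = new SparkConf()
      .setAppName("remote-yarn-app")
      .setMaster("yarn-client")
    val sc = new SparkContext(conf)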

Programmatically connect to remote YARN in yarn-client mode

2015-10-14 Thread Florian Kaspar
Hey everyone, we are working on a project running on Spark. Currently we connect to a remote Spark-Cluster in Standalone mode to obtain the SparkContext using new JavaSparkContext(new SparkConf().setAppName("").setMaster("spark://:7077")); Currently, we try to connect to a remote (!) YARN

Re: Building with SBT and Scala 2.11

2015-10-14 Thread Adrian Tanase
You are correct, of course. Gave up on sbt for spark long ago, I never managed to get it working while mvn works great. Sent from my iPhone On 14 Oct 2015, at 16:52, Ted Yu > wrote: Adrian: Likely you were using maven. Jakob's report was with

Reusing Spark Functions

2015-10-14 Thread Starch, Michael D (398M)
All, Is a Function object in Spark reused on a given executor, or is it sent and deserialized with each new task? On my project, we have functions that incur a very large setup cost, but then could be called many times. Currently, I am using object deserialization to run this intensive setup,

Re: SPARK SQL Error

2015-10-14 Thread pnpritchard
I think the stack trace is quite informative. Assuming line 10 of CsvDataSource is "val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> args(1),"header"->"true"))", then the "args(1)" call is throwing an ArrayIndexOutOfBoundsException. The reason for this is because you aren't

Re: thriftserver: access temp dataframe from in-memory of spark-shell

2015-10-14 Thread Michael Armbrust
Yes, call startWithContext from the spark shell: https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56 On Wed, Oct 14, 2015 at 7:10 AM, wrote: > Hi, > > Is it possible to
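A minimal sketch of that flow from the shell (assuming the shell's sqlContext is a HiveContext, which it is in builds with Hive support; the table name is made up):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = sqlContext.asInstanceOf[HiveContext]
    import hiveContext.implicits._

    val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("key", "value")
    df.registerTempTable("my_temp_table")

    // beeline/JDBC clients can now query my_temp_table (default port 10000)
    HiveThriftServer2.startWithContext(hiveContext)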

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-14 Thread Spark Newbie
I ran 2 different spark 1.5 clusters that have been running for more than a day now. I do see jobs getting aborted due to task retries maxing out (default 4) due to ConnectionException. It seems like the executors die and get restarted and I was unable to find the root cause (same app code and

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread Deenar Toraskar
collect_set and collect_list are built-in User Defined functions see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF On 14 October 2015 at 03:45, SLiZn Liu wrote: > Hi Michael, > > Can you be more specific on `collect_set`? Is it a built-in function
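A short sketch of how that is typically used (requires a HiveContext; table and column names are made up):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val df = sc.parallelize(Seq(("u1", "a"), ("u1", "b"), ("u2", "c"))).toDF("uid", "item")
    df.registerTempTable("events")

    // collect_list is a Hive UDAF, so it is invoked through HiveQL
    hiveContext.sql("SELECT uid, collect_list(item) AS items FROM events GROUP BY uid").show()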

Re: OutOfMemoryError When Reading Many json Files

2015-10-14 Thread Deenar Toraskar
Hi, Why don't you check if you can just process the large file standalone and then do the outer loop next. sqlContext.read.json(jsonFile) .select($"some", $"fields") .withColumn( "new_col", some_transformations($"col")) .rdd.map( x: Row => (k, v) ) .combineByKey() Deenar On 14 October 2015 at

spark sql OOM

2015-10-14 Thread Andy Zhao
Hi guys, I'm testing Spark SQL 1.5.1, and I use hadoop-2.5.0-cdh5.3.2. One SQL query which ran successfully using Hive failed when I ran it using Spark SQL. I got the following error:

Re: Machine learning with spark (book code example error)

2015-10-14 Thread Nick Pentreath
Hi there. I'm the author of the book (thanks for buying it by the way :) Ideally if you're having any trouble with the book or code, it's best to contact the publisher and submit a query ( https://www.packtpub.com/books/content/support/17400) However, I can help with this issue. The problem is

[SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-14 Thread Terry Hoo
All, Does anyone meet memory leak issue with spark streaming and spark sql in spark 1.5.1? I can see the memory is increasing all the time when running this simple sample: val sc = new SparkContext(conf) val sqlContext = new HiveContext(sc) import sqlContext.implicits._

Re: HiveThriftServer not registering with Zookeeper

2015-10-14 Thread Xiaoyu Wang
I created a JIRA and pull request for this issue. https://issues.apache.org/jira/browse/SPARK-11100 On 2015-10-13 16:36, Xiaoyu Wang wrote: I have the same issue. I think the Spark thrift server does not support HA with ZooKeeper now. On 2015-09-01 18:10, sreeramvenkat wrote: Hi, I am trying to setup

Node affinity for Kafka-Direct Stream

2015-10-14 Thread Gerard Maas
In the receiver-based kafka streaming model, given that each receiver starts as a long-running task, one can rely on a certain degree of data locality based on the kafka partitioning: Data published on a given topic/partition will land on the same spark streaming receiving node until the receiver

EdgeTriplet showing two versions of the same vertex

2015-10-14 Thread rohit13k
Hi, I have a scenario where in the graph I am doing graph.vertices.collect() and getting the 5 vertices I added. Each of my vertices is a Scala object as shown below: class NodeExact(nodeId: Long, summ: Array[collection.mutable.Map[Long, Long]]) extends Serializable { var node: Long = nodeId var

Re: graphx - mutable?

2015-10-14 Thread rohit13k
Hi, I am also working on the same area, where the graph evolves over time and the current approach of rebuilding the graph again and again is very slow and memory consuming. Did you find any workaround? What was your use case? -- View this message in context:

Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-14 Thread Joshua Fox
Thank you! It seems that the history server at port 18080 also gives access to the Spark GUI, as below. Following your tip, I see that the YARN ResourceManager GUI on 8088 indeed has that ApplicationMaster link, though to a private rather than public IP; replacing IPs brings me to the same

java.io.InvalidClassException using spark1.4.1 for Terasort

2015-10-14 Thread Shreeharsha G Neelakantachar
Hi, I have Terasort being executed on spark1.4.1 with hadoop 2.7 for a datasize of 1TB. When I change my OS user from spark1 to hduser, I am observing the exception below. Please let me know what is wrong here. I tried to update scala-2.10 to 2.11 and compiled the Terasort scala code using sbt.

Re: Machine learning with spark (book code example error)

2015-10-14 Thread Fengdong Yu
I don't recommend this code style; you'd better brace the function block. val testLabels = testRDD.map { case (file, text) => { val topic = file.split("/").takeRight(2).head newsgroupsMap(topic) } } > On Oct 14, 2015, at 15:46, Nick Pentreath wrote: > > Hi there.
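The same suggestion laid out over multiple lines (testRDD and newsgroupsMap are stand-ins for the book's variables):

    val newsgroupsMap = Map("sci.space" -> 0, "rec.sport.hockey" -> 1)
    val testRDD = sc.parallelize(Seq(
      ("20news/sci.space/12345", "some text"),
      ("20news/rec.sport.hockey/67890", "other text")))

    // bracing the function block makes both statements part of the map body
    val testLabels = testRDD.map { case (file, text) =>
      val topic = file.split("/").takeRight(2).head
      newsgroupsMap(topic)
    }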

If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-14 Thread Reynold Xin
Can you reply to this email and provide us with reasons why you disable it? Thanks.

Re: TTL for saveAsObjectFile()

2015-10-14 Thread Calvin Jia
Hi Antonio, I don't think Spark provides a way to pass down params with saveAsObjectFile. One way could be to pass a default TTL in the configuration, but the approach doesn't make much sense since TTL is not necessarily uniform. Baidu will be talking about their use of TTL in Tachyon with Spark

Re: Node affinity for Kafka-Direct Stream

2015-10-14 Thread Gerard Maas
Hi Cody, I think that I misused the term 'data locality'. I think I should better call it "node affinity" instead, as this is what I would like to have: For as long as an executor is available, I would like to have the same kafka partition processed by the same node in order to take advantage of

Re: Node affinity for Kafka-Direct Stream

2015-10-14 Thread Cody Koeninger
What I'm saying is that it's not a given with spark, even in receiver-based mode, because as soon as you lose an executor you'll have a rebalance. Spark's model in general isn't a good fit for pinning work to specific nodes. If you really want to try and fake this, you can override

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-14 Thread Tathagata Das
When a job gets aborted, it means that the internal tasks were retried a number of times before the system gave up. You can control the number of retries (see Spark's configuration page). The job by default does not get resubmitted. You could try getting the logs of the failed executor, to see what
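The relevant setting is spark.task.maxFailures (default 4); a one-line sketch:

    import org.apache.spark.SparkConf

    // tolerate up to 8 failures of any one task before aborting the job
    val conf = new SparkConf().set("spark.task.maxFailures", "8")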

Re: spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Josh Rosen
Can you report this as an issue at https://github.com/databricks/spark-avro/issues so that it's easier to track? Thanks! On Wed, Oct 14, 2015 at 1:38 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > I save my dataframe to avro with spark-avro 1.0.0 and it looks like this > (using

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
Caching the partitioned_df <- this one, but you have to do the partitioning using something like sql("SELECT * FROM ... CLUSTER BY a"), as there is no such operation exposed on dataframes. 2) Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-5354
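A sketch of option 1) under those constraints (HiveContext; table and column names are made up):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val df = sc.parallelize(Seq((1L, "a"), (2L, "b"))).toDF("user_id", "value")
    df.registerTempTable("events")

    // partition by user_id via SQL, then cache so later joins/groupBys reuse the shuffle
    hiveContext.sql("SELECT * FROM events CLUSTER BY user_id").registerTempTable("events_clustered")
    hiveContext.cacheTable("events_clustered")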

Re: Problem installing Spark on Windows 8

2015-10-14 Thread Marco Mistroni
Thanks Steve, I followed the instructions; Spark is started and I can see the web UI. However, after launching spark-shell I am getting another exception. Is this preventing me from actually using Spark? Kind regards, Marco 15/10/14 20:52:35 WARN : Your hostname, MarcoLaptop resolves to a loopback/non-r

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Bernardo Vecchia Stein
Hi Deenar, Yes, the native library is installed on all machines of the cluster. I tried a simpler approach by just using System.load() and passing the exact path of the library, and things still won't work (I get exactly the same error and message). Any ideas of what might be failing? Thank

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
Hi Bernardo, So is this in distributed mode? or single node? Maybe fix the issue with a single node first ;) You are right that Spark finds the library but not the *.so file. I also use System.load() with LD_LIBRARY_PATH set, and I am able to execute without issues. Maybe you'd like to double

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Bernardo Vecchia Stein
Hi Renato, I am using a single master and a single worker node, both in the same machine, to simplify everything. I have tested with System.loadLibrary() as well (setting all the necessary paths) and get the same error. Just double checked everything and the parameters are fine. Bernardo On 14

spark-shell :javap fails with complaint about JAVA_HOME, but it is set correctly

2015-10-14 Thread Robert Dodier
Hi, I am working with Spark 1.5.1 (official release), with Oracle Java8, on Ubuntu 14.04. echo $JAVA_HOME says "/usr/lib/jvm/java-8-oracle". I'd like to use :javap in spark-shell, but I get an error message: scala> :javap java.lang.Object Failed: Could not load javap tool. Check that JAVA_HOME

Re: Problem installing Spark on Windows 8

2015-10-14 Thread Steve Loughran
On 14 Oct 2015, at 20:56, Marco Mistroni > wrote: 15/10/14 20:52:35 WARN : Your hostname, MarcoLaptop resolves to a loopback/non-r eachable address: fe80:0:0:0:c5ed:a66d:9d95:5caa%wlan2, but we couldn't find any external IP address!

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
Sorry Bernardo, I just double checked. I use: System.loadLibrary(); Could you also try that? Renato M. 2015-10-14 21:51 GMT+02:00 Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.com>: > Hi Bernardo, > > So is this in distributed mode? or single node? Maybe fix the issue with a >

Spark streaming checkpoint against s3

2015-10-14 Thread Tian Zhang
Hi, I am trying to set spark streaming checkpoint to s3, here is what I did basically val checkpoint = "s3://myBucket/checkpoint" val ssc = StreamingContext.getOrCreate(checkpointDir, () => getStreamingContext(sparkJobName,
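For reference, a minimal sketch of the getOrCreate pattern with an S3 checkpoint directory (the bucket name, batch interval and the socket source are made up; S3 credentials/Hadoop config are assumed to be in place):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "s3://my-bucket/checkpoint"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-stream")
      val ssc = new StreamingContext(conf, Seconds(30))
      ssc.checkpoint(checkpointDir)                              // checkpoints go to S3
      ssc.socketTextStream("localhost", 9999).count().print()    // placeholder DStream graph
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()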

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
You can also try setting the env variable LD_LIBRARY_PATH to point where your compiled libraries are. Renato M. 2015-10-14 21:07 GMT+02:00 Bernardo Vecchia Stein : > Hi Deenar, > > Yes, the native library is installed on all machines of the cluster. I > tried a

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Bernardo Vecchia Stein
Hi Renato, I have done that as well, but so far no luck. I believe spark is finding the library correctly, otherwise the error message would be "no libraryname found" or something like that. The problem seems to be something else, and I'm not sure how to find it. Thanks, Bernardo On 14 October

spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Alex Nastetsky
I save my dataframe to avro with spark-avro 1.0.0 and it looks like this (using avro-tools tojson): {"field1":"value1","field2":976200} {"field1":"value2","field2":976200} {"field1":"value3","field2":614100} But when I use spark-avro 2.0.1, it looks like this:

PySpark - Hive Context Does Not Return Results but SQL Context Does for Similar Query.

2015-10-14 Thread charles.drotar
I have duplicated my submission to stack overflow below since it is exactly the same question I would like to post here as well. Please don't judge me too harshly for my laziness *The questions I

Spark 1.5.1 ClassNotFoundException in cluster mode.

2015-10-14 Thread Renato Perini
Hello. I have developed a Spark job using a jersey client (1.9 included with Spark) to make some service calls during data computations. Data is read and written on an Apache Cassandra 2.2.1 database. When I run the job in local mode, everything works nicely. But when I execute my job in

Re: spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Alex Nastetsky
Here you go: https://github.com/databricks/spark-avro/issues/92 Thanks. On Wed, Oct 14, 2015 at 4:41 PM, Josh Rosen wrote: > Can you report this as an issue at > https://github.com/databricks/spark-avro/issues so that it's easier to > track? Thanks! > > On Wed, Oct 14,

IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
It looks like Spark 1.5.1 does not work with IPv6. When adding -Djava.net.preferIPv6Addresses=true on my dual stack server, the driver fails with: 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext. java.lang.AssertionError: assertion failed: Expected hostname at

Re: IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
Specifically, something like this should probably do the trick: def checkHost(host: String, message: String = "") { assert(!HostAndPort.fromString(host).hasPort, message) } def checkHostPort(hostPort: String, message: String = "") { assert(HostAndPort.fromString(hostPort).hasPort,
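Formatted, the suggested replacement reads as below (Guava's HostAndPort parses bracketed IPv6 literals instead of naively splitting on ':'):

    import com.google.common.net.HostAndPort

    def checkHost(host: String, message: String = ""): Unit = {
      assert(!HostAndPort.fromString(host).hasPort, message)
    }

    def checkHostPort(hostPort: String, message: String = ""): Unit = {
      assert(HostAndPort.fromString(hostPort).hasPort, message)
    }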

Spark Master Dying saying TimeoutException

2015-10-14 Thread Kartik Mathur
Hi, I have some nightly jobs which run every night but sometimes die because of an unresponsive master; the spark master logs say - Not seeing much else there, what could possibly cause an exception like this? *Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out

Re: Spark 1.5.1 ClassNotFoundException in cluster mode.

2015-10-14 Thread Dean Wampler
There is a Datastax Spark connector library jar file that you probably have on your CLASSPATH locally, but not on the cluster. If you know where it is, you could either install it on each node in some location on their CLASSPATHs or, when you submit the job, pass the jar file using the "--jars"
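For example (the paths, class and master URL below are placeholders):

    spark-submit --class com.example.MyJob \
      --master spark://master:7077 \
      --jars /path/to/spark-cassandra-connector-assembly.jar \
      my-application.jar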

Re: Spark streaming checkpoint against s3

2015-10-14 Thread Tian Zhang
It looks like that reconstruction of SparkContext from checkpoint data is trying to look for the jar file of previous failed runs. It can not find the jar files as our jar files are on local machines and were cleaned up after each failed run. -- View this message in context:

Re: PySpark - Hive Context Does Not Return Results but SQL Context Does for Similar Query.

2015-10-14 Thread Michael Armbrust
No link to the original stack overflow so I can up my reputation? :) This is likely not a difference between HiveContext/SQLContext, but instead a difference between a table where the metadata is coming from the HiveMetastore vs the SparkSQL Data Source API. I would guess that if you create the

stability of Spark 1.4.1 with Python 3 versions

2015-10-14 Thread shoira.mukhsin...@bnpparibasfortis.com
Dear Spark Community, The official documentation of Spark 1.4.1 mentions that Spark runs on Python 2.6+ http://spark.apache.org/docs/1.4.1/ It is not clear if by "Python 2.6+" do you also mean Python 3.4 or not. There is a resolved issue on this point which makes me believe that it does run on

Re: java.io.InvalidClassException using spark1.4.1 for Terasort

2015-10-14 Thread Sonal Goyal
This is probably a versioning issue, are you sure your code is compiling and running against the same versions? On Oct 14, 2015 2:19 PM, "Shreeharsha G Neelakantachar" < shreeharsh...@in.ibm.com> wrote: > Hi, > I have Terasort being executed on spark1.4.1 with hadoop 2.7 for a > datasize of

Re: spark sql OOM

2015-10-14 Thread Andy Zhao
I increased executor memory from 6g to 10g, but it still failed and reported the same error, and because of my company's security policy, I cannot write the SQL out. But I'm sure that this error occurred in the compute method of HadoopRDD, and this error happened in one of the executors. -- View this

Re: OutOfMemoryError When Reading Many json Files

2015-10-14 Thread SLiZn Liu
Yes it went wrong when processing a large file only. I removed transformations on DF, and it worked just fine. But doing a simple filter operation on the DF became the last straw that breaks the camel’s back. That’s confusing. ​ On Wed, Oct 14, 2015 at 2:11 PM Deenar Toraskar

Re: spark sql OOM

2015-10-14 Thread cherrywayb...@gmail.com
Hi, please increase your memory. cherrywayb...@gmail.com From: Andy Zhao Date: 2015-10-14 17:40 To: user Subject: spark sql OOM Hi guys, I'm testing sparkSql 1.5.1, and I use hadoop-2.5.0-cdh5.3.2. One sql which can ran successfully using hive failed when I ran it using sparkSql. I got the

Fwd: Partition Column in JDBCRDD or Datasource API

2015-10-14 Thread satish chandra j
Hi All, Please give me some input on the *Partition Column* to be used in the DataSource API or JDBCRDD to define the lowerbound and upperbound values, which are used to determine the number of partitions. The issue is that my source table does not have a numeric column which is sequential and unique, such that proper

Re: Problem installing Spark on Windows 8

2015-10-14 Thread Raghavendra Pandey
Looks like you are facing an IPv6 issue. Can you try turning the preferIPv4 property on? On Oct 15, 2015 2:10 AM, "Steve Loughran" wrote: > > On 14 Oct 2015, at 20:56, Marco Mistroni wrote: > > > 15/10/14 20:52:35 WARN : Your hostname, MarcoLaptop resolves to
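One way to try that (a sketch; flag placement may vary with how the shell is launched):

    spark-shell --driver-java-options "-Djava.net.preferIPv4Stack=true"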

Re: Application not found in Spark historyserver in yarn-client mode

2015-10-14 Thread Ted Yu
Which Spark release are you using ? Thanks On Wed, Oct 14, 2015 at 4:20 PM, Anfernee Xu wrote: > Hi, > > Here's the problem I'm facing, I have a standalone java application which > is periodically submit Spark jobs to my yarn cluster, btw I'm not using > 'spark-submit'

Re: Spark Master Dying saying TimeoutException

2015-10-14 Thread Raghavendra Pandey
I fixed these timeout errors by retrying... On Oct 15, 2015 3:41 AM, "Kartik Mathur" wrote: > Hi, > > I have some nightly jobs which runs every night but dies sometimes because > of unresponsive master , spark master logs says - > > Not seeing much else there , what could

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-14 Thread Lan Jiang
Thank you, Akhil. Actually the problem was solved last week and I did not have time to report back. The error was caused by YARN killing the container because executors used more off-heap memory than they were assigned. There was nothing in the executor log, but the AM log clearly states this is

spark streaming filestream API

2015-10-14 Thread Chandra Mohan, Ananda Vel Murugan
Hi All, I have a directory in HDFS which I want to monitor, and whenever there is a new file in it, I want to parse that file and load the contents into a Hive table. The file format is proprietary and I have Java parsers for parsing it. I am building a Spark Streaming application for this workflow.

Re: spark streaming filestream API

2015-10-14 Thread Akhil Das
Key and Value are the ones that you are using with your InputFormat. Eg: JavaReceiverInputDStream lines = jssc.fileStream("/sigmoid", LongWritable.class, Text.class, TextInputFormat.class); TextInputFormat uses the LongWritable as Key and Text as Value classes. If your data is plain CSV or text
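A Scala sketch of the same call (the directory is a placeholder; note the InputFormat must come from the new org.apache.hadoop.mapreduce API):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30))
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///incoming")
      .map { case (_, text) => text.toString }   // drop the LongWritable offset key
    lines.print()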

Re: how to use SharedSparkContext

2015-10-14 Thread Fengdong Yu
oh, Yes. Thanks much. > On Oct 14, 2015, at 18:47, Akhil Das wrote: > > com.holdenkarau.spark.testing

Re: Node affinity for Kafka-Direct Stream

2015-10-14 Thread Saisai Shao
You could check the code of KafkaRDD; the locality (host) is obtained from Kafka's partition and set in KafkaRDD, and this is a hint for Spark to schedule the task on the preferred location. override def getPreferredLocations(thePart: Partition): Seq[String] = { val part =
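For reference, the rest of that method in Spark 1.x's KafkaRDD is roughly:

    override def getPreferredLocations(thePart: Partition): Seq[String] = {
      val part = thePart.asInstanceOf[KafkaRDDPartition]
      Seq(part.host)   // the leader host recorded for this Kafka partition
    }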

Re: Changing application log level in standalone cluster

2015-10-14 Thread Akhil Das
You should be able to do that from your application. In the beginning of the application, just add: import org.apache.log4j.Logger import org.apache.log4j.Level Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) That will switch off the logs. Thanks Best

Re: Node affinity for Kafka-Direct Stream

2015-10-14 Thread Rishitesh Mishra
Hi Gerard, I am also trying to understand the same issue. From whatever code I have seen, it looks like once the Kafka RDD is constructed, the execution of that RDD is up to the task scheduler, and it can schedule the partitions based on the load on nodes. There is a preferred node specified in the Kafka RDD. But

RE: Spark DataFrame GroupBy into List

2015-10-14 Thread java8964
My guess is the same as UDAF of (collect_set) in Hive. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF) Yong From: sliznmail...@gmail.com Date: Wed, 14 Oct 2015 02:45:48 + Subject: Re: Spark DataFrame GroupBy into List To:

Re: spark sql OOM

2015-10-14 Thread Fengdong Yu
Can you search the mail archive before asking the question? At least search for how to ask the question. Nobody can give you an answer if you don't paste your SQL or SparkSQL code. > On Oct 14, 2015, at 17:40, Andy Zhao wrote: > > Hi guys, > > I'm testing sparkSql 1.5.1,

Re: Cannot connect to standalone spark cluster

2015-10-14 Thread Akhil Das
Open a spark-shell by: MASTER=Ellens-MacBook-Pro.local:7077 bin/spark-shell And if its able to connect, then check your java projects build file and make sure you are having the proper spark version. Thanks Best Regards On Sat, Oct 10, 2015 at 3:07 AM, ekraffmiller

Re: how to use SharedSparkContext

2015-10-14 Thread Akhil Das
Did a quick search and found the following, I haven't tested it myself. Add the following to your build.sbt libraryDependencies += "com.holdenkarau" % "spark-testing-base_2.10" % "1.5.0_1.4.0_1.4.1_0.1.2" Create a class extending com.holdenkarau.spark.testing.SharedSparkContext And you
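A minimal sketch of such a test (assumes scalatest is also on the test classpath; class and test names are made up):

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    class WordCountSpec extends FunSuite with SharedSparkContext {
      test("uses the shared SparkContext") {
        val rdd = sc.parallelize(Seq("a", "b", "c"))   // sc comes from SharedSparkContext
        assert(rdd.count() === 3)
      }
    }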

Sensitivity analysis using Spark MlLib

2015-10-14 Thread Sourav Mazumder
Is there any algorithm implemented in Spark MLlib which supports parameter sensitivity analysis? After the model is created using a training data set, the model should be able to tell, among the various features used, which are the most important ones (from the perspective of their contribution

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread Michael Armbrust
Thats correct. It is a Hive UDAF. On Wed, Oct 14, 2015 at 6:45 AM, java8964 wrote: > My guess is the same as UDAF of (collect_set) in Hive. > > > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF) > >

Re: Reusing Spark Functions

2015-10-14 Thread Michael Armbrust
Unless its a broadcast variable, a new copy will be deserialized for every task. On Wed, Oct 14, 2015 at 10:18 AM, Starch, Michael D (398M) < michael.d.sta...@jpl.nasa.gov> wrote: > All, > > Is a Function object in Spark reused on a given executor, or is sent and > deserialized with each new
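A sketch of the broadcast approach (ExpensiveModel is hypothetical; each executor then deserializes the object once rather than once per task):

    class ExpensiveModel extends Serializable {
      // imagine heavy setup here (lookup tables, native resources, ...)
      def score(x: Int): Int = x * 2
    }

    val modelBc = sc.broadcast(new ExpensiveModel)

    val scored = sc.parallelize(1 to 1000)
      .map(x => modelBc.value.score(x))   // reuses the single per-executor copy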

Re: stability of Spark 1.4.1 with Python 3 versions

2015-10-14 Thread Nicholas Chammas
The Spark 1.4 release notes say that Python 3 is supported. The 1.4 docs are incorrect, and the 1.5 programming guide has been updated to indicate Python 3 support. On Wed, Oct 14, 2015 at 7:06 AM shoira.mukhsin...@bnpparibasfortis.com

Re: writing to hive

2015-10-14 Thread Ted Yu
Can you show your query ? Thanks > On Oct 13, 2015, at 12:29 AM, Hafiz Mujadid wrote: > > hi! > > I am following this > > > tutorial to read and write from hive. But i am facing

RE: spark streaming filestream API

2015-10-14 Thread Chandra Mohan, Ananda Vel Murugan
Hi, Thanks for your response. My input format is the one I have created to handle the files as a whole i.e. WholeFileInputFormat I wrote one based on this example https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3 In

Re: spark streaming filestream API

2015-10-14 Thread Akhil Das
Yes, that is correct. When you import the K,V classes, make sure you import it from the hadoop.io package. import org.apache.hadoop.io.BytesWritable; import org.apache.hadoop.io.NullWritable; Thanks Best Regards On Wed, Oct 14, 2015 at 6:26 PM, Chandra Mohan, Ananda Vel Murugan <

Re: Node affinity for Kafka-Direct Stream

2015-10-14 Thread Saisai Shao
This preferred locality is a hint to spark to schedule Kafka tasks on the preferred nodes, if Kafka and Spark are two separate cluster, obviously this locality hint takes no effect, and spark will schedule tasks following node-local -> rack-local -> any pattern, like any other spark tasks. On

Re: unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-14 Thread Ted Yu
This might be related : http://search-hadoop.com/m/q3RTta8AxS1UjMSI=Cannot+get+spark+streaming_2+10+1+5+0+pom+from+the+maven+repository > On Oct 12, 2015, at 11:30 PM, Akhil Das wrote: > > You need to add "org.apache.spark" % "spark-streaming_2.10" % "1.5.0" to the

Re: Node affinity for Kafka-Direct Stream

2015-10-14 Thread Gerard Maas
Thanks Saisai, Mishra. Indeed, that hint will only work in the case where the Spark executor is co-located with the Kafka broker. I think the answer to my question as stated is that there's no guarantee of where the task will execute, as it will depend on the scheduler and cluster resources

dataframes and numPartitions

2015-10-14 Thread Alex Nastetsky
A lot of RDD methods take a numPartitions parameter that lets you specify the number of partitions in the result. For example, groupByKey. The DataFrame counterparts don't have a numPartitions parameter, e.g. groupBy only takes a bunch of Columns as params. I understand that the DataFrame API is

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread SLiZn Liu
Thanks, Michael and java8964! Does Hive Context also provide a UDF for combining existing lists into a flattened (not nested) list? (list -> list of lists -[flatten]-> list). On Thu, Oct 15, 2015 at 1:16 AM Michael Armbrust wrote: > Thats correct. It is a Hive UDAF. > > On

Re: Building with SBT and Scala 2.11

2015-10-14 Thread Ted Yu
Adrian: Likely you were using maven. Jakob's report was with sbt. Cheers On Tue, Oct 13, 2015 at 10:05 PM, Adrian Tanase wrote: > Do you mean hadoop-2.4 or 2.6? not sure if this is the issue but I'm also > compiling the 1.5.1 version with scala 2.11 and hadoop 2.6 and it

thriftserver: access temp dataframe from in-memory of spark-shell

2015-10-14 Thread Saif.A.Ellafi
Hi, Is it possible to load a spark-shell, in which we do any number of operations in a dataframe, then register it as a temporary table and get to see it through thriftserver? ps. or even better, submit a full job and store the dataframe in thriftserver in-memory before the job completes. I

Application not found in Spark historyserver in yarn-client mode

2015-10-14 Thread Anfernee Xu
Hi, Here's the problem I'm facing, I have a standalone java application which is periodically submit Spark jobs to my yarn cluster, btw I'm not using 'spark-submit' or 'org.apache.spark.launcher' to submit my jobs. These jobs are successful and I can see them on Yarn RM webUI, but when I want to
