RE: SparkML algos limitations question.

2016-01-04 Thread Ulanov, Alexander
Hi Yanbo, As long as two models fit into the memory of a single machine, there should be no problems, so even 16GB machines can handle large models. (The master should have more memory because it runs LBFGS.) In my experiments, I’ve trained models with 12M and 32M parameters without issues. Best

Re: Does state survive application restart in StatefulNetworkWordCount?

2016-01-04 Thread Tathagata Das
It does get recovered if you restart from checkpoints. See the example RecoverableNetworkWordCount.scala On Sat, Jan 2, 2016 at 6:22 AM, Rado Buranský wrote: > I am trying to understand how state in Spark Streaming works in general. > If I run this example program twice
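For reference, a minimal sketch of the checkpoint-recovery pattern that example uses (the checkpoint path and the stateful computation are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/checkpoint" // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("RecoverableWordCount")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... define the stateful computation (e.g. updateStateByKey) here ...
      ssc
    }

    // On restart, state is rebuilt from the checkpoint if one exists;
    // otherwise createContext() builds a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()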

Re: Spark Job Server with Yarn and Kerberos

2016-01-04 Thread Michael Segel
It's been a while... but this isn’t a Spark issue. A Spark job on YARN runs as a regular job. What happens when you run a regular M/R job as that user? I don’t think we did anything special... > On Jan 4, 2016, at 12:22 PM, Mike Wright > wrote: >

RE: Is Spark 1.6 released?

2016-01-04 Thread Saif.A.Ellafi
Where can I read more about the Dataset API at the user level? I am failing to find an API doc or to understand when to use DataFrame vs. Dataset, the advantages, etc. Thanks, Saif -Original Message- From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] Sent: Monday, January 04, 2016 2:01 PM To:

Re: Spark Streaming Application is Stuck Under Heavy Load Due to DeadLock

2016-01-04 Thread Shixiong Zhu
Hey Rachana, could you provide the full jstack output? Maybe it's the same as https://issues.apache.org/jira/browse/SPARK-11104 Best Regards, Shixiong Zhu 2016-01-04 12:56 GMT-08:00 Rachana Srivastava < rachana.srivast...@markmonitor.com>: > Hello All, > > > > I am running my application on Spark

Re: HiveThriftServer fails to quote strings

2016-01-04 Thread Ted Yu
bq. without any of the escape characters: Did you intend to show a sample? As far as I can tell, there was no sample or image in the previous email. FYI On Mon, Jan 4, 2016 at 11:36 AM, sclyon wrote: > Hello all, > > I've got a nested JSON structure in parquet format

Comparing Subsets of an RDD

2016-01-04 Thread Daniel Imberman
Hi, I’m looking for a way to compare subsets of an RDD intelligently. Let's say I had an RDD with key/value pairs of type (Int->T). I eventually need to say “compare all values of key 1 with all values of key 2, and compare the values of key 3 to the values of key 5 and key 7”, how would I go about
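One way to go about it, sketched under the assumption that the per-key groups are small enough to pair up exhaustively (compareKeys and compare are hypothetical names):

    import org.apache.spark.rdd.RDD

    // Compare every value under key k1 with every value under key k2.
    def compareKeys[T](rdd: RDD[(Int, T)], k1: Int, k2: Int)
                      (compare: (T, T) => Unit): Unit = {
      val left  = rdd.filter { case (k, _) => k == k1 }.values
      val right = rdd.filter { case (k, _) => k == k2 }.values
      // cartesian pairs every left value with every right value;
      // cheap for small groups, expensive for large ones
      left.cartesian(right).foreach { case (a, b) => compare(a, b) }
    }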

Re: Is Spark 1.6 released?

2016-01-04 Thread Ted Yu
Please refer to the following: https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets https://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets Cheers On Mon, Jan 4, 2016 at

Re: Is Spark 1.6 released?

2016-01-04 Thread Michael Armbrust
I also wrote about it here: https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html And put together a bunch of examples here: https://docs.cloud.databricks.com/docs/spark/1.6/index.html On Mon, Jan 4, 2016 at 12:02 PM, Annabel Melongo < melongo_anna...@yahoo.com.invalid> wrote:

HiveThriftServer fails to quote strings

2016-01-04 Thread sclyon
Hello all, I've got a nested JSON structure in parquet format that I'm having some issues with when trying to query it through Hive. In Spark (1.5.2) the column is represented correctly: However, when queried from Hive I get the same column but without any of the escape characters: Naturally

Re: Is Spark 1.6 released?

2016-01-04 Thread Annabel Melongo
[1] http://spark.apache.org/releases/spark-release-1-6-0.html [2] http://spark.apache.org/downloads.html On Monday, January 4, 2016 2:59 PM, "saif.a.ell...@wellsfargo.com" wrote: Where can I read more about the dataset api on a user layer? I am failing to

Re: Does state survive application restart in StatefulNetworkWordCount?

2016-01-04 Thread Rado Buranský
I asked the question on Twitter and got this response: https://twitter.com/jaceklaskowski/status/683923649632579588 Is Jacek right? If you stop and then start the application correctly, then the state is not recovered? It is recovered only in case of failure? On Mon, Jan 4, 2016 at 8:19 PM,

Spark Streaming Application is Stuck Under Heavy Load Due to DeadLock

2016-01-04 Thread Rachana Srivastava
Hello All, I am running my application on a Spark cluster, but under heavy load the system hangs due to a deadlock. I found a similar issue resolved here https://datastax-oss.atlassian.net/browse/JAVA-555 in version 2.1.3, but I am running on Spark 1.3 and still getting the same issue. Here

Re: HiveThriftServer fails to quote strings

2016-01-04 Thread Scott Lyons
Apparently nabble ate my code samples. In Spark (1.5.2) the column is represented correctly:

    sqlContext.sql("SELECT * FROM tempdata").collect()
    [{"PageHtml":"{\\"time\\":0}"}]

However, when queried from Hive I get the same column but without any of the escape characters: Beeline (or PyHive) >

Re: pyspark streaming crashes

2016-01-04 Thread Antony Mayi
Just for reference: in my case this problem is caused by this bug: https://issues.apache.org/jira/browse/SPARK-12617 On Monday, 21 December 2015, 14:32, Antony Mayi wrote: I noticed it might be related to longer GC pauses (1-2 sec) - the crash usually occurs

Re: Unable to run spark SQL Join query.

2016-01-04 Thread ๏̯͡๏
There are three tables in action here. Table A (success_events.sojsuccessevents1) JOIN Table B (dw_bid) to create Table C (sojsuccessevents2_spark). Now, table success_events.sojsuccessevents1 has itemid, which I confirmed by running describe success_events.sojsuccessevents1 from the spark-sql shell. I

stopping a process using an RDD

2016-01-04 Thread domibd
Hello, Is there a way to stop a process (like map-reduce) over an RDD under a condition? (This could be used if the process does not always need to explore the whole RDD.) Thanks, Dominique

Is Spark 1.6 released?

2016-01-04 Thread Jung
Hi, There were Spark 1.6 jars in Maven Central and on GitHub. I found them 5 days ago, but the release doesn't appear on the Spark website now. May I regard the Spark 1.6 zip file on GitHub as a stable release? Thanks, Jung

Re: Is Spark 1.6 released?

2016-01-04 Thread Jean-Baptiste Onofré
Hi Jung, yes, Spark 1.6.0 was released on December 28th. The artifacts are on Maven Central: http://repo1.maven.org/maven2/org/apache/spark/ However, the distribution is not available on dist.apache.org: https://dist.apache.org/repos/dist/release/spark/ Let me check with the team to

Re: Is Spark 1.6 released?

2016-01-04 Thread Michael Armbrust
> > bq. In many cases, the current implementation of the Dataset API does not > yet leverage the additional information it has and can be slower than RDDs. > > Are the characteristics of cases above known so that users can decide which > API to use ? > Lots of back to back operations aren't great
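For reference, a minimal typed Dataset example in the spirit of the 1.6 programming guide (Person and the JSON path are illustrative):

    case class Person(name: String, age: Long)

    import sqlContext.implicits._

    // Create a Dataset from local objects and use typed operations
    val ds = Seq(Person("Andy", 32)).toDS()
    ds.map(_.name).collect()

    // Convert a DataFrame to a typed Dataset
    val people = sqlContext.read
      .json("examples/src/main/resources/people.json")
      .as[Person]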

Re: Monitor Job on Yarn

2016-01-04 Thread Ted Yu
Please look at history server related content under: https://spark.apache.org/docs/latest/running-on-yarn.html Note spark.yarn.historyServer.address FYI On Mon, Jan 4, 2016 at 2:49 PM, Daniel Valdivia wrote: > Hello everyone, happy new year, > > I submitted an app to
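The relevant settings go in spark-defaults.conf; a sketch with placeholder host and path values:

    # On the submitting side: write event logs the history server can read
    spark.eventLog.enabled            true
    spark.eventLog.dir                hdfs:///spark-history
    # Where YARN should link to after the application finishes
    spark.yarn.historyServer.address  historyhost:18080
    # On the history server side: where to read the logs from
    spark.history.fs.logDirectory     hdfs:///spark-history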

Re: Monitor Job on Yarn

2016-01-04 Thread Marcelo Vanzin
You should be looking at the YARN RM web ui to monitor YARN applications; that will have a link to the Spark application's UI, along with other YARN-related information. Also, if you run the app in client mode, it might be easier to debug it until you know it's running properly (since you'll see

Re: groupByKey does not work?

2016-01-04 Thread Ted Yu
Can you give a bit more information? The release of Spark you're using, and a minimal dataset that shows the problem. Cheers On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra wrote: > I tried groupByKey and noticed that it did not group all values into the > same group. > > In my test

Re: groupByKey does not work?

2016-01-04 Thread Daniel Imberman
Could you please post the associated code and output? On Mon, Jan 4, 2016 at 3:55 PM Arun Luthra wrote: > I tried groupByKey and noticed that it did not group all values into the > same group. > > In my test dataset (a Pair rdd) I have 16 records, where there are only 4 >

Monitor Job on Yarn

2016-01-04 Thread Daniel Valdivia
Hello everyone, happy new year. I submitted an app to YARN; however, I'm unable to monitor its progress on the driver node, neither at :8080 nor :4040 as documented. When submitting in standalone mode I could monitor it, but that seems not to be the case right now. I submitted my app this way:

Re: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2016-01-04 Thread Tathagata Das
You could enforce the evaluation of the transformed DStream by putting a dummy output operation on it, and then do the windowing. transformedDStream.foreachRDD { _.count() } // to enforce evaluation of the transformation transformedDStream.window(...).foreachRDD { rdd => ... } On Thu, Dec 31,
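Spelled out as a small self-contained sketch (the source, transformation, and intervals are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("batched-output"), Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
    val transformed = lines.map(_.toUpperCase)          // placeholder transform

    // Dummy output operation: forces the transformation to run every batch
    transformed.foreachRDD(rdd => rdd.count())

    // Batch six 10s intervals together for the real, less frequent output
    transformed.window(Seconds(60), Seconds(60)).foreachRDD { rdd =>
      // write the windowed data here, e.g. rdd.saveAsTextFile(...)
    }

    ssc.start()
    ssc.awaitTermination()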

Re: Monitor Job on Yarn

2016-01-04 Thread Daniel Valdivia
I see, I guess I should have set up the history server. Strangely enough, peeking at YARN it seems like nothing is happening: it lists a single application running with 0% progress, but each node has 0 running containers, which confuses me as to whether anything is actually happening. Should I restart

groupByKey does not work?

2016-01-04 Thread Arun Luthra
I tried groupByKey and noticed that it did not group all values into the same group. In my test dataset (a Pair RDD) I have 16 records, where there are only 4 distinct keys, so I expected there to be 4 records in the groupByKey result, but instead there were 8. Each of the 4 distinct keys appears
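For comparison, a minimal sketch of the expected behavior with plain string keys:

    val rdd = sc.parallelize(Seq(("k1", 1), ("k1", 2), ("k2", 3), ("k2", 4)))
    rdd.groupByKey().collect()
    // expected: one entry per distinct key, e.g.
    // Array((k1,CompactBuffer(1, 2)), (k2,CompactBuffer(3, 4)))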

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
Spark 1.5.0

data:
p1,lo1,8,0,4,0,5,20150901|5,1,1.0
p1,lo2,8,0,4,0,5,20150901|5,1,1.0
p1,lo3,8,0,4,0,5,20150901|5,1,1.0
p1,lo4,8,0,4,0,5,20150901|5,1,1.0
p1,lo1,8,0,4,0,5,20150901|5,1,1.0
p1,lo2,8,0,4,0,5,20150901|5,1,1.0

Re: groupByKey does not work?

2016-01-04 Thread Daniel Imberman
Could you try simplifying the key and seeing if that makes any difference? Make it just a string or an int so we can rule out any issues with object equality. On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra wrote: > Spark 1.5.0 > > data: >

Re: copy/mv hdfs file to another directory by spark program

2016-01-04 Thread Don Drake
You will need to use the HDFS API to do that. Try something like:

    val conf = sc.hadoopConfiguration
    val fs = org.apache.hadoop.fs.FileSystem.get(conf)
    fs.rename(new org.apache.hadoop.fs.Path("/path/on/hdfs/file.txt"),
      new org.apache.hadoop.fs.Path("/path/on/hdfs/other/file.txt"))

Full API for
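rename covers the move case; for a true copy, something like Hadoop's FileUtil.copy should work (a sketch with placeholder paths):

    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = sc.hadoopConfiguration
    val fs = FileSystem.get(conf)
    // deleteSource = false copies; true makes it a move-by-copy
    FileUtil.copy(fs, new Path("/path/on/hdfs/file.txt"),
                  fs, new Path("/path/on/hdfs/other/file.txt"),
                  false, conf)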

copy/mv hdfs file to another directory by spark program

2016-01-04 Thread Zhiliang Zhu
For some file on HDFS, it is necessary to copy/move it to some other specific HDFS directory, with the name kept unchanged. This needs to be done in a Spark program, not with HDFS commands. Is there any code for this? It seems not to be covered by searching the Spark docs... Thanks in advance!

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
Hi Alexander, That's cool! Thanks for the clarification. Yanbo 2016-01-05 5:06 GMT+08:00 Ulanov, Alexander : > Hi Yanbo, > > > > As long as two models fit into memory of a single machine, there should be > no problems, so even 16GB machines can handle large models.

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
If I simplify the key to a String column with values lo1, lo2, lo3, lo4, it works correctly. On Mon, Jan 4, 2016 at 4:49 PM, Daniel Imberman wrote: > Could you try simplifying the key and seeing if that makes any difference? > Make it just a string or an int so we can

Re: groupByKey does not work?

2016-01-04 Thread Daniel Imberman
That's interesting. I would try:

    case class Mykey(uname: String)
    case class Mykey(uname: String, c1: Char)
    case class Mykey(uname: String, lo: String, f1: Char, f2: Char, f3: Char, f4: Char, f5: Char, f6: String)

in that order. It seems like there is some issue with equality between keys. On Mon, Jan 4,
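A quick local sanity check for the equality hypothesis (MyKey is illustrative; the fields mirror the data above):

    case class MyKey(uname: String, lo: String, f1: Char)

    val a = MyKey("p1", "lo1", '8')
    val b = MyKey("p1", "lo1", '8')
    // case classes get structural equals/hashCode; both must agree
    // for groupByKey's hash partitioning to group correctly
    assert(a == b && a.hashCode == b.hashCode)

If the check passes locally but grouping still splits, it may also be worth checking whether the case class is defined inside the spark-shell REPL; REPL-defined case classes were a known source of key-equality surprises (see SPARK-2620).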

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Yanbo Liang
Hi Tomasz, The limitation will not be changed, and you will find that all the models reference the SparkContext in the new Spark ML package. It makes the Python API simple to implement. But it does not mean you can only call this function on local data; you can operate this function on an RDD

Re: copy/mv hdfs file to another directory by spark program

2016-01-04 Thread ayan guha
My guess is no, unless you are okay with reading the data and writing it back again. On Tue, Jan 5, 2016 at 2:07 PM, Zhiliang Zhu wrote: > > For some file on hdfs, it is necessary to copy/move it to some another > specific hdfs directory, and the directory name would keep

problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-04 Thread Andy Davidson
I am having a heck of a time writing a simple Transformer in Java. I assume that my Transformer is supposed to append a new column to the DataFrame argument. Any idea why I get the following exception in Java 8 when I try to call DataFrame.withColumn()? The JavaDoc says withColumn() "Returns a new

Re: stopping a process using an RDD

2016-01-04 Thread Michael Segel
Not really a good idea. It breaks the paradigm. If I understand the OP’s idea… they want to halt processing the RDD, but not the entire job. So when it hits a certain condition, it will stop that task yet continue on to the next RDD. (Assuming you have more RDDs or partitions than you have

Re: stopping a process using an RDD

2016-01-04 Thread Daniel Darabos
You can cause a failure by throwing an exception in the code running on the executors. The task will be retried (if spark.task.maxFailures > 1), and then the stage is failed. No further tasks are processed after that, and an exception is thrown on the driver. You could catch the exception and see
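A sketch of that pattern (StopEarly, shouldStop, and process are hypothetical names; exceptions thrown on executors surface on the driver wrapped in a SparkException):

    import org.apache.spark.SparkException

    class StopEarly extends Exception("stop condition met")

    val rdd = sc.parallelize(1 to 1000000)    // placeholder data
    def shouldStop(x: Int): Boolean = x == 42 // placeholder condition
    def process(x: Int): Unit = ()            // placeholder work

    try {
      rdd.foreach { x =>
        if (shouldStop(x)) throw new StopEarly // fails the task, then the stage
        process(x)
      }
    } catch {
      case e: SparkException =>
        // inspect e (message/cause) to tell StopEarly from a real failure
    }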

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Tomasz Fruboes
Hi Yanbo, thanks for the info. Is it likely to change in the (near :) ) future? The ability to call this function only on local data (i.e. not on an RDD) seems to be a rather serious limitation. cheers, Tomasz On 02.01.2016 09:45, Yanbo Liang wrote: Hi Tomasz, The GMM is bound to the peer Java GMM

Trying to run GraphX ConnectedComponents for large data with out success

2016-01-04 Thread Dagan, Arnon
While trying to run a Spark job with Spark 1.5.1, using the following parameters: --master "yarn" --deploy-mode "cluster" --num-executors 200 --driver-memory 14G --executor-memory 14G --executor-cores 1. Trying to run GraphX ConnectedComponents on large data (~4TB) using the following commands:
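For reference, the basic shape of such a run (a sketch; the edge-list path is a placeholder):

    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    // assigns each vertex the smallest vertex id in its component
    val cc = graph.connectedComponents().vertices
    cc.take(10).foreach(println)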

unsubscribe

2016-01-04 Thread Irvin
unsubscribe -- Thanks & Best Regards

[discuss] dropping Python 2.6 support

2016-01-04 Thread Reynold Xin
Does anybody here care about us dropping support for Python 2.6 in Spark 2.0? Python 2.6 is ancient and is pretty slow in many aspects (e.g. JSON parsing) when compared with Python 2.7. Some libraries that Spark depends on have stopped supporting 2.6. We can still convince the library maintainers to

Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-04 Thread Michael Armbrust
It's not really possible to convert an RDD to a Column. You can think of a Column as an expression that produces a single output given some set of input columns. If I understand your code correctly, I think this might be easier to express as a UDF: sqlContext.udf().register("stem", new
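In Scala, that suggestion would look roughly like this (stemText, the column names, and df, standing for the input DataFrame, are placeholders for whatever the Transformer was doing):

    import org.apache.spark.sql.functions.callUDF

    def stemText(s: String): String = s.toLowerCase // placeholder logic

    sqlContext.udf.register("stem", stemText _)

    // append a new column computed from an existing one
    val out = df.withColumn("stemmed", callUDF("stem", df("text")))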

Security authentication interface for Spark

2016-01-04 Thread jiehua
Hi All, We are using Spark 1.4.1/1.5.2 in standalone mode and would like to add third-party user authentication for Spark. We found that for batch submission (cluster mode, but not RESTful), there was Akka authentication (a security cookie ensuring both sides are identical) while the client connects to

[ANNOUNCE] Announcing Spark 1.6.0

2016-01-04 Thread Michael Armbrust
Hi All, Spark 1.6.0 is the seventh release on the 1.x line. This release includes patches from 248+ contributors! To download Spark 1.6.0 visit the downloads page. (It may take a while for all mirrors to update.) A huge thanks go to all of the individuals and organizations involved in

Re: Is Spark 1.6 released?

2016-01-04 Thread Jean-Baptiste Onofré
It's now OK: Michael published and announced the release. Sorry for the delay. Regards JB On 01/04/2016 10:06 AM, Jung wrote: Hi There were Spark 1.6 jars in maven central and github. I found it 5 days ago. But it doesn't appear on Spark website now. May I regard Spark 1.6 zip file in github

Re: email not showing up on the mailing list

2016-01-04 Thread Mattmann, Chris A (3980)
Moving user-owner to BCC. Hi Daniel, please: 1. send an email to user-subscr...@spark.apache.org. Wait for an automated reply that should let you know how to finish subscribing. 2. once done, post email to user@spark.apache.org from your email that you subscribed with in 1 and it should work

Re: Spark 1.4 RDD to DF fails with toDF()

2016-01-04 Thread Fab
Good catch, thanks. Things work now after changing the version. For reference, I got the "2.11" version from my separate download of Scala: $ scala Welcome to Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67). But my Spark installation is indeed running Scala "2.10": spark-shell

subscribe

2016-01-04 Thread Suresh Thalamati

Spark Job Server with Yarn and Kerberos

2016-01-04 Thread Mike Wright
Has anyone used Spark Job Server on a "kerberized" cluster in YARN-Client mode? When Job Server contacts the YARN resource manager, we see a "Cannot impersonate root" error and am not sure what we have misconfigured. Thanks. ___ *Mike Wright* Principal Architect, Software