Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-18 Thread doctorx
Hi, I am facing the same issue, with the error: ERROR Master:75 - Leadership has been revoked -- master shutting down. Can anybody help? Any clue will be useful. Should I change something in the Spark cluster or in ZooKeeper? Is there any setting in Spark which can help me? Thanks & Regards
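For reference, standalone HA with ZooKeeper relies on a few recovery properties being set on every master, usually via SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh. A minimal sketch, with placeholder ZooKeeper hosts:

    spark.deploy.recoveryMode   ZOOKEEPER
    spark.deploy.zookeeper.url  zk1:2181,zk2:2181,zk3:2181
    spark.deploy.zookeeper.dir  /spark

A "Leadership has been revoked" shutdown usually means the master lost its ZooKeeper session (network problems or ZooKeeper timeouts are worth checking); the standby master is expected to take over leadership when that happens.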

Calling SparkContext methods in scala Future

2016-01-18 Thread Marco
Hello, I am using Spark 1.5.1 within SBT with Scala 2.10.6, and I am facing an issue with the SparkContext. Basically, I have an object that needs to do several things: - call an external service One (web api) - call an external service Two (another api) - read and produce an RDD from HDFS

Re: SparkContext SyntaxError: invalid syntax

2016-01-18 Thread Andrew Weiner
Hi Felix, Yeah, when I try to build the docs using jekyll build, I get a LoadError (cannot load such file -- pygments) and I'm having trouble getting past it at the moment. From what I could tell, this does not apply to YARN in client mode. I was able to submit jobs in client mode and they

Calling SparkContext methods in scala Future

2016-01-18 Thread makronized
I am using Spark 1.5.1 within SBT with Scala 2.10.6, and I am writing because I am facing an issue with the SparkContext. Basically, I have an object that needs to do several things: - call an external service One (web api) - call an external service Two (another api) - read and produce an RDD

Re: Using JDBC clients with "Spark on Hive"

2016-01-18 Thread Ricardo Paiva
Are you running the Spark Thrift JDBC/ODBC server? In my environment I have a Hive Metastore server and the Spark Thrift Server pointing to the Hive Metastore. I use the Hive beeline tool for testing. With this setup I'm able to use Tableau connecting to Hive tables and using Spark SQL as the
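The setup described comes down to two documented commands; a rough sketch, with the host and port as placeholders:

    ./sbin/start-thriftserver.sh --master yarn --hiveconf hive.server2.thrift.port=10000
    ./bin/beeline -u jdbc:hive2://thriftserver-host:10000

The Thrift server picks up hive-site.xml from Spark's conf directory, so beeline (and Tableau) end up querying the same Hive Metastore tables, with Spark SQL doing the execution.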

Re: Spark Streaming: BatchDuration and Processing time

2016-01-18 Thread Ricardo Paiva
If you are using Kafka as the message queue, Spark will process the time slices accordingly, even if it is late, as in your example. But it will fail at some point, because your process will ask for a message that is older than the oldest message in Kafka. If your process takes longer than

Re: Spark streaming: Fixed time aggregation & handling driver failures

2016-01-18 Thread Ricardo Paiva
I don't know if this is the most efficient way to do that, but you can use a sliding window that is bigger than your aggregation period and filter only the messages inside the period. Remember that to work with reduceByKeyAndWindow you need to associate each row with the time key; in your
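A rough sketch of that idea in Scala, assuming messages is a DStream[(String, Long)] of (user, value) pairs and every record carries (or can be assigned) an event time; all names are illustrative:

    import org.apache.spark.streaming.Seconds

    val periodMs = 60 * 1000L  // fixed one-minute aggregation period
    val keyedByPeriod = messages.map { case (user, value) =>
      val eventTime = System.currentTimeMillis()            // substitute the record's own timestamp here
      val periodStart = (eventTime / periodMs) * periodMs   // truncate to the period boundary
      ((periodStart, user), value)
    }
    // window larger than the period; downstream, keep only periods that are fully closed
    val summedPerPeriod = keyedByPeriod.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, Seconds(120), Seconds(60))

Because the period start is part of the key, late records that still fall inside the window are added to the right bucket; for driver failures the usual advice is to enable checkpointing so the window state survives a restart.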

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
Hi, How do you know it doesn't work? The log looks roughly normal to me. Is Spark not running at the printed address? Can you not start jobs? On Mon, Jan 18, 2016 at 11:51 AM, Oleg Ruchovets wrote: > Hi, > I tried to follow the Spark 1.6.0 guide to install Spark on EC2. > >

Re: Is there a test like MiniCluster example in Spark just like hadoop ?

2016-01-18 Thread Ted Yu
Please refer to the following suites: yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala Cheers On Mon, Jan 18, 2016 at 2:14 AM, zml张明磊 wrote: > Hello, > > > >
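Besides those suites, a lightweight option is the local-cluster master URL (used by SparkListenerWithClusterSuite), which simulates a standalone cluster with separate worker processes from within a single test; a small sketch:

    import org.apache.spark.{SparkConf, SparkContext}

    // format: local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
    val conf = new SparkConf()
      .setMaster("local-cluster[2, 1, 1024]")
      .setAppName("mini-cluster-style-test")
    val sc = new SparkContext(conf)
    try {
      assert(sc.parallelize(1 to 100, 4).sum() == 5050)
    } finally {
      sc.stop()
    }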

Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-18 Thread Ted Yu
Can you pastebin the master log before the error showed up? The initial message was posted for Spark 1.2.0. Which release of Spark / ZooKeeper do you use? Thanks On Mon, Jan 18, 2016 at 6:47 AM, doctorx wrote: > Hi, > I am facing the same issue, with the error >

spark ml Dataframe vs Labeled Point RDD Mllib speed

2016-01-18 Thread jarias
Hi all, I've been recently playing with the ml API in Spark 1.6.0, as I'm in the process of implementing a series of new classifiers for my PhD thesis. There are some questions that have arisen regarding the scalability of the different data pipelines that can be used to load the training datasets

RE: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

2016-01-18 Thread Siddharth Ubale
Hi, Thanks for pointing out the Phoenix discrepancy! I was using a Phoenix 4.4 jar built for the HBase 1.1 release; however, I was using HBase 0.98. Have fixed the above issue. I am still unable to go ahead with the streaming job in cluster mode, with the following trace: Application

Re: Error in Spark Executors when trying to read HBase table from Spark with Kerberos enabled

2016-01-18 Thread Vinay Kashyap
Hi Guys, Any help regarding this issue? On Wed, Jan 13, 2016 at 6:39 PM, Vinay Kashyap wrote: > Hi all, > > I am using *Spark 1.5.1 in YARN cluster mode in CDH 5.5.* > I am trying to create an RDD by reading an HBase table with Kerberos enabled. > I am able to launch

spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
Hi, I tried to follow the Spark 1.6.0 guide to install Spark on EC2. It doesn't work properly: I got exceptions, and at the end only a standalone Spark cluster was installed. Here is the log information. Any suggestions? Thanks Oleg. oleg@robinhood:~/install/spark-1.6.0-bin-hadoop2.6/ec2$ ./spark-ec2

Is there a test like MiniCluster example in Spark just like hadoop ?

2016-01-18 Thread zml张明磊
Hello, I want to find some test files in Spark which provide the same functionality as the Hadoop MiniCluster test environment, but I cannot find them. Does anyone know about that?

Re: Spark Streaming on mesos

2016-01-18 Thread Iulian Dragoș
On Mon, Nov 30, 2015 at 4:09 PM, Renjie Liu wrote: > Hi, Lulian: > Please, it's Iulian, not Lulian. > Are you sure that it'll be a long running process in fine-grained mode? I > think you have a misunderstanding about it. An executor will be launched > for some tasks,

Extracting p values in Logistic regression using mllib scala

2016-01-18 Thread Chandan Verma
Hi, Can anyone help me to extract p-values from a logistic regression model using mllib and scala. Thanks Chandan Verma

[Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
Hi spark users and developers, what do you do if you want the from_unixtime function in spark sql to return the timezone you want instead of the system timezone? Best Regards, Jerry

Re: simultaneous actions

2016-01-18 Thread Debasish Das
Simultaneous actions work fine on a cluster if they are independent... on local I never paid attention, but the code path should be similar... On Jan 18, 2016 8:00 AM, "Koert Kuipers" wrote: > stacktrace? details? > > On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom

Spark SQL create table

2016-01-18 Thread raghukiran
Is creating a table using the SparkSQLContext currently supported? Regards, Raghu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Calling SparkContext methods in scala Future

2016-01-18 Thread Ted Yu
externalCallTwo map { dataTwo => println("in map") // prints, so it gets here ... val rddOne = sparkContext.parallelize(dataOne) I don't think you should call method on sparkContext in map function. sparkContext lives on driver side. Cheers On Mon, Jan 18, 2016 at 6:27 AM,
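One way to restructure it (a sketch only, with the call signatures assumed rather than taken from the original code): resolve the external futures on the driver first, then touch the SparkContext, and stop it only after all work has finished.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    val dataOneF: Future[Seq[Int]] = externalCallOne()   // assumed signature
    val dataTwoF: Future[Seq[Int]] = externalCallTwo()   // assumed signature

    val dataOne = Await.result(dataOneF, 5.minutes)      // block on the driver thread
    val dataTwo = Await.result(dataTwoF, 5.minutes)

    val rddOne = sparkContext.parallelize(dataOne)       // SparkContext used only on the driver
    val rddTwo = sparkContext.parallelize(dataTwo)
    // ... run actions on the RDDs ...
    sparkContext.stop()                                  // only after everything above has completed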

Re: Serializing DataSets

2016-01-18 Thread Michael Armbrust
What error? On Mon, Jan 18, 2016 at 9:01 AM, Simon Hafner wrote: > And for deserializing, > `sqlContext.read.parquet("path/to/parquet").as[T]` and catch the > error? > > 2016-01-14 3:43 GMT+08:00 Michael Armbrust : > > Yeah, thats the best way for

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
On Mon, Jan 18, 2016 at 5:24 PM, Oleg Ruchovets wrote: > I thought script tries to install hadoop / hdfs also. And it looks like it > failed. Installation is only standalone spark without hadoop. Is it correct > behaviour? > Yes, it also sets up two HDFS clusters. Are they

Re: Spark SQL create table

2016-01-18 Thread Ted Yu
Have you taken a look at sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ? You can find examples there. On Mon, Jan 18, 2016 at 9:57 AM, raghukiran wrote: > Is creating a table using the SparkSQLContext currently supported? > > Regards, > Raghu > > > >

Re: Spark SQL create table

2016-01-18 Thread Ted Yu
Please take a look at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala On Mon, Jan 18, 2016 at 9:57 AM, raghukiran wrote: > Is creating a table using the SparkSQLContext currently supported? > > Regards, > Raghu > > > > -- > View this message in

Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
This requires Hive to be installed and uses HiveContext, right? What is the SparkSQLContext useful for? On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu wrote: > Please take a look > at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala > > On Mon, Jan 18,

Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
Btw, Thanks a lot for all your quick responses - it is very useful and definitely appreciate it :-) On Mon, Jan 18, 2016 at 1:28 PM, Raghu Ganti wrote: > This requires Hive to be installed and uses HiveContext, right? > > What is the SparkSQLContext useful for? > > On Mon,

Re: Spark SQL create table

2016-01-18 Thread Ted Yu
By SparkSQLContext, I assume you mean SQLContext. From the doc for SQLContext#createDataFrame(): * dataFrame.registerTempTable("people") * sqlContext.sql("select name from people").collect.foreach(println) If you want to persist the table externally, you need Hive, etc. Regards On Mon, Jan
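A minimal sketch of what plain SQLContext supports without a Hive install: register a DataFrame as a temporary table and query it (the table only lives for the lifetime of the SQLContext):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = sc.parallelize(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)

Persistent CREATE TABLE statements go through HiveContext and a Hive metastore, as noted above.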

Re: Calling SparkContext methods in scala Future

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey Marco, Since the code in the Future runs asynchronously, you cannot call "sparkContext.stop" at the end of "fetch", because the code in the Future may not have finished. However, the exception seems weird. Do you have a simple reproducer? On Mon, Jan 18, 2016 at 9:13 AM, Ted Yu

Re: Spark Streaming: Does mapWithState implicitly partition the dsteram?

2016-01-18 Thread Shixiong(Ryan) Zhu
mapWithState uses HashPartitioner by default. You can use "StateSpec.partitioner" to set your custom partitioner. On Sun, Jan 17, 2016 at 11:00 AM, Lin Zhao wrote: > When the state is passed to the task that handles a mapWithState for a > particular key, if the key is
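A sketch of wiring a custom partitioner into mapWithState with the Spark 1.6 API; keyedStream below is an assumed DStream[(String, Int)], and the state function is just a running sum:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.{State, StateSpec}

    def trackState(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      val sum = state.getOption.getOrElse(0) + value.getOrElse(0)
      state.update(sum)                        // state for this key is kept across batches
      (key, sum)
    }

    val spec = StateSpec
      .function(trackState _)
      .partitioner(new HashPartitioner(32))    // overrides the default HashPartitioner

    val stateStream = keyedStream.mapWithState(spec)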

Spark Summit East - Full Schedule Available

2016-01-18 Thread Scott walent
Join the Apache Spark community at the 2nd annual Spark Summit East from February 16-18, 2016 in New York City. We will kick things off with a Spark update from Matei Zaharia followed by over 60 talks that were selected by the program committee. The agenda this year includes enterprise talks from

How to call a custom function from GroupByKey which takes Iterable[Row] as input and returns a Map[Int,String] as output in scala

2016-01-18 Thread Neha Mehta
Hi, I have a scenario wherein my dataset has around 30 columns. It is basically user activity information. I need to group the information by each user and then for each column/activity parameter I need to find the percentage affinity for each value in that column for that user. Below is the
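A rough sketch of the shape such a function could take, assuming the rows are already grouped by user id and the column count is known; all names are illustrative:

    import org.apache.spark.sql.Row

    def columnAffinities(rows: Iterable[Row], numCols: Int): Map[Int, String] = {
      val total = rows.size.toDouble
      (0 until numCols).map { col =>
        val affinity = rows
          .map(_.get(col))                                        // all values of this column for the user
          .groupBy(identity)
          .map { case (v, occ) => s"$v:${occ.size / total * 100}%" }
          .mkString(",")
        col -> affinity
      }.toMap
    }

    // usage sketch: df.rdd.groupBy(_.getString(userIdCol)).mapValues(columnAffinities(_, 30))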

Re: simultaneous actions

2016-01-18 Thread Mennour Rostom
Hi, I am running my app on a single machine first before moving it to the cluster; actually simultaneous actions are not working for me now; is this coming from the fact that I am using a single machine? Yet I am using the FAIR scheduler. 2016-01-17 21:23 GMT+01:00 Mark Hamstra

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
I thought the script tries to install Hadoop / HDFS also, and it looks like it failed. The installation is only standalone Spark without Hadoop. Is that correct behaviour? Also, errors in the log: ERROR: Unknown Tachyon version Error: Could not find or load main class crayondata.com.log Thanks Oleg.

Re: simultaneous actions

2016-01-18 Thread Koert Kuipers
stacktrace? details? On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom wrote: > Hi, > > I am running my app on a single machine first before moving it to the > cluster; actually simultaneous actions are not working for me now; is this > coming from the fact that I am using a

spark random forest regressor : argument minInstancesPerNode not accepted

2016-01-18 Thread Christopher Bourez
Dears, I'm trying to set the parameter 'minInstancesPerNode', it sounds like it is not working : model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},numTrees=2500, featureSubsetStrategy="sqrt",impurity='variance',minInstancesPerNode=1000) Traceback (most recent call

Re: Serializing DataSets

2016-01-18 Thread Simon Hafner
And for deserializing, `sqlContext.read.parquet("path/to/parquet").as[T]` and catch the error? 2016-01-14 3:43 GMT+08:00 Michael Armbrust : > Yeah, thats the best way for now (note the conversion is purely logical so > there is no cost of calling toDF()). We'll likely be

Re: Incorrect timeline for Scheduling Delay in Streaming page in web UI?

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey, did you mean that the scheduling delay timeline is incorrect because it's too short and some values are missing? A batch won't have a scheduling delay until it starts to run. In your example, a lot of batches are waiting so that they don't have the scheduling delay. On Sun, Jan 17, 2016 at

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
Look at to_utc_timestamp from_utc_timestamp On Jan 18, 2016 9:39 AM, "Jerry Lam" wrote: > Hi spark users and developers, > > what do you do if you want the from_unixtime function in spark sql to > return the timezone you want instead of the system timezone? > > Best

PySpark Broadcast of User Defined Class No Work?

2016-01-18 Thread efwalkermit
Should I be able to broadcast a fairly simple user-defined class? I'm having no success in 1.6.0 (or 1.5.2): $ cat test_spark.py import pyspark class SimpleClass: def __init__(self): self.val = 5 def get(self): return self.val def main(): sc =

Re: PySpark Broadcast of User Defined Class No Work?

2016-01-18 Thread Maciej Szymkiewicz
Python can pickle only objects, not classes. It means that SimpleClass has to be importable on every worker node to enable correct deserialization. Typically it means keeping class definitions in a separate module and distributing it using, for example, --py-files. On 01/19/2016 12:34 AM, efwalkermit

Re: trouble using eclipse to view spark source code

2016-01-18 Thread Jakob Odersky
Have you followed the guide on how to import spark into eclipse https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse ? On 18 January 2016 at 13:04, Andy Davidson wrote: > Hi > > My project is implemented using Java

Re: Spark + Sentry + Kerberos don't add up?

2016-01-18 Thread Ruslan Dautkhanov
Hi Romain, Thank you for your response. Adding Kerberos support might be as simple as https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal and --keytab parameters to be passed to spark-submit. As a workaround I just did kinit (using Hue's keytab) and then launched Livy Server.

Re: trouble using eclipse to view spark source code

2016-01-18 Thread Annabel Melongo
Andy, This has nothing to do with Spark, but I guess you don't have the proper Scala version. The version you're currently running doesn't recognize a method in Scala ArrayOps, namely: scala.collection.mutable.ArrayOps.$colon$plus On Monday, January 18, 2016 7:53 PM, Andy Davidson

Re: trouble using eclipse to view spark source code

2016-01-18 Thread Andy Davidson
Many thanks. I was using a different Scala plug-in; this one seems to work better. I no longer get compile errors, however I get the following stack trace when I try to run my unit tests with MLlib open. I am still using Eclipse Luna. Andy java.lang.NoSuchMethodError:

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
It looks like Spark is not working fine. I followed this link (http://spark.apache.org/docs/latest/ec2-scripts.html) and I see spot instances installed on EC2. From the spark shell I am counting lines and got a connection exception. scala> val lines = sc.textFile("README.md") scala> lines.count()

rdd.foreach return value

2016-01-18 Thread charles li
code snippet: the 'print' actually prints info on the worker node, but I am confused about where the 'return' value goes, for I get nothing on the driver node. -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

Re: rdd.foreach return value

2016-01-18 Thread David Russell
The foreach operation on RDD has a void (Unit) return type. See attached. So there is no return value to the driver. David "All that is gold does not glitter, Not all those who wander are lost." Original Message Subject: rdd.foreach return value Local Time: January 18 2016

Re: rdd.foreach return value

2016-01-18 Thread Ted Yu
Here is the signature for foreach: def foreach(f: T => Unit): Unit = withScope { I don't think you can return an element in the way shown in the snippet. On Mon, Jan 18, 2016 at 7:34 PM, charles li wrote: > code snippet > > > the 'print' actually print info on the worker
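A small illustration of the point: foreach is Unit-valued and runs on the executors, so anything "returned" from the function is simply discarded; to get values back on the driver, transform and collect instead.

    val rdd = sc.parallelize(1 to 5)

    rdd.foreach { x => x * 10 }               // the function's result is thrown away
    val scaled = rdd.map(_ * 10).collect()    // Array(10, 20, 30, 40, 50), back on the driver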

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Peter Zhang
Could you run spark-shell from the $SPARK_HOME dir? You can try to change your command to run at $SPARK_HOME, or point to README.md with its full path. Peter Zhang -- Google Sent with Airmail On January 19, 2016 at 11:26:14, Oleg Ruchovets (oruchov...@gmail.com) wrote: It looks like Spark is not working fine

Spark Streaming - Latest batch-time can't keep up with current time

2016-01-18 Thread Collin Shi
Hi all, After having submitted the job, the latest batch-time is almost the same as the current time at first. Let's say, if the current time is '12:00:00', then the latest batch-time would be '11:59:59'. But as time goes on, the difference gets greater and greater. For instance, the current time is

SparkR with Hive integration

2016-01-18 Thread Peter Zhang
Hi all, http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes From Hive tables You can also create SparkR DataFrames from Hive tables. To do this we will need to create a HiveContext which can access tables in the Hive MetaStore. Note that Spark should have been built with Hive

Re: SparkR with Hive integration

2016-01-18 Thread Peter Zhang
Thanks,  I will try. Peter --  Google Sent with Airmail On January 19, 2016 at 12:44:46, Jeff Zhang (zjf...@gmail.com) wrote: Please make sure you export environment variable HADOOP_CONF_DIR which contains the core-site.xml On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang

Re: rdd.foreach return value

2016-01-18 Thread charles li
hi, great thanks to david and ted, I know that the content of an RDD can be returned to the driver using the 'collect' method. but my question is: 1. since we can write any code we like in the function passed to 'foreach', what happens when we actually write a 'return' statement in the foreach function?

Re: rdd.foreach return value

2016-01-18 Thread Ted Yu
For #2, RDD is immutable. > On Jan 18, 2016, at 8:10 PM, charles li wrote: > > > hi, great thanks to david and ted, I know that the content of RDD can be > returned to driver using 'collect' method. > > but my question is: > > > 1. cause we can write any code we

Re: rdd.foreach return value

2016-01-18 Thread Vishal Maru
1. foreach doesn't expect any value from the function being passed (your func_foreach), so nothing happens. The return values are just lost; it's like calling a function without saving the return value to another var. foreach also doesn't return anything, so you don't get a modified RDD (like map*). 2.

Re: rdd.foreach return value

2016-01-18 Thread charles li
got it, great thanks, Vishal, Ted and David On Tue, Jan 19, 2016 at 1:10 PM, Vishal Maru wrote: > 1. foreach doesn't expect any value from function being passed (in your > func_foreach). so nothing happens. The return values are just lost. it's > like calling a function

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
I am running from $SPARK_HOME. It looks like a connection problem to port 9000. It is on the master machine. What is this process that Spark tries to connect to? Should I start any framework or processes before executing Spark? Thanks Oleg. 16/01/19 03:17:56 INFO ipc.Client: Retrying connect to

Re: rdd.foreach return value

2016-01-18 Thread charles li
thanks, david and ted, I know that the content of an RDD can be returned to the driver using `collect`. On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu wrote: > Here is signature for foreach: > def foreach(f: T => Unit): Unit = withScope { > > I don't think you can return element in

Re: SparkR with Hive integration

2016-01-18 Thread Jeff Zhang
Please make sure you export environment variable HADOOP_CONF_DIR which contains the core-site.xml On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang wrote: > Hi all, > > http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes > From Hive tables >

Re: Spark 1.6.0, yarn-shuffle

2016-01-18 Thread johd
Hi, No, I have not. :-/ Regards, J -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-0-yarn-shuffle-tp25961p26002.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: rdd.foreach return value

2016-01-18 Thread Darren Govoni
What's the rationale behind that? It certainly limits the kind of flow logic we can do in one statement. Sent from my Verizon Wireless 4G LTE smartphone Original message From: David Russell Date: 01/18/2016 10:44 PM (GMT-05:00) To:

Re: Number of CPU cores for a Spark Streaming app in Standalone mode

2016-01-18 Thread radoburansky
I am adding an answer from SO: http://stackoverflow.com/questions/34861947/read-more-kafka-topics-than-number-of-cpu-cores -- View this message in context:

RE: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread vivek.meghanathan
Have you verified the spark master/slaves are started correctly? Please check using the netstat command whether the ports are open and listening, and which address they bind to, etc. From: Oleg Ruchovets [mailto:oruchov...@gmail.com] Sent: 19 January 2016 11:24 To: Peter Zhang Cc:

when kerberos is enabled in hdp, spark does not work

2016-01-18 Thread 李振
When Kerberos is enabled in HDP, Spark does not work; the error is as follows: Traceback (most recent call last): File "/home/lizhen/test.py", line 27, in abc = raw_data.count() File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1006,

a problem about using UDF at sparksql

2016-01-18 Thread ??????
Hi, everyone. Can anyone help have a look at this problem? I am migrating from Hive to Spark SQL, but I encountered a problem: the temporary UDF function which extends GenericUDF can be created, and you can describe this UDF function, but when I use this function, an exception occurs. I

using spark context in map function Task not serializable error

2016-01-18 Thread gpatcham
Hi, I have a use case where I need to pass the SparkContext in a map function: reRDD.map(row => method1(row, sc)).saveAsTextFile(outputDir) method1 needs the SparkContext to query Cassandra, but I see the below error: java.io.NotSerializableException: org.apache.spark.SparkContext Is there a way we can fix

Re: using spark context in map function Task not serializable error

2016-01-18 Thread Ted Yu
Can you pass the properties which are needed for accessing Cassandra without going through SparkContext ? SparkContext isn't designed to be used in the way illustrated below. Cheers On Mon, Jan 18, 2016 at 12:29 PM, gpatcham wrote: > Hi, > > I have a use case where I need

Re: Incorrect timeline for Scheduling Delay in Streaming page in web UI?

2016-01-18 Thread Jacek Laskowski
Hi Ryan, Ah, you might be right! I didn't think about the batches queued up (and hence without scheduling delay since they're not started yet). Thanks a lot for responding! Pozdrawiam, Jacek Jacek Laskowski | https://medium.com/@jaceklaskowski/ Mastering Apache Spark ==>

Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
Great, I got that to work following your example! Thanks. A followup question is: If I had a custom SQL type (UserDefinedType), how can I map it to this type from the RDD in the DataFrame? Regards On Mon, Jan 18, 2016 at 1:35 PM, Ted Yu wrote: > By SparkSQLContext, I

is recommendProductsForUsers available in ALS?

2016-01-18 Thread Roberto Pagliari
With Spark 1.5, the following code: from pyspark import SparkContext, SparkConf from pyspark.mllib.recommendation import ALS, Rating r1 = (1, 1, 1.0) r2 = (1, 2, 2.0) r3 = (2, 1, 2.0) ratings = sc.parallelize([r1, r2, r3]) model = ALS.trainImplicit(ratings, 1, seed=10)

Re: 答复: 答复: 答复: 答复: spark streaming context trigger invoke stop why?

2016-01-18 Thread Shixiong(Ryan) Zhu
I see. There is a bug in 1.4.1 that a thread pool is not set the daemon flag for threads ( https://github.com/apache/spark/commit/346209097e88fe79015359e40b49c32cc0bdc439#diff-25124e4f06a1da237bf486eceb1f7967L47 ) So in 1.4.1, even if your main thread exits, threads in the thread pool is still

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
Thanks Alex: So you suggested something like: from_utc_timestamp(to_utc_timestamp(from_unixtime(1389802875),'America/Montreal'), 'America/Los_Angeles')? This is a lot of conversion :) Is there a particular reason not to have from_unixtime take timezone information? I think I will make a UDF
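One way such a UDF could look (a sketch only, not a tested implementation): format the epoch seconds with an explicit time zone rather than relying on the JVM default.

    import java.text.SimpleDateFormat
    import java.util.{Date, TimeZone}

    sqlContext.udf.register("from_unixtime_tz", (epochSeconds: Long, tz: String) => {
      val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      fmt.setTimeZone(TimeZone.getTimeZone(tz))        // e.g. "America/Montreal"
      fmt.format(new Date(epochSeconds * 1000L))
    })

    // sqlContext.sql("SELECT from_unixtime_tz(1389802875, 'America/Montreal')").show()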

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
If you can find a function in Oracle or MySQL or Postgres which works better, then we can create a similar one. Timezone conversion is tricky because of daylight saving time, so it is better to use UTC without DST in the database/DW. On Jan 18, 2016 1:24 PM, "Jerry Lam" wrote: >

Re: using spark context in map function Task not serializable error

2016-01-18 Thread Giri P
I'm using spark cassandra connector to do this and the way we access cassandra table is sc.cassandraTable("keySpace", "tableName") Thanks Giri On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu wrote: > Can you pass the properties which are needed for accessing Cassandra > without
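If the per-row lookups are kept, one common pattern with that connector (a sketch, assuming the spark-cassandra-connector classes are on the classpath) is to ship the serializable CassandraConnector into the closure instead of the SparkContext:

    import com.datastax.spark.connector.cql.CassandraConnector

    val connector = CassandraConnector(sc.getConf)   // serializable, safe to capture
    val enriched = reRDD.mapPartitions { rows =>
      connector.withSessionDo { session =>
        rows.map { row =>
          // look this row up in Cassandra via `session` instead of via sc
          row
        }.toList.iterator                            // materialize before the session closes
      }
    }
    enriched.saveAsTextFile(outputDir)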

Re: using spark context in map function Task not serializable error

2016-01-18 Thread Giri P
Can we use @transient ? On Mon, Jan 18, 2016 at 12:44 PM, Giri P wrote: > I'm using spark cassandra connector to do this and the way we access > cassandra table is > > sc.cassandraTable("keySpace", "tableName") > > Thanks > Giri > > On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu

Re: building spark 1.6 throws error Rscript: command not found

2016-01-18 Thread Ted Yu
Please see: http://www.jason-french.com/blog/2013/03/11/installing-r-in-linux/ On Mon, Jan 18, 2016 at 1:22 PM, Mich Talebzadeh wrote: > ./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 > -Phive -Phive-thriftserver -Pyarn > > > > > > INFO] ---

Number of CPU cores for a Spark Streaming app in Standalone mode

2016-01-18 Thread radoburansky
I somehow don't want to believe this waste of resources. Is it really true that if I have 20 input streams I must have at least 21 CPU cores? Even if I read only once per minute and only a few messages? I still hope that I miss an important information. Thanks a lot -- View this message in

trouble using eclipse to view spark source code

2016-01-18 Thread Andy Davidson
Hi My project is implemented using Java 8 and Python. Sometimes it is handy to look at the Spark source code. For an unknown reason, if I open a Spark project, my Java projects show tons of compiler errors. I think it may have something to do with Scala. If I close the projects, my Java code is fine.

Re: using spark context in map function Task not serializable error

2016-01-18 Thread Ted Yu
Did you mean constructing SparkContext on the worker nodes ? Not sure whether that would work. Doesn't seem to be good practice. On Mon, Jan 18, 2016 at 1:27 PM, Giri P wrote: > Can we use @transient ? > > > On Mon, Jan 18, 2016 at 12:44 PM, Giri P

Re: using spark context in map function Task not serializable error

2016-01-18 Thread Giri P
yes I tried doing that but that doesn't work. I'm looking at using SQLContext and DataFrames. Is SQLContext serializable? On Mon, Jan 18, 2016 at 1:29 PM, Ted Yu wrote: > Did you mean constructing SparkContext on the worker nodes ? > > Not sure whether that would work. > >

Re: Number of CPU cores for a Spark Streaming app in Standalone mode

2016-01-18 Thread Tathagata Das
If you are using receiver-based input streams, then you have to dedicate 1 core to each receiver. If you read only once per minute on each receiver, than consider consolidating the data reading pipeline such that you can use fewer receivers. On Mon, Jan 18, 2016 at 12:13 PM, radoburansky

Re: has any one implemented TF_IDF using ML transformers?

2016-01-18 Thread Andy Davidson
Hi Yanbo I am using 1.6.0. I am having a hard time trying to figure out what the exact equation is. I do not know Scala. I took a look at the source code URL you provided. I do not know Scala. override def transform(dataset: DataFrame): DataFrame = { transformSchema(dataset.schema, logging =
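A short Scala sketch of the ml TF-IDF pipeline in 1.6, with docsDF assumed to be a DataFrame holding a "text" column; the weighting MLlib's IDF applies is idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t:

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val wordsDF = tokenizer.transform(docsDF)
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
    val featurized = hashingTF.transform(wordsDF)        // term frequencies per document
    val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(featurized)
    val tfidf = idfModel.transform(featurized)           // tf * idf for each term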

building spark 1.6 throws error Rscript: command not found

2016-01-18 Thread Mich Talebzadeh
./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn INFO] --- exec-maven-plugin:1.4.0:exec (sparkr-pkg) @ spark-core_2.10 --- ../R/install-dev.sh: line 40: Rscript: command not found [INFO]

Re: using spark context in map function Task not serializable error

2016-01-18 Thread Ted Yu
class SQLContext private[sql]( @transient val sparkContext: SparkContext, @transient protected[sql] val cacheManager: CacheManager, @transient private[sql] val listener: SQLListener, val isRootContext: Boolean) extends org.apache.spark.Logging with Serializable { FYI On Mon,