Reply: Re: SparkStreaming batch processing time question

2015-03-31 Thread luohui20001
Hmmm, got it. Thank you Akhil. Thanks & best regards! 罗辉 San.Luo ----- Original Message ----- From: Akhil Das To: 罗辉 Cc: user Subject: Re: SparkStreaming batch processing time question Date: 2015-04-01 14:31 It will add scheduling delay for the new batch. The new batch da

Strategy regarding maximum number of executor failures for long running jobs / spark streaming jobs

2015-03-31 Thread twinkle sachdeva
Hi, In Spark over YARN, there is a property "spark.yarn.max.executor.failures" which controls the maximum number of executor failures an application will survive. If the number of executor failures (due to any reason, like OOM or machine failure, etc.) exceeds this value, then the application quits.
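
A minimal sketch of how this property might be set when building the SparkConf for a long-running job over YARN; the value of 100 is purely illustrative, not a recommendation:

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow up to 100 executor failures before the application gives up.
    val conf = new SparkConf()
      .setAppName("long-running-streaming-job")
      .set("spark.yarn.max.executor.failures", "100")
    val sc = new SparkContext(conf)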

Re: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-03-31 Thread Akhil Das
Once you submit the job do a ps aux | grep spark-submit and see how much is the heap space allocated to the process (the -Xmx params), if you are seeing a lower value you could try increasing it yourself with: export _JAVA_OPTIONS="-Xmx5g" Thanks Best Regards On Wed, Apr 1, 2015 at 1:57 AM, Shua

Re: Re: rdd.cache() not working ?

2015-03-31 Thread fightf...@163.com
Hi, That is just the issue. After running person.cache we then run person.count; however, there is still no cache performance shown in the web UI storage tab. Thanks, Sun. fightf...@163.com From: Taotao.Li Date: 2015-04-01 14:02 To: fightfate CC: user Subject: Re: rdd.cache() not working ? r

Re: SparkStreaming batch processing time question

2015-03-31 Thread Akhil Das
It will add scheduling delay for the new batch. The new batch data will be processed after the previous batch finishes; when the delay gets too high, it can sometimes throw fetch failures because the batch data may get removed from memory. Thanks Best Regards On Wed, Apr 1, 2015 at 11:35 AM, wrote:
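
A minimal sketch of the batch-interval setting this thread is about, assuming a socket text stream; the 10-second interval is illustrative and should stay above the observed processing time to avoid the scheduling delay described above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("batch-interval-example")
    // The batch interval should exceed the typical batch processing time,
    // otherwise scheduling delay accumulates as described above.
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()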

Re: Using 'fair' scheduler mode

2015-03-31 Thread Raghavendra Pandey
I am facing the same issue. FAIR and FIFO behaving in the same way. On Wed, Apr 1, 2015 at 1:49 AM, asadrao wrote: > Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the > first query is a very expensive query (ex: ‘select *’ on a really big data > set) than any subsequent

Re: When do map how to get the line number?

2015-03-31 Thread jitesh129
You can use zipWithIndex() to get an index for each record and then you can increment each index by 1. val tf = sc.textFile("test").zipWithIndex() tf.map(s => (s._2 + 1, s._1)) The above should serve your purpose. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/When

SparkStreaming batch processing time question

2015-03-31 Thread luohui20001
hi guys: I got a question when reading http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval. What will happen to the streaming data if the batch processing time is bigger than the batch interval? Will the next batch data be dal

Re: rdd.cache() not working ?

2015-03-31 Thread Taotao.Li
Rerun person.count and you will see the performance of the cache. person.cache does not cache it right away; it will actually cache this RDD after an action [person.count here]. ----- Original Message ----- From: fightf...@163.com To: "user" Sent: Wednesday, April 1, 2015 1:21:25 PM Subject: rdd.cache() not working ?
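
A minimal sketch of the behaviour described above (run in spark-shell, where sc is predefined; the path is a placeholder): cache() only marks the RDD, the first action materializes it, and the speedup shows up on the second action.

    val person = sc.textFile("hdfs:///path/to/person")  // placeholder path
    person.cache()   // only marks the RDD for caching; nothing is stored yet
    person.count()   // first action: reads from disk and populates the cache
    person.count()   // second action: served from the cached partitions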

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Sean Bigdatafun
(resending...) I was thinking the same setup… But the more I think of this problem, and the more interesting this could be. If we allocate 50% total memory to Tachyon statically, then the Mesos benefits of dynamically scheduling resources go away altogether. Can Tachyon be resource managed by Me

rdd.cache() not working ?

2015-03-31 Thread fightf...@163.com
Hi, all Running the following code snippet through spark-shell, however we cannot see any cached storage partitions in the web UI. Does this mean that cache is not working? Because if we issue person.count again we cannot see any time-consuming performance improvement. Hope anyone can explain this for

Re: Broadcasting a parquet file using spark and python

2015-03-31 Thread Jitesh chandra Mishra
Hi Michael, Thanks for your response. I am running 1.2.1. Is there any workaround to achieve the same with 1.2.1? Thanks, Jitesh On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust wrote: > In Spark 1.3 I would expect this to happen automatically when the parquet > table is small (< 10mb, confi

Creating Partitioned Parquet Tables via SparkSQL

2015-03-31 Thread Denny Lee
Creating Parquet tables via .saveAsTable is great but was wondering if there was an equivalent way to create partitioned parquet tables. Thanks!

Anatomy of RDD : Deep dive into RDD data structure

2015-03-31 Thread madhu phatak
Hi, Recently I gave a talk on RDD data structure which gives in depth understanding of spark internals. You can watch it on youtube . Also slides are on slideshare and code is on github

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Xiangrui Meng
I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-6637. Since we don't have a clear answer about how the scaling should be handled. Maybe the best solution for now is to switch back to the 1.2 scaling. -Xiangrui On Tue, Mar 31, 2015 at 2:50 PM, Sean Owen wrote: > Ah yeah I ta

Re: Actor not found

2015-03-31 Thread Shixiong Zhu
Thanks for the log. It's really helpful. I created a JIRA to explain why it will happen: https://issues.apache.org/jira/browse/SPARK-6640 However, will this error always happens in your environment? Best Regards, Shixiong Zhu 2015-03-31 22:36 GMT+08:00 sparkdi : > This is the whole output from

Spark SQL saveAsParquet failed after a few waves

2015-03-31 Thread Yijie Shen
Hi, I am using spark-1.3 prebuilt release with hadoop2.4 support and Hadoop 2.4.0. I wrote a spark application(LoadApp) to generate data in each task and load the data into HDFS as parquet Files (use “saveAsParquet()” in spark sql) When few waves (1 or 2) are used in a job, LoadApp could finish

Minimum slots assigment to Spark on Mesos

2015-03-31 Thread Stratos Dimopoulos
Hi All, I am running Spark & MR on Mesos. Is there a configuration setting for Spark to define the minimum required slots (similar to MapReduce's mapred.mesos.total.reduce.slots.minimum and mapred.mesos.total.map.slots.minimum)? The most related property I see is this: spark.scheduler.minRegiste
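
For reference, a sketch of the Spark-side property mentioned above plus its companion timeout; the property names are from the Spark configuration docs, while the 0.8 ratio and 30-second wait are purely illustrative values:

    val conf = new org.apache.spark.SparkConf()
      // Wait until at least 80% of the requested resources have registered...
      .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")
      // ...or until 30 seconds (in ms here) have passed, whichever comes first.
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30000")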

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Haoyuan Li
Ankur, Response inline. On Tue, Mar 31, 2015 at 4:49 PM, Ankur Chauhan wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Hi Haoyuan, > > So on each mesos slave node I should allocate/section off some amount > of memory for tachyon (let's say 50% of the total memory) and the rest > fo

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-31 Thread seglo
Thanks hbogert. There it is plain as day; it can't find my spark binaries. I thought it was enough to set SPARK_EXECUTOR_URI in my spark-env.sh since this is all that's necessary to run spark-shell.sh against a mesos master, but I also had to set spark.executor.uri in my spark-defaults.conf (or i

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
In my experiment, if I do not call gc() explicitly, the shuffle files will not be cleaned until the whole job finishes… I don't know why; maybe the RDD could not be GCed implicitly. In my situation, a full GC in the driver takes about 10 seconds, so I start a thread in the driver to do GC like this : (do
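
A minimal sketch of the kind of driver-side GC thread described here, assuming it is acceptable to force a periodic full GC so that dereferenced shuffle/RDD metadata becomes eligible for cleanup; the 10-minute interval is illustrative:

    // Hypothetical helper: periodically force a GC in the driver.
    val gcThread = new Thread(new Runnable {
      override def run(): Unit = {
        while (!Thread.currentThread().isInterrupted) {
          System.gc()
          Thread.sleep(10 * 60 * 1000L) // sleep 10 minutes between GCs
        }
      }
    })
    gcThread.setDaemon(true)
    gcThread.start()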

Re: spark.sql.Row manipulation

2015-03-31 Thread Michael Armbrust
You can do something like: df.collect().map { case Row(name: String, age1: Int, age2: Int) => ... } On Tue, Mar 31, 2015 at 4:05 PM, roni wrote: > I have 2 paraquet files with format e.g name , age, town > I read them and then join them to get all the names which are in both > towns . > t
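
Spelled out slightly more fully (a sketch; the column names and types are assumed from this thread, and df is the joined DataFrame):

    import org.apache.spark.sql.Row

    // Assumes df has exactly these three columns with these types.
    val results = df.collect().map {
      case Row(name: String, age1: Int, age2: Int) => (name, age1, age2)
    }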

RE: SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread java8964
You can use the HiveContext instead of SQLContext, which should support all the HiveQL, including lateral view explode. SQLContext is not supporting that yet. BTW, nice coding format in the email. Yong Date: Tue, 31 Mar 2015 18:18:19 -0400 Subject: Re: SparkSql - java.util.NoSuchElementException:
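
A minimal sketch of the HiveContext route suggested above; the table and column names are placeholders:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // LATERAL VIEW explode(...) is HiveQL, so it needs HiveContext rather than SQLContext.
    val exploded = hiveContext.sql(
      """SELECT d.id, t.node
        |FROM docs d
        |LATERAL VIEW explode(d.nodes) t AS node""".stripMargin)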

Re: Query REST web service with Spark?

2015-03-31 Thread Ashish Rangole
All you need is a client to the target REST service in your Spark task. It could be as simple as a HttpClient. Most likely that client won't be serializable in which case you initialize it lazily. There are useful examples in Spark knowledge base gitbook that you can look at. On Mar 31, 2015 1:48 P
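
A minimal sketch of calling a REST service from inside a Spark task as described above, so that no non-serializable client object is captured in the closure; the endpoint URL is a hypothetical placeholder and records is assumed to be an RDD of string IDs:

    import scala.io.Source

    // The HTTP call is made inside the task on the executor, so nothing
    // non-serializable needs to travel with the closure.
    val augmented = records.mapPartitions { iter =>
      iter.map { id =>
        val body = Source.fromURL(s"http://rest-service.example.com/lookup/$id").mkString
        (id, body)
      }
    }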

Re: Query REST web service with Spark?

2015-03-31 Thread Todd Nist
Here are a few ways to achieve what your loolking to do: https://github.com/cjnolet/spark-jetty-server Spark Job Server - https://github.com/spark-jobserver/spark-jobserver - defines a REST API for Spark Hue - http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-you

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi Haoyuan, So on each mesos slave node I should allocate/section off some amount of memory for tachyon (let's say 50% of the total memory) and the rest for regular mesos tasks? This means, on each slave node I would have tachyon worker (+ hdfs confi

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Haoyuan Li
Tachyon should be co-located with Spark in this case. Best, Haoyuan On Tue, Mar 31, 2015 at 4:30 PM, Ankur Chauhan wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Hi, > > I am fairly new to the spark ecosystem and I have been trying to setup > a spark on mesos deployment. I can't

deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, I am fairly new to the spark ecosystem and I have been trying to setup a spark on mesos deployment. I can't seem to figure out the "best practices" around HDFS and Tachyon. The documentation about Spark's data-locality section seems to point that

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Anny Chen
Thanks Petar and Akhil for the suggestion. Actually I changed the SPARK_MASTER_IP to the internal-ip, deleted the "export SPARK_PUBLIC_DNS=xx" line in the spark-env.sh and also edited the /etc/hosts as Akhil suggested, and now it is working! However I don't know which change actually makes it

spark.sql.Row manipulation

2015-03-31 Thread roni
I have 2 parquet files with format e.g. name, age, town. I read them and then join them to get all the names which are in both towns. The resultant dataset is res4: Array[org.apache.spark.sql.Row] = Array([name1, age1, town1,name2,age2,town2]) name1 and name2 are the same as I am joining .

Re: SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread Todd Nist
So in looking at this a bit more, I gather the root cause is the fact that the nested fields are represented as rows within rows, is that correct? If I don't know the size of the json array (it varies), using x.getAs[Row](0).getString(0) is not really a valid solution. Is the solution to apply a

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Sean Owen
Ah yeah I take your point. The squared error term is over the whole user-item matrix, technically, in the implicit case. I suppose I am used to assuming that the 0 terms in this matrix are weighted so much less (because alpha is usually large-ish) that they're almost not there, but they are. So I h

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
Hi Udit, The persisted RDDs in memory are cleared by Spark using an LRU policy, and you can also set the time to clear the persisted RDDs and metadata by setting spark.cleaner.ttl (default infinite). But I am not aware of any property to clean the older shuffle writes from disk. thanks, b

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Xiangrui Meng
Hey Sean, That is true for explicit model, but not for implicit. The ALS-WR paper doesn't cover the implicit model. In implicit formulation, a sub-problem (for v_j) is: min_{v_j} \sum_i c_ij (p_ij - u_i^T v_j)^2 + lambda * X * \|v_j\|_2^2 This is a sum for all i but not just the users who rate i

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-03-31 Thread Xiangrui Meng
I cannot reproduce this error on master, but I'm not aware of any recent bug fixes that are related. Could you build and try the current master? -Xiangrui On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa wrote: > Hi all, > > DataFrame with an user defined type (here mllib.Vector) created with >

joining multiple parquet files

2015-03-31 Thread roni
Hi, I have 4 parquet files and I want to find data which is common in all of them, e.g. SELECT TableA.*, TableB.*, TableC.*, TableD.* FROM (TableB INNER JOIN TableA ON TableB.aID = TableA.aID) INNER JOIN TableC ON (TableB.cID = TableC.cID) INNER JOIN TableA ta ON (ta.dID = TableD.dID) W
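
A sketch of one way to express this in Spark SQL (1.2/1.3 style), assuming the four tables come from parquet files; the paths and key names are placeholders and the join conditions mirror the (partly truncated) query above:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // Register each parquet file as a temporary table, then join them in SQL.
    sqlContext.parquetFile("/data/a.parquet").registerTempTable("TableA")
    sqlContext.parquetFile("/data/b.parquet").registerTempTable("TableB")
    sqlContext.parquetFile("/data/c.parquet").registerTempTable("TableC")
    sqlContext.parquetFile("/data/d.parquet").registerTempTable("TableD")

    val common = sqlContext.sql(
      """SELECT *
        |FROM TableA a
        |JOIN TableB b ON a.aID = b.aID
        |JOIN TableC c ON b.cID = c.cID
        |JOIN TableD d ON a.dID = d.dID""".stripMargin)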

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread Xiangrui Meng
Hey Guoqiang and Sendong, Could you comment on the overhead of calling gc() explicitly? The shuffle files should get cleaned in a few seconds after checkpointing, but it is certainly possible to accumulates TBs of files in a few seconds. In this case, calling gc() may work the same as waiting for

Re: Query REST web service with Spark?

2015-03-31 Thread Burak Yavuz
Hi, If I recall correctly, I've read people integrating REST calls to Spark Streaming jobs in the user list. I don't imagine any cases for why it shouldn't be possible. Best, Burak On Tue, Mar 31, 2015 at 1:46 PM, Minnow Noir wrote: > We have have some data on Hadoop that needs augmented with

Did anybody run Spark-perf on powerpc?

2015-03-31 Thread Tom
We verified it runs on x86, and are now trying to run it on powerPC. We currently run into dependency trouble with sbt. I tried installing sbt by hand and resolving all dependencies by hand, but must have made an error, as I still get errors. Original error: Getting org.scala-sbt sbt 0.13.6 ... :

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
Thanks for the reply. This will reduce the shuffle write to disk to an extent but for a long running job(multiple days), the shuffle write would still occupy a lot of space on disk. Why do we need to store the data from older map tasks to memory? On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak wrot

Query REST web service with Spark?

2015-03-31 Thread Minnow Noir
We have some data on Hadoop that needs to be augmented with data only available to us via a REST service. We're using Spark to search for, and correct, missing data. Even though there are a lot of records to scour for missing data, the total number of calls to the service is expected to be low, so

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Ted Yu
Jeetendra: Please extract the information you need from Result and return the extracted portion - instead of returning Result itself. Cheers On Tue, Mar 31, 2015 at 1:14 PM, Nan Zhu wrote: > The example in > https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/ex
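
A minimal sketch of the extraction Ted describes, assuming a column family "cf", a qualifier "name", and an hbaseRDD obtained from sc.newAPIHadoopRDD; only plain, serializable Strings leave the map:

    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.util.Bytes

    // Convert the needed cell to a String instead of returning the
    // non-serializable Result object itself.
    val names = hbaseRDD.map { case (_, result: Result) =>
      Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
    }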

--driver-memory parameter doesn't work for spark-submmit on yarn?

2015-03-31 Thread Shuai Zheng
Hi All, Below is my shell script: /home/hadoop/spark/bin/spark-submit --driver-memory=5G --executor-memory=40G --master yarn-client --class com.***.FinancialEngineExecutor /home/hadoop/lib/my.jar s3://bucket/vriscBatchConf.properties My driver will load some resources and then broa

Using 'fair' scheduler mode

2015-03-31 Thread asadrao
Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the first query is a very expensive query (ex: ‘select *’ on a really big data set) then any subsequent query seems to get blocked. I would have expected the second query to run in parallel since I am using the ‘fair’ scheduler m
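
For reference, a sketch of how FAIR mode and per-thread pools are typically wired up (property names from the Spark docs; the pool name is illustrative):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("fair-scheduling-example")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new org.apache.spark.SparkContext(conf)

    // Jobs submitted from this thread go into their own pool, so a long
    // "select *"-style job need not block later, cheaper queries.
    sc.setLocalProperty("spark.scheduler.pool", "interactive")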

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
The Spark sort-based shuffle (default from 1.1) keeps the data from each map task in memory until it can't fit, after which it is sorted and spilled to disk. You can reduce the shuffle write to disk by increasing spark.shuffle.memoryFraction (default 0.2). By writing the shuffle output to
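
A sketch of the tuning knob mentioned above; 0.4 is purely an illustrative value and trades other executor memory for less shuffle spilling:

    val conf = new org.apache.spark.SparkConf()
      // Default is 0.2; raising it lets map output stay in memory longer
      // before being sorted and spilled to disk.
      .set("spark.shuffle.memoryFraction", "0.4")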

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-03-31 Thread Marcelo Vanzin
Hmmm... could you try to set the log dir to "file:/home/hduser/spark/spark-events"? I checked the code and it might be the case that the behaviour changed between 1.2 and 1.3... On Mon, Mar 30, 2015 at 6:44 PM, Tom Hubregtsen wrote: > The stack trace for the first scenario and your suggested imp

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Nan Zhu
The example in https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala might help Best, -- Nan Zhu http://codingcat.me On Tuesday, March 31, 2015 at 3:56 PM, Sean Owen wrote: > Yep, it's not serializable: > https://hbase.apache.org/apid

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Sean Owen
Yep, it's not serializable: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html You can't return this from a distributed operation since that would mean it has to travel over the network and you haven't supplied any way to convert the thing into bytes. On Tue, Mar 31, 2015

java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Jeetendra Gangele
When I am trying to get the result from HBase and running the mapToPair function of the RDD, it gives the error java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result Here is the code // private static JavaPairRDD getCompanyDataRDD(JavaSparkContext sc) throws IOException { // return sc.

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
I have noticed a similar issue when using spark streaming. The spark shuffle write size increases to a large size(in GB) and then the app crashes saying: java.io.FileNotFoundException: /data/vol0/nodemanager/usercache/$user/appcache/application_1427480955913_0339/spark-local-20150330231234-db1a/0b/

SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread Todd Nist
I am accessing ElasticSearch via the elasticsearch-hadoop and attempting to expose it via SparkSQL. I am using spark 1.2.1, latest supported by elasticsearch-hadoop, and "org.elasticsearch" % "elasticsearch-hadoop" % "2.1.0.BUILD-SNAPSHOT" of elasticsearch-hadoop. I’m encountering an issue when I a

Re: Broadcasting a parquet file using spark and python

2015-03-31 Thread Michael Armbrust
In Spark 1.3 I would expect this to happen automatically when the parquet table is small (< 10mb, configurable with spark.sql.autoBroadcastJoinThreshold). If you are running 1.3 and not seeing this, can you show the code you are using to create the table? On Tue, Mar 31, 2015 at 3:25 AM, jitesh129
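
A sketch of adjusting that threshold from code (Spark 1.3 style); the 50 MB value is illustrative:

    // Threshold is in bytes; Parquet tables smaller than this are broadcast
    // to every executor instead of being shuffled for the join.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)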

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Anny Chen
Hi Akhil, Thanks for the explanation! I could ping the worker from the master using either host name or internal-ip, however I am a little confused why setting SPARK_LOCAL_IP would help? Thanks! Anny On Tue, Mar 31, 2015 at 10:36 AM, Akhil Das wrote: > When you say you added , where you able

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Akhil Das
When you say you added , were you able to ping any of these from the machine? You could try setting SPARK_LOCAL_IP on all machines. But make sure you will be able to bind to the host/ip specified there. Thanks Best Regards On Tue, Mar 31, 2015 at 10:49 PM, Anny Chen wrote: > Hi Akhil, > >

Re: How to setup a Spark Cluter?

2015-03-31 Thread Akhil Das
It's pretty simple: pick one machine as master (say machine A), and let's call the workers B, C, and D. Login to A: - Enable password-less authentication (ssh-keygen) - Add A's ~/.ssh/id_rsa.pub to B, C, D's ~/.ssh/authorized_keys file - Download spark binary (that supports your hadoop version)

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Petar Zecevic
Did you try setting the SPARK_MASTER_IP parameter in spark-env.sh? On 31.3.2015. 19:19, Anny Chen wrote: Hi Akhil, I tried editing the /etc/hosts on the master and on the workers, and seems it is not working for me. I tried adding and it didn't work. I then tried adding and it didn't

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Anny Chen
Hi Akhil, I tried editing the /etc/hosts on the master and on the workers, and seems it is not working for me. I tried adding and it didn't work. I then tried adding and it didn't work either. I guess I should also edit the spark-env.sh file? Thanks! Anny On Mon, Mar 30, 2015 at 11:15 PM, A

How to setup a Spark Cluter?

2015-03-31 Thread bhushansc007
Hi All, I am quite new to Spark. So please pardon me if it is a very basic question. I have setup a Hadoop cluster using Hortonwork's Ambari. It has 1 Master and 3 Worker nodes. Currently, it has HDFS, Yarn, MapReduce2, HBase and ZooKeeper services installed. Now, I want to install Spark on it

Re: refer to dictionary

2015-03-31 Thread Peng Xia
Hi Ted, Thanks very much, yea, using broadcast is much faster. Best, Peng On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu wrote: > You can use broadcast variable. > > See also this thread: > > http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variable&subj=How+Broadcast+variable+scale+ > > > > >

"Ambiguous references" to a field set in a partitioned table AND the data

2015-03-31 Thread nfo
Hi, I save Parquet files in a partitioned table, so in a path looking like /path/to/table/myfield=a/ . But I also kept the field "myfield" in the Parquet data. Thus, when I query the field, I get this error: df.select("myfield").show(10) "Exception in thread "main" org.apache.spark.sql.AnalysisEx

Hygienic closures for scala function serialization

2015-03-31 Thread Erik Erlandson
Under certain conditions, Scala will pull an entire class instance into a closure instead of just particular value members of that class, which can knee-cap Spark's serialization of functions in multiple ways. I tripped over this last week, and wrote up the experience on my blog. I've since n
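
The usual workaround for the problem described here, as a sketch: copy the needed member into a local val so the closure captures only that value rather than the enclosing (possibly non-serializable) instance.

    class Scaler(val factor: Double, sc: org.apache.spark.SparkContext) {
      def scaleBad(rdd: org.apache.spark.rdd.RDD[Double]) =
        rdd.map(_ * factor)   // references this.factor, pulling the whole instance into the closure

      def scaleGood(rdd: org.apache.spark.rdd.RDD[Double]) = {
        val f = factor        // copy to a local val; the closure now captures only f
        rdd.map(_ * f)
      }
    }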

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
GuoQiang's method works very well; it only takes 1TB of disk now. Thank you very much! > On Mar 31, 2015, at 4:47 PM, GuoQiang Li wrote: > > You can try to enforce garbage collection: > > /** Run GC and make sure it actually has run */ > def runGC() { > val weakRef = new WeakReference(new

Re: Parquet Hive table become very slow on 1.3?

2015-03-31 Thread Zheng, Xudong
Thanks Cheng! Set 'spark.sql.parquet.useDataSourceApi' to false resolves my issues, but the PR 5231 seems not. Not sure any other things I did wrong ... BTW, actually, we are very interested in the schema merging feature in Spark 1.3, so both these two solution will disable this feature, right? I

pyspark error with zip

2015-03-31 Thread Charles Hayden
The following program fails in the zip step. x = sc.parallelize([1, 2, 3, 1, 2, 3]) y = sc.parallelize([1, 2, 3]) z = x.distinct() print x.zip(y).collect() The error that is produced depends on whether multiple partitions have been specified or not. I understand that the two RDDs [must] ha

Re: Actor not found

2015-03-31 Thread sparkdi
This is the whole output from the shell: ~/spark-1.3.0-bin-hadoop2.4$ sudo bin/spark-shell Spark assembly has been built with Hive, including Datanucleus jars on classpath log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN Please in

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-31 Thread Vincent He
It works,thanks for your great help. On Mon, Mar 30, 2015 at 10:07 PM, Denny Lee wrote: > Hi Vincent, > > This may be a case that you're missing a semi-colon after your CREATE > TEMPORARY TABLE statement. I ran your original statement (missing the > semi-colon) and got the same error as you did

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-31 Thread Ted Yu
Can you show us the output of DStream#print() if you have it ? Thanks On Tue, Mar 31, 2015 at 2:55 AM, Nicolas Phung wrote: > Hello, > > @Akhil Das I'm trying to use the experimental API > https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/s

"Ambiguous references" to a field set in a partitioned table AND the data

2015-03-31 Thread Nicolas Fouché
Hi, I save Parquet files in a partitioned table, so in /path/to/table/myfield=a/ . But I also kept the field "myfield" in the Parquet data. Thus, when I query the field, I get this error: df.select("myfield").show(10) "Exception in thread "main" org.apache.spark.sql.AnalysisException: Ambigu

Re: can't union two rdds

2015-03-31 Thread ankurjain.nitrr
RDD union will result in 1 2 3 4 5 6 7 8 9 10 11 12. What you are trying to do is a join. There must be a logic/key to perform the join operation; I think in your case you want the order (index) to be the joining key here. An RDD is a distributed data structure and is not apt for your cas
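
A sketch of the index-join idea suggested above, pairing elements of two RDDs by position (assumes both RDDs have the same length; sc is the SparkContext):

    // Key both RDDs by element index, then join on that index.
    val a = sc.parallelize(Seq(1, 2, 3, 4, 5, 6)).zipWithIndex().map(_.swap)
    val b = sc.parallelize(Seq(7, 8, 9, 10, 11, 12)).zipWithIndex().map(_.swap)
    val paired = a.join(b).values   // positionally matched (Int, Int) pairs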

Spark sql query fails with executor lost/ out of memory expection while caching a table

2015-03-31 Thread ankurjain.nitrr
Hi, I am using spark 1.2.1 I am using thrift server to query my data. while executing query "CACHE TABLE tablename" Fails with exception Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1

Re: can't union two rdds

2015-03-31 Thread roy
use zip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-union-two-rdds-tp22320p22321.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-

Re: refer to dictionary

2015-03-31 Thread Ted Yu
You can use broadcast variable. See also this thread: http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variable&subj=How+Broadcast+variable+scale+ > On Mar 31, 2015, at 4:43 AM, Peng Xia wrote: > > Hi, > > I have a RDD (rdd1)where each line is split into an array ["a", "b", "c], etc.

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Sean Owen
I had always understood the formulation to be the first option you describe. Lambda is scaled by the number of items the user has rated / interacted with. I think the goal is to avoid fitting the tastes of prolific users disproportionately just because they have many ratings to fit. This is what's

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-31 Thread hbogert
Well, those are only the logs of the slaves at the Mesos level. I'm not sure from your reply whether you can ssh into a specific slave or not; if you can, you should look at the actual output of the application (spark in this case) on a slave in e.g. /tmp/mesos/slaves/20150322-040336-606645514-5050-2744-S1/fr

refer to dictionary

2015-03-31 Thread Peng Xia
Hi, I have an RDD (rdd1) where each line is split into an array ["a", "b", "c"], etc. And I also have a local dictionary (dict1) that stores the key-value pairs {"a": 1, "b": 2, "c": 3}. I want to replace the keys in the RDD with their corresponding values from the dict: rdd1.map(lambda line: [dict1[item] for item
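
A sketch of the broadcast-based approach suggested in the replies above, written in Scala; the sample data mirrors the description in the question:

    // Ship the small lookup table to every executor once, instead of
    // capturing it in every task closure.
    val dict = Map("a" -> 1, "b" -> 2, "c" -> 3)
    val bDict = sc.broadcast(dict)

    val rdd1 = sc.parallelize(Seq(Array("a", "b", "c"), Array("c", "a")))
    val replaced = rdd1.map(_.map(bDict.value))   // Array("a","b","c") -> Array(1, 2, 3)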

Re: spark there is no space on the disk

2015-03-31 Thread Peng Xia
Yes, we have just modified the configuration, and every thing works fine. Thanks very much for the help. On Thu, Mar 19, 2015 at 5:24 PM, Ted Yu wrote: > For YARN, possibly this one ? > > > yarn.nodemanager.local-dirs > /hadoop/yarn/local > > > Cheers > > On Thu, Mar 19, 20

Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-03-31 Thread Jaonary Rabarisoa
Hi all, DataFrame with an user defined type (here mllib.Vector) created with sqlContex.createDataFrame can't be saved to parquet file and raise ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row error. Here is an example of code to reproduce t

Broadcasting a parquet file using spark and python

2015-03-31 Thread jitesh129
How can we implement a BroadcastHashJoin for spark with python? My SparkSQL inner joins are taking a lot of time since it is performing ShuffledHashJoin. Tables on which join is performed are stored as parquet files. Please help. Thanks and regards, Jitesh -- View this message in context: h

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-31 Thread Nicolas Phung
Hello, @Akhil Das I'm trying to use the experimental API https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
Thank you, @GuoQiang I will try to add runGC() to ALS.scala, and if it works for deleting the shuffle data, I will tell you :-) > On Mar 31, 2015, at 4:47 PM, GuoQiang Li wrote: > > You can try to enforce garbage collection: > > /** Run GC and make sure it actually has run */ > def runGC(

Re: Error in Delete Table

2015-03-31 Thread Masf
Hi Ted. Spark 1.2.0 an Hive 0.13.1 Regards. Miguel Angel. On Tue, Mar 31, 2015 at 10:37 AM, Ted Yu wrote: > Which Spark and Hive release are you using ? > > Thanks > > > > > On Mar 27, 2015, at 2:45 AM, Masf wrote: > > > > Hi. > > > > In HiveContext, when I put this statement "DROP TABLE IF

Re: Error in Delete Table

2015-03-31 Thread Ted Yu
Which Spark and Hive release are you using ? Thanks > On Mar 27, 2015, at 2:45 AM, Masf wrote: > > Hi. > > In HiveContext, when I put this statement "DROP TABLE IF EXISTS TestTable" > If TestTable doesn't exist, spark returns an error: > > > > ERROR Hive: NoSuchObjectException(message:def

Re: JettyUtils.createServletHandler Method not Found?

2015-03-31 Thread kmader
Yes, this private is checked at compile time and my class is in a subpackage of org.apache.spark.ui, so the visibility is not the issue, or at least not as far as I can tell. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/JettyUtils-createServletHandler-Met

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
In my transformSchema I do specify that the output column type is a VectorUDT: override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = { val map = this.paramMap ++ paramMap checkInputColumn(schema, map(inputCol), ArrayType(FloatType, false)) addOutputColumn(schem

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Shivaram Venkataraman
My guess is that the `createDataFrame` call is failing here. Can you check if the schema being passed to it includes the column name and type for the newly being zipped `features` ? Joseph probably knows this better, but AFAIK the DenseVector here will need to be marked as a VectorUDT while creat

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
I have updated my Spark source code to 1.3.1. The checkpoint works well, BUT the shuffle data still cannot be deleted automatically…the disk usage is still 30TB… I have set spark.cleaner.referenceTracking.blocking.shuffle to true. Do you know how to solve my problem? Sendong Li > On 2

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
Following your suggestion, I end up with the following implementation: override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = { val schema = transformSchema(dataSet.schema, paramMap, logging = true) val map = this.paramMap ++ paramMap val features = da

Re: log4j.properties in jar

2015-03-31 Thread Emre Sevinc
Hello Udit, Yes, what you ask is possible. If you follow the Spark documentation and tutorial about how to build stand-alone applications, you can see that it is possible to build a stand-alone, über-JAR file that includes everything. For example, if you want to suppress some messages by modifyin

Re: Parquet Hive table become very slow on 1.3?

2015-03-31 Thread Cheng Lian
Hi Xudong, This is probably because of Parquet schema merging is turned on by default. This is generally useful for Parquet files with different but compatible schemas. But it needs to read metadata from all Parquet part-files. This can be problematic when reading Parquet files with lots of p

workers no route to host

2015-03-31 Thread ZhuGe
Hi, I set up a standalone cluster of 5 machines (tmaster, tslave1,2,3,4) with spark-1.3.0-cdh5.4.0-snapshot. When I execute sbin/start-all.sh, the master is OK, but I can't see the web UI. Moreover, the worker logs show something like this: Spark assembly has been built with Hive, including Data