Parquet Hive table becomes very slow on 1.3?

2015-03-31 Thread Zheng, Xudong
Hi all, We are using a Parquet Hive table, and we are upgrading to Spark 1.3. But we find that even a simple COUNT(*) query is much slower (100x) than on Spark 1.2. Most of the time is spent on the driver getting HDFS blocks, and I see a large number of logs like the one below printed: 15/03/30 23:03:43 DEBUG

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Akhil Das
You can add an internal-IP-to-public-hostname mapping in your /etc/hosts file; if your forwarding is proper then it wouldn't be a problem thereafter. Thanks Best Regards On Tue, Mar 31, 2015 at 9:18 AM, anny9699 anny9...@gmail.com wrote: Hi, For security reasons, we added a server between
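
A minimal sketch of such an /etc/hosts mapping (the IP and hostname below are hypothetical):

    # internal IP       public hostname
    10.0.12.34          ec2-54-0-0-1.compute-1.amazonaws.com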

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
I have updated my Spark source code to 1.3.1. The checkpoint works well, BUT the shuffle data still cannot be deleted automatically…the disk usage is still 30TB… I have set spark.cleaner.referenceTracking.blocking.shuffle to true. Do you know how to solve my problem? Sendong Li 在

Re: JettyUtils.createServletHandler Method not Found?

2015-03-31 Thread kmader
Yes, this private is checked at compile time and my class is in a subpackage of org.apache.spark.ui, so the visibility is not the issue, or at least not as far as I can tell. -- View this message in context:

Re: Error in Delete Table

2015-03-31 Thread Ted Yu
Which Spark and Hive release are you using ? Thanks On Mar 27, 2015, at 2:45 AM, Masf masfwo...@gmail.com wrote: Hi. In HiveContext, when I put this statement DROP TABLE IF EXISTS TestTable If TestTable doesn't exist, spark returns an error: ERROR Hive:

Re: log4j.properties in jar

2015-03-31 Thread Emre Sevinc
Hello Udit, Yes, what you ask is possible. If you follow the Spark documentation and tutorial about how to build stand-alone applications, you can see that it is possible to build a stand-alone, über-JAR file that includes everything. For example, if you want to suppress some messages by
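
For example, a log4j.properties along these lines could be bundled into the über-JAR (the logger names and levels below are only illustrative):

    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
    # silence a noisy package
    log4j.logger.org.apache.spark=WARN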

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Shivaram Venkataraman
My guess is that the `createDataFrame` call is failing here. Can you check if the schema being passed to it includes the column name and type for the newly zipped `features` column? Joseph probably knows this better, but AFAIK the DenseVector here will need to be marked as a VectorUDT while
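
A rough sketch of the kind of schema being described, with the zipped column declared as a vector type. This assumes VectorUDT is visible from your code (it is package-private in some 1.x releases); the column names are hypothetical:

    import org.apache.spark.mllib.linalg.VectorUDT
    import org.apache.spark.sql.types._

    // the schema handed to createDataFrame must name and type the new column
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = false),
      StructField("features", new VectorUDT(), nullable = false)))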

Re: Error in Delete Table

2015-03-31 Thread Masf
Hi Ted. Spark 1.2.0 and Hive 0.13.1. Regards. Miguel Angel. On Tue, Mar 31, 2015 at 10:37 AM, Ted Yu yuzhih...@gmail.com wrote: Which Spark and Hive release are you using ? Thanks On Mar 27, 2015, at 2:45 AM, Masf masfwo...@gmail.com wrote: Hi. In HiveContext, when I put this

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
Thank you, @GuoQiang. I will try to add runGC() to ALS.scala, and if it works for deleting the shuffle data, I will tell you :-) On Mar 31, 2015, at 4:47, GuoQiang Li wi...@qq.com wrote: You can try to enforce garbage collection: /** Run GC and make sure it actually has run */ def
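
For reference, a sketch of such a runGC() helper, modeled on the one used in Spark's own test utilities (treat it as an approximation, not the exact code quoted above):

    import java.lang.ref.WeakReference

    /** Run GC and make sure it actually has run */
    def runGC() {
      val weakRef = new WeakReference(new Object())
      val startTime = System.currentTimeMillis
      System.gc() // make a best effort to run the garbage collection
      System.runFinalization()
      // block until the weakly referenced object has actually been collected
      while (weakRef.get != null) {
        System.gc()
        System.runFinalization()
        Thread.sleep(200)
        if (System.currentTimeMillis - startTime > 10000) {
          throw new Exception("automatically cleanup error")
        }
      }
    }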

workers no route to host

2015-03-31 Thread ZhuGe
Hi, I set up a standalone cluster of 5 machines (tmaster, tslave1,2,3,4) with spark-1.3.0-cdh5.4.0-snapshot. When I execute sbin/start-all.sh, the master is OK, but I can't see the web UI. Moreover, the worker logs say something like this: Spark assembly has been built with Hive, including

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
Following your suggestion, I end up with the following implementation: override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = { val schema = transformSchema(dataSet.schema, paramMap, logging = true) val map = this.paramMap ++ paramMap val features =

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-31 Thread Jaonary Rabarisoa
In my transformSchema I do specify that the output column type is a VectorUDT: override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = { val map = this.paramMap ++ paramMap checkInputColumn(schema, map(inputCol), ArrayType(FloatType, false))

Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-03-31 Thread Jaonary Rabarisoa
Hi all, a DataFrame with a user-defined type (here mllib.Vector) created with sqlContext.createDataFrame can't be saved to a parquet file and raises a ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row error. Here is an example of code to reproduce

Re: spark there is no space on the disk

2015-03-31 Thread Peng Xia
Yes, we have just modified the configuration, and everything works fine. Thanks very much for the help. On Thu, Mar 19, 2015 at 5:24 PM, Ted Yu yuzhih...@gmail.com wrote: For YARN, possibly this one? <property> <name>yarn.nodemanager.local-dirs</name>
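
For reference, the YARN property Ted mentions lives in yarn-site.xml and looks roughly like this (the directory list below is only an example):

    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/mnt/yarn/local,/mnt2/yarn/local</value>
    </property>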

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-31 Thread hbogert
Well, those are only the logs of the slaves at the Mesos level. I'm not sure from your reply whether you can SSH into a specific slave or not; if you can, you should look at the actual output of the application (Spark in this case) on a slave in e.g.

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-31 Thread Nicolas Phung
Hello, @Akhil Das I'm trying to use the experimental API https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala

Broadcasting a parquet file using spark and python

2015-03-31 Thread jitesh129
How can we implement a BroadcastHashJoin for spark with python? My SparkSQL inner joins are taking a lot of time since it is performing ShuffledHashJoin. Tables on which join is performed are stored as parquet files. Please help. Thanks and regards, Jitesh -- View this message in context:

Re: refer to dictionary

2015-03-31 Thread Ted Yu
You can use a broadcast variable. See also this thread: http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variablesubj=How+Broadcast+variable+scale+ On Mar 31, 2015, at 4:43 AM, Peng Xia sparkpeng...@gmail.com wrote: Hi, I have a RDD (rdd1) where each line is split into an array [a,
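
A minimal sketch of the broadcast-variable approach Ted suggests (the dictionary contents and RDD below are made up for illustration):

    // build the lookup table once on the driver and broadcast it,
    // then read it inside closures instead of shipping it with every task
    val dict = Map("a" -> 1, "b" -> 2)
    val dictBc = sc.broadcast(dict)

    val rdd1 = sc.parallelize(Seq("a", "b", "c"))
    val resolved = rdd1.map(key => dictBc.value.getOrElse(key, -1))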

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Sean Owen
I had always understood the formulation to be the first option you describe. Lambda is scaled by the number of items the user has rated / interacted with. I think the goal is to avoid fitting the tastes of prolific users disproportionately just because they have many ratings to fit. This is what's

RE: SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread java8964
You can use the HiveContext instead of SQLContext, which should support all the HiveQL, including lateral view explode. SQLContext is not supporting that yet. BTW, nice coding format in the email. Yong Date: Tue, 31 Mar 2015 18:18:19 -0400 Subject: Re: SparkSql -
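
A rough sketch of the HiveContext route for LATERAL VIEW explode (the JSON source, table, and column names are placeholders):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.jsonFile("events.json").registerTempTable("events")
    // explode a nested array column; this needs HiveQL, which plain SQLContext lacks
    val exploded = hiveContext.sql(
      "SELECT id, n FROM events LATERAL VIEW explode(nodes) t AS n")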

Re: Can't run spark-submit with an application jar on a Mesos cluster

2015-03-31 Thread seglo
Thanks hbogert. There it is plain as day; it can't find my spark binaries. I thought it was enough to set SPARK_EXECUTOR_URI in my spark-env.sh since this is all that's necessary to run spark-shell.sh against a mesos master, but I also had to set spark.executor.uri in my spark-defaults.conf (or
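
The spark-defaults.conf entry being described would look roughly like this (the URI below is only an example; it should point at the Spark distribution the Mesos executors can download):

    spark.executor.uri   hdfs://namenode:8020/dist/spark-1.3.0-bin-hadoop2.4.tgz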

Spark SQL saveAsParquet failed after a few waves

2015-03-31 Thread Yijie Shen
Hi, I am using the spark-1.3 prebuilt release with hadoop2.4 support and Hadoop 2.4.0. I wrote a spark application (LoadApp) to generate data in each task and load the data into HDFS as parquet files (using “saveAsParquet()” in spark sql). When a few waves (1 or 2) are used in a job, LoadApp could

Re: Actor not found

2015-03-31 Thread Shixiong Zhu
Thanks for the log. It's really helpful. I created a JIRA to explain why it will happen: https://issues.apache.org/jira/browse/SPARK-6640 However, does this error always happen in your environment? Best Regards, Shixiong Zhu 2015-03-31 22:36 GMT+08:00 sparkdi shopaddr1...@dubna.us: This is

Anatomy of RDD : Deep dive into RDD data structure

2015-03-31 Thread madhu phatak
Hi, Recently I gave a talk on the RDD data structure which gives an in-depth understanding of Spark internals. You can watch it on YouTube https://www.youtube.com/watch?v=WVdyuVwWcBc. The slides are on SlideShare http://www.slideshare.net/datamantra/anatomy-of-rdd and the code is on github

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
In my experiment, if I do not call gc() explicitly, the shuffle files will not be cleaned until the whole job finishes… I don’t know why; maybe the RDD could not be GCed implicitly. In my situation, a full GC in the driver takes about 10 seconds, so I start a thread in the driver to do GC like this :
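
The message is cut off above; a minimal sketch of what such a periodic-GC thread in the driver could look like (the interval is arbitrary, and this is only an illustration of the idea, not the author's exact code):

    // hypothetical helper: periodically force a full GC on the driver so that
    // weakly referenced shuffle metadata becomes eligible for cleanup
    val gcThread = new Thread(new Runnable {
      override def run(): Unit = {
        while (true) {
          System.gc()
          Thread.sleep(10 * 60 * 1000) // every 10 minutes
        }
      }
    })
    gcThread.setDaemon(true)
    gcThread.start()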

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Xiangrui Meng
I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-6637. Since we don't have a clear answer about how the scaling should be handled, maybe the best solution for now is to switch back to the 1.2 scaling. -Xiangrui On Tue, Mar 31, 2015 at 2:50 PM, Sean Owen so...@cloudera.com

Re: Broadcasting a parquet file using spark and python

2015-03-31 Thread Jitesh chandra Mishra
Hi Michael, Thanks for your response. I am running 1.2.1. Is there any workaround to achieve the same with 1.2.1? Thanks, Jitesh On Wed, Apr 1, 2015 at 12:25 AM, Michael Armbrust mich...@databricks.com wrote: In Spark 1.3 I would expect this to happen automatically when the parquet table is

Re: spark.sql.Row manipulation

2015-03-31 Thread Michael Armbrust
You can do something like: df.collect().map { case Row(name: String, age1: Int, age2: Int) => ... } On Tue, Mar 31, 2015 at 4:05 PM, roni roni.epi...@gmail.com wrote: I have 2 parquet files with format e.g name , age, town I read them and then join them to get all the names which are in

Minimum slots assigment to Spark on Mesos

2015-03-31 Thread Stratos Dimopoulos
Hi All, I am running Spark MR on Mesos. Is there a configuration setting for Spark to define the minimum required slots (similar to MapReduce's mapred.mesos.total.reduce.slots.minimum and mapred.mesos.total.map.slots.minimum)? The most related property I see is this: spark.scheduler.

Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-31 Thread Ted Yu
Can you show us the output of DStream#print() if you have it ? Thanks On Tue, Mar 31, 2015 at 2:55 AM, Nicolas Phung nicolas.ph...@gmail.com wrote: Hello, @Akhil Das I'm trying to use the experimental API

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-31 Thread Vincent He
It works, thanks for your great help. On Mon, Mar 30, 2015 at 10:07 PM, Denny Lee denny.g@gmail.com wrote: Hi Vincent, This may be a case that you're missing a semi-colon after your CREATE TEMPORARY TABLE statement. I ran your original statement (missing the semi-colon) and got the same

Re: can't union two rdds

2015-03-31 Thread ankurjain.nitrr
RDD union will result in 1 2 3 4 5 6 7 8 9 10 11 12. What you are trying to do is a join. There must be a logic/key to perform the join operation; I think in your case you want the order (index) to be the joining key, as sketched below. An RDD is a distributed data structure and is not apt for your
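
A sketch of the index-as-key approach being suggested, assuming the two RDDs have the same length and the desired pairing is by position:

    import org.apache.spark.SparkContext._

    val a = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
    val b = sc.parallelize(Seq(7, 8, 9, 10, 11, 12))

    // key each element by its position, then join on that index
    val byIndexA = a.zipWithIndex().map(_.swap) // (index, value)
    val byIndexB = b.zipWithIndex().map(_.swap)
    val paired = byIndexA.join(byIndexB).sortByKey().values // (1,7), (2,8), ...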

Re: Actor not found

2015-03-31 Thread sparkdi
This is the whole output from the shell: ~/spark-1.3.0-bin-hadoop2.4$ sudo bin/spark-shell Spark assembly has been built with Hive, including Datanucleus jars on classpath log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN Please

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Sean Bigdatafun
(resending...) I was thinking the same setup… But the more I think about this problem, the more interesting it becomes. If we allocate 50% of total memory to Tachyon statically, then the Mesos benefits of dynamically scheduling resources go away altogether. Can Tachyon be resource managed by

rdd.cache() not working ?

2015-03-31 Thread fightf...@163.com
Hi, all Running the following code snippet through spark-shell, we cannot see any cached storage partitions in the web UI. Does this mean that cache is not working? Because if we issue person.count again, we cannot see any performance improvement. Hope anyone can explain this

How to setup a Spark Cluter?

2015-03-31 Thread bhushansc007
Hi All, I am quite new to Spark, so please pardon me if it is a very basic question. I have set up a Hadoop cluster using Hortonworks' Ambari. It has 1 Master and 3 Worker nodes. Currently, it has HDFS, Yarn, MapReduce2, HBase and ZooKeeper services installed. Now, I want to install Spark on

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
GuoQiang's method works very well; it only takes 1TB of disk now. Thank you very much! On Mar 31, 2015, at 4:47, GuoQiang Li wi...@qq.com wrote: You can try to enforce garbage collection: /** Run GC and make sure it actually has run */ def runGC() { val weakRef = new

Re: refer to dictionary

2015-03-31 Thread Peng Xia
Hi Ted, Thanks very much, yea, using broadcast is much faster. Best, Peng On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu yuzhih...@gmail.com wrote: You can use broadcast variable. See also this thread:

Ambiguous references to a field set in a partitioned table AND the data

2015-03-31 Thread nfo
Hi, I save Parquet files in a partitioned table, so in a path looking like /path/to/table/myfield=a/ . But I also kept the field myfield in the Parquet data. Thus, when I query the field, I get this error: df.select(myfield).show(10) Exception in thread main

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
I have noticed a similar issue when using spark streaming. The spark shuffle write size increases to a large size (in GB) and then the app crashes saying: java.io.FileNotFoundException:

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Sean Owen
Yep, it's not serializable: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html You can't return this from a distributed operation since that would mean it has to travel over the network and you haven't supplied any way to convert the thing into bytes. On Tue, Mar 31,

Re: Broadcasting a parquet file using spark and python

2015-03-31 Thread Michael Armbrust
In Spark 1.3 I would expect this to happen automatically when the parquet table is small (< 10mb, configurable with spark.sql.autoBroadcastJoinThreshold). If you are running 1.3 and not seeing this, can you show the code you are using to create the table? On Tue, Mar 31, 2015 at 3:25 AM, jitesh129
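
For reference, the threshold can be raised from code if the smaller table is bigger than the default (the value is in bytes; 50 MB below is just an example):

    // allow tables up to ~50 MB to be broadcast in joins
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)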

SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread Todd Nist
I am accessing ElasticSearch via elasticsearch-hadoop and attempting to expose it via SparkSQL. I am using spark 1.2.1, the latest supported by elasticsearch-hadoop, and "org.elasticsearch" % "elasticsearch-hadoop" % "2.1.0.BUILD-SNAPSHOT" of elasticsearch-hadoop. I’m encountering an issue when I

java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Jeetendra Gangele
When I am trying to get the result from HBase and running the mapToPair function of the RDD, it's giving the error java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result. Here is the code // private static JavaPairRDD<Integer, Result> getCompanyDataRDD(JavaSparkContext sc) throws IOException

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Anny Chen
Hi Akhil, I tried editing the /etc/hosts on the master and on the workers, and it seems it is not working for me. I tried adding hostname internal-ip and it didn't work. I then tried adding internal-ip hostname and it didn't work either. I guess I should also edit the spark-env.sh file? Thanks!

Re: How to setup a Spark Cluter?

2015-03-31 Thread Akhil Das
It's pretty simple: pick one machine as master (say machine A), and let's call the workers B, C, and D. Login to A: - Enable password-less authentication (ssh-keygen) - Add A's ~/.ssh/id_rsa.pub to B,C,D's ~/.ssh/authorized_keys file - Download spark binary (that supports your hadoop

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Petar Zecevic
Did you try setting the SPARK_MASTER_IP parameter in spark-env.sh? On 31.3.2015. 19:19, Anny Chen wrote: Hi Akhil, I tried editing the /etc/hosts on the master and on the workers, and seems it is not working for me. I tried adding hostname internal-ip and it didn't work. I then tried

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Akhil Das
When you say you added internal-ip hostname, were you able to ping any of these from the machine? You could try setting SPARK_LOCAL_IP on all machines. But make sure you will be able to bind to that host/ip specified there. Thanks Best Regards On Tue, Mar 31, 2015 at 10:49 PM, Anny Chen

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Nan Zhu
The example in https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala might help Best, -- Nan Zhu http://codingcat.me On Tuesday, March 31, 2015 at 3:56 PM, Sean Owen wrote: Yep, it's not serializable:

Re: Spark-events does not exist error, while it does with all the req. rights

2015-03-31 Thread Marcelo Vanzin
Hmmm... could you try to set the log dir to file:/home/hduser/spark/spark-events? I checked the code and it might be the case that the behaviour changed between 1.2 and 1.3... On Mon, Mar 30, 2015 at 6:44 PM, Tom Hubregtsen thubregt...@gmail.com wrote: The stack trace for the first scenario and

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
Thanks for the reply. This will reduce the shuffle write to disk to an extent but for a long running job(multiple days), the shuffle write would still occupy a lot of space on disk. Why do we need to store the data from older map tasks to memory? On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
The Spark Sort-Based Shuffle (default from 1.1) keeps the data from each map task in memory until it can't fit, after which it is sorted and spilled to disk. You can reduce the shuffle write to disk by increasing spark.shuffle.memoryFraction (default 0.2). By writing the shuffle output
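
A sketch of raising that fraction when building the SparkConf (0.4 below is only an example; it trades memory away from storage):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-tuning-example")       // hypothetical app name
      .set("spark.shuffle.memoryFraction", "0.4") // default is 0.2
    val sc = new SparkContext(conf)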

Using 'fair' scheduler mode

2015-03-31 Thread asadrao
Hi, I am using the Spark ‘fair’ scheduler mode. I have noticed that if the first query is a very expensive query (ex: ‘select *’ on a really big data set) then any subsequent query seems to get blocked. I would have expected the second query to run in parallel since I am using the ‘fair’ scheduler

Did anybody run Spark-perf on powerpc?

2015-03-31 Thread Tom
We verified it runs on x86, and are now trying to run it on PowerPC. We currently run into dependency trouble with sbt. I tried installing sbt by hand and resolving all dependencies by hand, but must have made an error, as I still get errors. Original error: Getting org.scala-sbt sbt 0.13.6 ...

Re: Query REST web service with Spark?

2015-03-31 Thread Burak Yavuz
Hi, If I recall correctly, I've read people integrating REST calls to Spark Streaming jobs in the user list. I don't imagine any cases for why it shouldn't be possible. Best, Burak On Tue, Mar 31, 2015 at 1:46 PM, Minnow Noir minnown...@gmail.com wrote: We have have some data on Hadoop that
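
A minimal sketch of the kind of REST lookup being discussed, issued per partition so connections are not created per record (the endpoint, the missingKeys RDD, and the response handling are all hypothetical):

    import scala.io.Source

    val enriched = missingKeys.mapPartitions { keys =>
      keys.map { key =>
        // call a hypothetical REST endpoint that returns the correction for one key
        val body = Source.fromURL(s"http://rest-service.example.com/lookup/$key").mkString
        (key, body)
      }
    }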

Ambiguous references to a field set in a partitioned table AND the data

2015-03-31 Thread Nicolas Fouché
Hi, I save Parquet files in a partitioned table, so in /path/to/table/myfield=a/ . But I also kept the field myfield in the Parquet data. Thus, when I query the field, I get this error: df.select(myfield).show(10) Exception in thread main org.apache.spark.sql.AnalysisException: Ambiguous

pyspark error with zip

2015-03-31 Thread Charles Hayden
The following program fails in the zip step. x = sc.parallelize([1, 2, 3, 1, 2, 3]) y = sc.parallelize([1, 2, 3]) z = x.distinct() print x.zip(y).collect() The error that is produced depends on whether multiple partitions have been specified or not. I understand that the two RDDs [must]

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread Xiangrui Meng
Hey Guoqiang and Sendong, Could you comment on the overhead of calling gc() explicitly? The shuffle files should get cleaned in a few seconds after checkpointing, but it is certainly possible to accumulate TBs of files in a few seconds. In this case, calling gc() may work the same as waiting for

joining multiple parquet files

2015-03-31 Thread roni
Hi, I have 4 parquet files and I want to find data which is common in all of them, e.g. SELECT TableA.*, TableB.*, TableC.*, TableD.* FROM (TableB INNER JOIN TableA ON TableB.aID = TableA.aID) INNER JOIN TableC ON (TableB.cID = TableC.cID) INNER JOIN TableA ta ON (ta.dID = TableD.dID)
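
One way to express that in Spark SQL, sketched below (the paths and join keys are placeholders and should be adjusted to the real schema):

    val a = sqlContext.parquetFile("/data/tableA.parquet"); a.registerTempTable("TableA")
    val b = sqlContext.parquetFile("/data/tableB.parquet"); b.registerTempTable("TableB")
    val c = sqlContext.parquetFile("/data/tableC.parquet"); c.registerTempTable("TableC")
    val d = sqlContext.parquetFile("/data/tableD.parquet"); d.registerTempTable("TableD")

    val common = sqlContext.sql("""
      SELECT a.*, b.*, c.*, d.*
      FROM TableA a
      JOIN TableB b ON a.aID = b.aID
      JOIN TableC c ON b.cID = c.cID
      JOIN TableD d ON a.dID = d.dID""")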

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-03-31 Thread Xiangrui Meng
I cannot reproduce this error on master, but I'm not aware of any recent bug fixes that are related. Could you build and try the current master? -Xiangrui On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, DataFrame with an user defined type (here mllib.Vector)

Re: can't union two rdds

2015-03-31 Thread roy
use zip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-union-two-rdds-tp22320p22321.html

Spark sql query fails with executor lost/ out of memory expection while caching a table

2015-03-31 Thread ankurjain.nitrr
Hi, I am using spark 1.2.1 and the thrift server to query my data. While executing the query CACHE TABLE tablename, it fails with the exception Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage

deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Ankur Chauhan
Hi, I am fairly new to the spark ecosystem and I have been trying to setup a spark on mesos deployment. I can't seem to figure out the best practices around HDFS and Tachyon. The documentation about Spark's data-locality section seems to point that

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Haoyuan Li
Tachyon should be co-located with Spark in this case. Best, Haoyuan On Tue, Mar 31, 2015 at 4:30 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, I am fairly new to the spark ecosystem and I have been trying to setup a spark on mesos

Re: deployment of spark on mesos and data locality in tachyon/hdfs

2015-03-31 Thread Ankur Chauhan
Hi Haoyuan, So on each mesos slave node I should allocate/section off some amount of memory for tachyon (let's say 50% of the total memory) and the rest for regular mesos tasks? This means, on each slave node I would have tachyon worker (+ hdfs

Re: Query REST web service with Spark?

2015-03-31 Thread Todd Nist
Here are a few ways to achieve what you're looking to do: https://github.com/cjnolet/spark-jetty-server Spark Job Server - https://github.com/spark-jobserver/spark-jobserver - defines a REST API for Spark Hue -

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-31 Thread Xiangrui Meng
Hey Sean, That is true for the explicit model, but not for the implicit one. The ALS-WR paper doesn't cover the implicit model. In the implicit formulation, a sub-problem (for v_j) is: min_{v_j} \sum_i c_{ij} (p_{ij} - u_i^T v_j)^2 + \lambda * X * \|v_j\|_2^2 This is a sum over all i, not just the users who rate

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
Hi Udit, The persisted RDDs in memory are cleared by Spark using an LRU policy, and you can also set the time to clear the persisted RDDs and metadata by setting spark.cleaner.ttl (default infinite). But I am not aware of any properties to clean the older shuffle write from disk. thanks,

Re: SparkSql - java.util.NoSuchElementException: key not found: node when access JSON Array

2015-03-31 Thread Todd Nist
So in looking at this a bit more, I gather the root cause is the fact that the nested fields are represented as rows within rows, is that correct? If I don't know the size of the json array (it varies), using x.getAs[Row](0).getString(0) is not really a valid solution. Is the solution to apply a

--driver-memory parameter doesn't work for spark-submmit on yarn?

2015-03-31 Thread Shuai Zheng
Hi All, Below is my shell script: /home/hadoop/spark/bin/spark-submit --driver-memory=5G --executor-memory=40G --master yarn-client --class com.***.FinancialEngineExecutor /home/hadoop/lib/my.jar s3://bucket/vriscBatchConf.properties My driver will load some resources and then

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Ted Yu
Jeetendra: Please extract the information you need from Result and return the extracted portion - instead of returning Result itself. Cheers On Tue, Mar 31, 2015 at 1:14 PM, Nan Zhu zhunanmcg...@gmail.com wrote: The example in
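
A rough sketch of what Ted is describing: pull the needed cells out of the HBase Result inside the function and return plain, serializable values (the hbaseRdd name, column family, and qualifier are hypothetical):

    import org.apache.hadoop.hbase.util.Bytes

    // map each (rowkey, Result) pair from newAPIHadoopRDD to a serializable
    // (String, String) pair instead of carrying the non-serializable Result
    val companyData = hbaseRdd.map { case (rowKey, result) =>
      val name = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
      (Bytes.toString(rowKey.get()), name)
    }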

Query REST web service with Spark?

2015-03-31 Thread Minnow Noir
We have some data on Hadoop that needs to be augmented with data only available to us via a REST service. We're using Spark to search for, and correct, missing data. Even though there are a lot of records to scour for missing data, the total number of calls to the service is expected to be low, so

spark.sql.Row manipulation

2015-03-31 Thread roni
I have 2 parquet files with format e.g. name, age, town. I read them and then join them to get all the names which are in both towns. The resultant dataset is res4: Array[org.apache.spark.sql.Row] = Array([name1, age1, town1, name2, age2, town2]) Name1 and name2 are the same as I am joining

Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Anny Chen
Thanks Petar and Akhil for the suggestion. Actually I changed the SPARK_MASTER_IP to the internal-ip, deleted the export SPARK_PUBLIC_DNS=xx line in the spark-env.sh and also edited the /etc/hosts as Akhil suggested, and now it is working! However I don't know which change actually makes it