Re: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

2015-01-19 Thread Akhil Das
Try this piece of code: System.setProperty("AWS_ACCESS_KEY_ID", "access_key") System.setProperty("AWS_SECRET_KEY", "secret") val streamName = "mystream" val endpointUrl = "https://kinesis.us-east-1.amazonaws.com/"; val kinesisClient = new AmazonKinesisClient(new DefaultAWSCred
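A sketch of the same idea with the credentials passed explicitly (placeholder key values), which sidesteps the provider chain entirely; for reference, the default chain reads the aws.accessKeyId/aws.secretKey system properties or the AWS_ACCESS_KEY_ID/AWS_SECRET_KEY environment variables:

    import com.amazonaws.auth.BasicAWSCredentials
    import com.amazonaws.services.kinesis.AmazonKinesisClient

    // Placeholder keys; read them from a secure location in practice.
    val credentials = new BasicAWSCredentials("access_key", "secret_key")
    val kinesisClient = new AmazonKinesisClient(credentials)
    kinesisClient.setEndpoint("https://kinesis.us-east-1.amazonaws.com")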

com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

2015-01-19 Thread Hafiz Mujadid
Hi all! I am trying to use Kinesis and Spark Streaming together. When I execute the program I get the exception com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain. Here is my piece of code: val credentials = new BasicAWSCredentials(KinesisProperties.AWS_

Re: Bind Exception

2015-01-19 Thread Prashant Sharma
Deep, yes, you have another Spark shell or application sticking around somewhere. Try to inspect the running processes, look out for the java process, and kill it. This might be helpful: https://www.digitalocean.com/community/tutorials/how-to-use-ps-kill-and-nice-to-manage-processes-in-linux Also, That

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin (Sangwoo) Kim
If there are not too many keys, you can do something like this: val data = List( ("A", Set(1,2,3)), ("A", Set(1,2,4)), ("B", Set(1,2,3)) ) val rdd = sc.parallelize(data) rdd.persist() rdd.filter(_._1 == "A").flatMap(_._2).distinct.count rdd.filter(_._1 == "B").flatMap(_._2).distinct.count rdd.unpersist()

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread jagaximo
What I want is the unique count for each key, so plain map() or countByKey() won't give the unique count (duplicate strings would be counted)... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-include-la

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin (Sangwoo) Kim
In your code you're doing a combination of large sets, like (set1 ++ set2).size, which is not a good idea. (rdd1 ++ rdd2).distinct is an equivalent implementation and will compute in a distributed manner. I'm not very sure whether your computation on keyed sets can be transformed into RDDs. Regards, Kev
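One alternative sketch (not necessarily what the thread settled on): flatten to (key, element) pairs and deduplicate those in a distributed way instead of materializing one huge Set per key. The sample data below is illustrative:

    val pairs = sc.parallelize(Seq(
      ("A", Set("x", "y", "z")),
      ("A", Set("x", "w")),
      ("B", Set("x", "y"))
    ))
    val uniqueCounts = pairs
      .flatMap { case (key, set) => set.iterator.map(elem => (key, elem)) }
      .distinct()            // deduplicate (key, element) pairs across the cluster
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)    // number of distinct elements per key
    uniqueCounts.collect()   // distinct counts: A -> 4, B -> 2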

Re: Finding most occurrences in a JSON Nested Array

2015-01-19 Thread Pankaj Narang
I just checked the post. Do you still need help? I think getAs[Seq[String]] should help. If you are still stuck, let me know. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21252.html Sent from t
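A sketch of that idea; the "events.json" path and the "tags" array field are hypothetical, and getAs is used by ordinal here (older Row versions may need row(0).asInstanceOf[Seq[String]] instead):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val events = sqlContext.jsonFile("events.json")
    events.registerTempTable("events")
    val tagCounts = sqlContext.sql("SELECT tags FROM events")
      .flatMap(row => row.getAs[Seq[String]](0))   // the nested array column
      .map(tag => (tag, 1L))
      .reduceByKey(_ + _)
    tagCounts.collect().sortBy(-_._2).take(5)      // most frequent values first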

Re: Bind Exception

2015-01-19 Thread Deep Pradhan
Yes, I have increased the driver memory in spark-default.conf to 2g. Still the error persists. On Tue, Jan 20, 2015 at 10:18 AM, Ted Yu wrote: > Have you seen these threads ? > > http://search-hadoop.com/m/JW1q5tMFlb > http://search-hadoop.com/m/JW1q5dabji1 > > Cheers > > On Mon, Jan 19, 2015 at

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Kevin Jung
As far as I know, the operations before calling saveAsText are transformations, so they are lazily computed. The saveAsText action then performs all transformations, and your Set[String] grows at that point. It creates a large collection if you have few keys, and this easily causes OOM when your executor m

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Pankaj Narang
Instead of counted.saveAsText("/path/to/save/dir") if you call counted.collect what happens ? If you still face the same issue please paste the stacktrace here. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-inclu

Re: Bind Exception

2015-01-19 Thread Ted Yu
Have you seen these threads ? http://search-hadoop.com/m/JW1q5tMFlb http://search-hadoop.com/m/JW1q5dabji1 Cheers On Mon, Jan 19, 2015 at 8:33 PM, Deep Pradhan wrote: > Hi Ted, > When I am running the same job with small data, I am able to run. But when > I run it with relatively bigger set of

Re: Bind Exception

2015-01-19 Thread Deep Pradhan
I closed the Spark Shell and tried but no change. Here is the error: ... 15/01/17 14:33:39 INFO AbstractConnector: Started SocketConnector@0.0.0.0:59791 15/01/17 14:33:39 INFO Server: jetty-8.y.z-SNAPSHOT 15/01/17 14:33:39 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:40

Re: Bind Exception

2015-01-19 Thread Deep Pradhan
I had the Spark Shell running through out. Is it because of that? On Tue, Jan 20, 2015 at 9:47 AM, Ted Yu wrote: > Was there another instance of Spark running on the same machine ? > > Can you pastebin the full stack trace ? > > Cheers > > On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan > wrote:

Re: Bind Exception

2015-01-19 Thread Ted Yu
Was there another instance of Spark running on the same machine ? Can you pastebin the full stack trace ? Cheers On Mon, Jan 19, 2015 at 8:11 PM, Deep Pradhan wrote: > Hi, > I am running a Spark job. I get the output correctly but when I see the > logs file I see the following: > AbstractLifeC

Bind Exception

2015-01-19 Thread Deep Pradhan
Hi, I am running a Spark job. I get the output correctly, but when I look at the log file I see the following: AbstractLifeCycle: FAILED.: java.net.BindException: Address already in use... What could be the reason for this? Thank You
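As the replies in this thread suggest, the warning usually means the web UI port (4040 by default) was already held by another driver such as an open spark-shell; Spark retries the next ports, which is why the job still completes. A sketch of pinning the UI to a known free port instead (the port number is an assumption):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SimpleApp001")
      .set("spark.ui.port", "4050")   // assumes 4050 is free; avoids colliding with another driver's UI on 4040
    val sc = new SparkContext(conf)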

How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread jagaximo
I want to compute an RDD[(String, Set[String])] where some of the Set[String] values are very large. -- val hoge: RDD[(String, Set[String])] = ... val reduced = hoge.reduceByKey(_ ++ _) //<= creates a large Set (shuffle read size 7GB) val counted = reduced.map{ case (key, strSeq) => s"$key\t${

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-19 Thread Ashish
Sean, a related question: when should one persist the RDD, after step 2 or after step 3 (nothing would happen before step 3, I assume)? On Mon, Jan 19, 2015 at 5:17 PM, Sean Owen wrote: > From the OP: > > (1) val lines = Import full dataset using sc.textFile > (2) val ABonly = Filter out all rows from "li

Re: How to output to S3 and keep the order

2015-01-19 Thread Aniket Bhatnagar
When you repartition, ordering can get lost. You would need to sort after repartitioning. Aniket On Tue, Jan 20, 2015, 7:08 AM anny9699 wrote: > Hi, > > I am using Spark on AWS and want to write the output to S3. It is a > relatively small file and I don't want them to output as multiple parts.
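A sketch of one way to do it: sort into a single output partition so the global order is preserved in the single part file. The sort key and the bucket/path below are assumptions:

    val result = sc.parallelize(Seq((3L, "c"), (1L, "a"), (2L, "b")))
    // sortBy with numPartitions = 1 yields a single, globally ordered part file.
    result
      .sortBy(_._1, ascending = true, numPartitions = 1)
      .map { case (k, v) => s"$k\t$v" }
      .saveAsTextFile("s3://my-bucket/output")   // hypothetical bucket/path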

RE: MatchError in JsonRDD.toLong

2015-01-19 Thread Wang, Daoyuan
Yes, actually that is exactly what I mean. And maybe you missed my last response: you can use the API jsonRDD(json: RDD[String], schema: StructType) to specify your schema explicitly. For numbers bigger than Long, we can use DecimalType. Thanks, Daoyuan From: Tobias Pfeiffer [mailto:t...@preferred
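A sketch of passing an explicit schema so the sampling ratio no longer matters; the field names are hypothetical, and the exact DecimalType form and import path for the type classes vary across Spark versions, so treat those parts as assumptions:

    // Spark 1.2-era imports; later versions keep the types in org.apache.spark.sql.types._
    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    val json = sc.parallelize(Seq("""{"id":"a","value":123456789012345678901234567890}"""))
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      // Assumption: unlimited-precision decimal for numbers too big for Long.
      StructField("value", DecimalType.Unlimited, nullable = true)
    ))
    val table = sqlContext.jsonRDD(json, schema)   // schema given explicitly, so no sampling-based inference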

How to output to S3 and keep the order

2015-01-19 Thread anny9699
Hi, I am using Spark on AWS and want to write the output to S3. It is a relatively small file and I don't want the output split into multiple part files. So I use result.repartition(1).saveAsTextFile("s3://...") However, as long as I am using the saveAsTextFile method, the output doesn't keep the original

Re: Issues with constants in Spark HiveQL queries

2015-01-19 Thread Pala M Muthaia
Yes, we tried the master branch (sometime last week) and there was no issue, but the above repro is for branch 1.2 and Hive 0.13. Isn't that the final release branch for Spark 1.2? If so, does a patch need to be created or back-ported from master? (Yes the obvious typo in the column name was introduce

Re: MatchError in JsonRDD.toLong

2015-01-19 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 6:14 PM, Wang, Daoyuan wrote: > The second parameter of jsonRDD is the sampling ratio when we infer schema. > OK, I was aware of this, but I guess I understand the problem now. My sampling ratio is so low that I only see the Long values of data items and infer it's a

Re: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

2015-01-19 Thread Xiaoyu Wang
spark.sql.parquet.filterPushdown=true has been turned on. But with spark.sql.hive.convertMetastoreParquet set to false, the first setting has no effect! 2015-01-20 6:52 GMT+08:00 Yana Kadiyska : > If you're talking about filter pushdowns for parquet files this also has > to be tur

Re: If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?

2015-01-19 Thread Tobias Pfeiffer
Hi, On Sat, Jan 17, 2015 at 3:37 AM, Peng Cheng wrote: > I'm talking about RDD1 (not persisted or checkpointed) in this situation: > > ...(somewhere) -> RDD1 -> RDD2 > ...

Re: How to get the master URL at runtime inside driver program?

2015-01-19 Thread Tobias Pfeiffer
Hi, On Sun, Jan 18, 2015 at 11:08 AM, guxiaobo1982 wrote: > > Driver programs submitted by the spark-submit script will get the runtime > spark master URL, but how it get the URL inside the main method when > creating the SparkConf object? > The master will be stored in the spark.master property
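A minimal sketch of that: when the app is launched with spark-submit --master ..., the value ends up in the spark.master property of the SparkConf, so it can simply be read back at runtime:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("my-app")   // spark.master is filled in by spark-submit
    val sc = new SparkContext(conf)
    val masterUrl = sc.getConf.get("spark.master")    // e.g. "spark://host:7077"
    println(s"Running against master: $masterUrl")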

Re: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

2015-01-19 Thread Yana Kadiyska
If you're talking about filter pushdowns for parquet files this also has to be turned on explicitly. Try spark.sql.parquet.filterPushdown=true. It's off by default On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang wrote: > Yes it works! > But the filter can't pushdown!!! > > If custom parquetin
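A sketch of setting both options discussed in this thread from code (they can equally go into spark-defaults.conf or --conf flags); the table name is hypothetical:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // Parquet filter pushdown is off by default in this era of Spark.
    hiveContext.setConf("spark.sql.parquet.filterPushdown", "true")
    // Keep the metastore Parquet conversion on so the native Parquet path (and its pushdown) is used.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
    val filtered = hiveContext.sql("SELECT * FROM my_parquet_table WHERE id = 42")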

Aggregations based on sort order

2015-01-19 Thread justin.uang
Hi, I am trying to aggregate by key based on a timestamp, and I believe that spilling to disk is changing the order of the data fed into the combiner. I have some timeseries data of the form: ("key", "date", "other data") Partition 1 ("A", 2, ...) ("B", 4, ...) ("A", 1,
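Spark doesn't guarantee the order in which values reach a combiner, so one hedged workaround (not taken from this thread) is to group first and sort each key's values by timestamp before aggregating; this assumes each key's group fits in memory:

    val events = sc.parallelize(Seq(
      ("A", (2, "a2")), ("B", (4, "b4")), ("A", (1, "a1"))
    ))
    // Sort each key's values by the timestamp field after grouping, so downstream
    // aggregation sees them in time order regardless of spill/merge order.
    val ordered = events.groupByKey().mapValues(_.toSeq.sortBy(_._1))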

Re: ERROR TaskSchedulerImpl: Lost an executor

2015-01-19 Thread Suresh Lankipalli
Hi Yu, I am able to run the Spark examples, but I am unable to run the SparkR examples (only the Pi example runs on SparkR). Thank you Regards Suresh On Mon, Jan 19, 2015 at 3:08 PM, Ted Yu wrote: > Have you seen this thread ? > http://search-hadoop.com/m/JW1q5PgA7X > > What Spark release are you runn

Re: ERROR TaskSchedulerImpl: Lost an executor

2015-01-19 Thread Suresh Lankipalli
I am running Spark 1.2.0 version On Mon, Jan 19, 2015 at 3:08 PM, Ted Yu wrote: > Have you seen this thread ? > http://search-hadoop.com/m/JW1q5PgA7X > > What Spark release are you running ? > > Cheers > > On Mon, Jan 19, 2015 at 12:04 PM, suresh wrote: > >> I am trying to run SparkR shell on a

Re: ERROR TaskSchedulerImpl: Lost an executor

2015-01-19 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/JW1q5PgA7X What Spark release are you running ? Cheers On Mon, Jan 19, 2015 at 12:04 PM, suresh wrote: > I am trying to run SparkR shell on aws > > I am unable to access worker nodes webUI access. > > 15/01/19 19:57:17 ERROR TaskSchedulerI

Re: ERROR TaskSchedulerImpl: Lost an executor

2015-01-19 Thread suresh
I am trying to run SparkR shell on aws I am unable to access worker nodes webUI access. 15/01/19 19:57:17 ERROR TaskSchedulerImpl: Lost an executor 0 (already removed): remote Akka client disassociated 15/01/19 19:57:17 ERROR TaskSchedulerImpl: Lost an executor 1 (already removed): remote Akka cl

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-19 Thread Jerry Lam
Hi guys, Does this issue affect 1.2.0 only or all previous releases as well? Best Regards, Jerry On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao wrote: > > Yes, the problem is, I've turned the flag on. > > One possible reason for this is, the parquet file supports "predicate > pushdown" by settin

Re: EC2 VPC script

2015-01-19 Thread Vladimir Grigor
I also found this issue. I have reported it as a bug https://issues.apache.org/jira/browse/SPARK-5242 and submitted a fix. You can find a link to the fixed fork in the comments on the issue page. Please vote on the issue; hopefully that will get the pull request accepted faster :) Regards, Vladimir On Mon,

Error for first run from iPython Notebook

2015-01-19 Thread Dave
Hi, I've set up my first Spark cluster (1 master, 2 workers) and an iPython notebook server that I'm trying to configure to access the cluster. I'm running the workers from Anaconda to make sure the python setup is correct on each box. The iPy notebook server appears to have everything set up correctly,

Re: How to get the master URL at runtime inside driver program?

2015-01-19 Thread Raghavendra Pandey
If you pass spark master URL to spark-submit, you don't need to pass the same to SparkConf object. You can create SparkConf without this property or for that matter any other property that you pass in spark-submit. On Sun, Jan 18, 2015 at 7:38 AM, guxiaobo1982 wrote: > Hi, > > Driver programs su

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-19 Thread Sean Owen
Keep in mind that your executors will be able to run some fixed number of tasks in parallel, given your configuration. You should not necessarily expect that arbitrarily many RDDs and tasks would schedule simultaneously. On Mon, Jan 19, 2015 at 5:34 PM, critikaled wrote: > Hi, john and david > I

Re: unit tests with "java.io.IOException: Could not create FileClient"

2015-01-19 Thread Ted Yu
Your classpath has some MapR jar. Is that intentional ? Cheers On Mon, Jan 19, 2015 at 6:58 AM, Jianguo Li wrote: > Hi, > > I created some unit tests to test some of the functions in my project > which use Spark. However, when I used the sbt tool to build it and then ran > the "sbt test", I ra

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-19 Thread critikaled
Hi John and David, I tried this to run them concurrently: List(RDD1,RDD2,.).par.foreach{ rdd=> rdd.collect().foreach(println) } This was able to register the jobs successfully, but the parallelism of the stages is limited: sometimes it ran 4 of them and sometimes only one, whic
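A sketch of submitting several jobs concurrently from one SparkContext using futures; how many actually run at once is still bounded by the free cores/executors, and the FAIR scheduler (spark.scheduler.mode=FAIR) only changes how competing jobs share them. The RDDs below are illustrative:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val rdds = Seq(sc.parallelize(1 to 100), sc.parallelize(101 to 200))
    // Each count() is its own job; submitting from separate threads lets the scheduler
    // run them concurrently, but only as far as available resources allow.
    val counts = Future.sequence(rdds.map(rdd => Future(rdd.count())))
    println(Await.result(counts, Duration.Inf))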

Re: Streaming with Java: Expected ReduceByWindow to Return JavaDStream

2015-01-19 Thread Jeff Nadler
For anyone who finds this later, looks like Jerry already took care of this here: https://issues.apache.org/jira/browse/SPARK-5315 Thanks! On Sun, Jan 18, 2015 at 10:28 PM, Shao, Saisai wrote: > Hi Jeff, > > > > From my understanding it seems more like a bug, since JavaDStreamLike is > used f

Spark SQL: Assigning several aliases to the output (several return values) of an UDF

2015-01-19 Thread mucks17
Hello, I use Hive on Spark and have an issue with assigning several aliases to the output (several return values) of a UDF. I ran into several issues and ended up with a workaround (described at the end of this message). - Is assigning several aliases to the output of a UDF not supported by Sp
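If the function is actually a UDTF (several output columns), standard HiveQL lets you alias each column either with AS (c1, c2) directly after the call or through LATERAL VIEW. A sketch via HiveContext with a hypothetical UDTF name; this is general Hive syntax, not necessarily the workaround described in the message:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // my_split_udtf is a hypothetical UDTF returning two columns.
    val result = hiveContext.sql(
      """SELECT t.part1, t.part2
        |FROM src
        |LATERAL VIEW my_split_udtf(line) t AS part1, part2""".stripMargin)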

[SQL] Using HashPartitioner to distribute by column

2015-01-19 Thread Mick Davies
Is it possible to use a HashPartitioner or something similar to distribute a SchemaRDD's data by the hash of a particular column or set of columns? Having done this, I would then hope that GROUP BY could avoid a shuffle. E.g. set up a HashPartitioner on the CustomerCode field so that SELECT CustomerCode, SU
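A hedged sketch at the plain RDD level (the table name and column positions are hypothetical): key the rows by the column and hash-partition them; a subsequent reduceByKey with the same partitioner then avoids another shuffle. Whether Spark SQL's own GROUP BY can exploit such a layout is a separate question.

    import org.apache.spark.HashPartitioner

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val byCustomer = sqlContext.sql("SELECT CustomerCode, Value FROM sales")   // hypothetical registered table
      .map(row => (row.getString(0), row.getDouble(1)))
      .partitionBy(new HashPartitioner(200))    // co-locate each CustomerCode in one partition
    // reduceByKey reuses the existing partitioner, so no further shuffle is needed here.
    val sums = byCustomer.reduceByKey(_ + _)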

Re: If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?

2015-01-19 Thread Xuefeng Wu
I think it's always computed twice; could you provide a demo case where RDD1 is calculated only once? On Sat, Jan 17, 2015 at 2:37 AM, Peng Cheng wrote: > I'm talking about RDD1 (not persisted or checkpointed) in this situation: > > ...(somewhere) -> RDD1 -> RDD2 >
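A sketch of the usual way to make sure the shared parent is computed only once when one action touches both branches; the transform and union below are illustrative:

    def expensiveTransform(i: Int): Int = i * 2   // stand-in for real work

    val rdd1 = sc.parallelize(1 to 1000000).map(expensiveTransform).cache()  // materialized once, reused
    val rdd2 = rdd1.filter(_ % 3 == 0)
    val rdd3 = rdd1.filter(_ % 5 == 0)
    // Without the cache()/persist(), this single action would recompute rdd1's lineage
    // for both branches, i.e. twice.
    val total = (rdd2 union rdd3).count()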

unit tests with "java.io.IOException: Could not create FileClient"

2015-01-19 Thread Jianguo Li
Hi, I created some unit tests to test some of the functions in my project which use Spark. However, when I used the sbt tool to build it and then ran the "sbt test", I ran into "java.io.IOException: Could not create FileClient": 2015-01-19 08:50:38,1894 ERROR Client fs/client/fileclient/cc/client

Spark 1.20 resource issue with Mesos .21.1

2015-01-19 Thread Brian Belgodere
1:44.118728 19317 sched.cpp:234] New master detected at master@192.0.3.12:5050 I0119 02:41:44.119282 19317 sched.cpp:242] No credentials provided. Attempting to register without authentication I0119 02:41:44.123064 19317 sched.cpp:408] Framework registered with 20150119-003609-201523392-5050-7198-000

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-19 Thread Ted Yu
Please see this thread: http://search-hadoop.com/m/LgpTk2aVYgr/Hadoop+guava+upgrade&subj=Re+Time+to+address+the+Guava+version+problem > On Jan 19, 2015, at 6:03 AM, Romi Kuntsman wrote: > > I have recently encountered a similar problem with Guava version collision > with Hadoop. > > Isn't it

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-19 Thread Romi Kuntsman
Actually there is already someone on Hadoop-Common-Dev taking care of removing the old Guava dependency http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201501.mbox/browser https://issues.apache.org/jira/browse/HADOOP-11470 *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com O

Re: Guava 11 dependency issue in Spark 1.2.0

2015-01-19 Thread Romi Kuntsman
I have recently encountered a similar problem with Guava version collision with Hadoop. Isn't it more correct to upgrade Hadoop to use the latest Guava? Why are they staying in version 11, does anyone know? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Wed, Jan 7, 2015 at 7:59

RE: Spark Streaming with Kafka

2015-01-19 Thread Shao, Saisai
Hi, Could you please describe the symptoms you observed a little more specifically, e.g. are there any exceptions from Kafka or Zookeeper, and what other exceptions did you see on the Spark side? With only this short description it is hard to find any clue. Thanks Jerry From: Eduardo Alfai

Suggestion needed for implementing DROOLS in spark

2015-01-19 Thread vishnu86
For our use case we need to use Drools, but there are no guides or tutorials. How can we implement Drools in Spark? If someone has an idea of how to set this up, please guide me. Thanks in advance. Regards, Vishnu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co
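There is no built-in integration; one common pattern is to build a rules session once per partition inside mapPartitions. The sketch below assumes the Drools 6 KIE API, a hypothetical kmodule session name "spark-rules", a hypothetical fact type, and the rules jar being on the executors' classpath:

    import org.kie.api.KieServices

    case class Event(id: Long, value: Double)   // hypothetical fact type referenced by the rules

    val events = sc.parallelize(Seq(Event(1L, 2.0), Event(2L, 3.5)))
    val results = events.mapPartitions { iter =>
      // Rules sessions are not serializable, so create one per partition on the executor.
      val kContainer = KieServices.Factory.get().getKieClasspathContainer()
      val session = kContainer.newKieSession("spark-rules")
      val out = iter.map { event =>
        session.insert(event)
        session.fireAllRules()
        event
      }.toList                                  // force evaluation before disposing the session
      session.dispose()
      out.iterator
    }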

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-19 Thread Sean Owen
From the OP: (1) val lines = Import full dataset using sc.textFile (2) val ABonly = Filter out all rows from "lines" that are not of type A or B (3) val processA = Process only the A rows from ABonly (4) val processB = Process only the B rows from ABonly I assume that 3 and 4 are actions, or els

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-19 Thread critikaled
+1, I too need to know. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-automatically-run-different-stages-concurrently-when-possible-tp21075p21233.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

is there documentation on spark sql catalyst?

2015-01-19 Thread critikaled
Where can I find good documentation on the Spark SQL Catalyst optimizer? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/is-there-documentation-on-spark-sql-catalyst-tp21232.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

possible memory leak when re-creating SparkContext's in the same JVM

2015-01-19 Thread Noam Barcay
The problem we're seeing is a gradual memory leak in the driver's JVM. We're executing jobs using a long-running Java app which creates relatively short-lived SparkContexts, so our Spark drivers are created as part of this application's JVM. We're using standalone cluster mode, Spark 1.0.2. Root cause of t

Re: UnknownhostException : home

2015-01-19 Thread Rapelly Kartheek
Yeah... I made that mistake in spark/conf/spark-defaults.conf when setting spark.eventLog.dir. Now it works. Thank you Karthik On Mon, Jan 19, 2015 at 3:29 PM, Sean Owen wrote: > Sorry, to be clear, you need to write "hdfs:///home/...". Note three > slashes; there is an empty host betw

Re: ALS.trainImplicit running out of mem when using higher rank

2015-01-19 Thread Sean Owen
The problem is clearly to do with the executor exceeding YARN allocations, so, this can't be in local mode. He said this was running on YARN at the outset. On Mon, Jan 19, 2015 at 2:27 AM, Raghavendra Pandey wrote: > If you are running spark in local mode, executor parameters are not used as > th

Re: Join DStream With Other Datasets

2015-01-19 Thread Sean Owen
I don't think this has anything to do with transferring anything from the driver, or per task. I'm talking about a singleton object in the JVM that loads whatever you want from wherever you want and holds it in memory once per JVM. That is, I do not think you have to use broadcast, or even any Spar
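A minimal sketch of that singleton pattern, assuming an existing SparkContext sc, a hypothetical reference file available on every executor, and a socket stream as the DStream source:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Loaded lazily, once per executor JVM that references it; no broadcast required.
    object ReferenceData {
      lazy val lookup: Map[String, String] =
        scala.io.Source.fromFile("/data/reference.tsv").getLines()   // hypothetical path on each executor
          .map(_.split("\t"))
          .collect { case Array(k, v) => k -> v }
          .toMap
    }

    val ssc = new StreamingContext(sc, Seconds(10))
    val ids = ssc.socketTextStream("localhost", 9999)                // hypothetical source stream of ids
    val enriched = ids.map(id => (id, ReferenceData.lookup.getOrElse(id, "unknown")))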

Re: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-19 Thread BB
I am quoting the reply I got on this - which for some reason did not get posted here. The suggestion in the reply below worked perfectly for me. The error mentioned in the reply is not related (or old). Hope this is helpful to someone. Cheers, BB > Hi, BB >Ideally you can do the query like:

Re: UnknownhostException : home

2015-01-19 Thread Sean Owen
Sorry, to be clear, you need to write "hdfs:///home/...". Note three slashes; there is an empty host between the 2nd and 3rd. This is true of most URI schemes with a host. On Mon, Jan 19, 2015 at 9:56 AM, Rapelly Kartheek wrote: > Yes yes.. hadoop/etc/hadoop/hdfs-site.xml file has the path like:
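A short illustration of the difference, with a hypothetical path:

    // "hdfs://home/user/data" treats "home" as a NameNode hostname -> UnknownHostException.
    // Three slashes leave the authority empty and fall back to the configured default filesystem:
    val data = sc.textFile("hdfs:///home/user/data")
    // or name the NameNode explicitly:
    val data2 = sc.textFile("hdfs://namenode-host:8020/home/user/data")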

Re: UnknownhostException : home

2015-01-19 Thread Ashish
+1 to Sean's suggestion On Mon, Jan 19, 2015 at 3:21 PM, Sean Owen wrote: > I bet somewhere you have a path like "hdfs://home/..." which would > suggest that 'home' is a hostname, when I imagine you mean it as a > root directory. > > On Mon, Jan 19, 2015 at 9:33 AM, Rapelly Kartheek > wrote: >> Hi

Re: UnknownhostException : home

2015-01-19 Thread Rapelly Kartheek
Yes yes.. hadoop/etc/hadoop/hdfs-site.xml file has the path like: "hdfs://home/..." On Mon, Jan 19, 2015 at 3:21 PM, Sean Owen wrote: > I bet somewhere you have a path like "hdfs://home/..." which would > suggest that 'home' is a hostname, when I imagine you mean it as a > root directory. > > On

Re: UnknownhostException : home

2015-01-19 Thread Sean Owen
I bet somewhere you have a path like "hdfs://home/..." which would suggest that 'home' is a hostname, when I imagine you mean it as a root directory. On Mon, Jan 19, 2015 at 9:33 AM, Rapelly Kartheek wrote: > Hi, > > I get the following exception when I run my application: > > karthik@karthik:~/s

Re: UnknownhostException : home

2015-01-19 Thread Rapelly Kartheek
Actually, I don't have any entry in my /etc/hosts file with the hostname "home". In fact, I didn't use this hostname anywhere. Then why is it trying to resolve it? On Mon, Jan 19, 2015 at 3:15 PM, Ashish wrote: > it's not able to resolve home to an IP. > Assuming it's your local machine,

Re: UnknownhostException : home

2015-01-19 Thread Ashish
It's not able to resolve "home" to an IP. Assuming it's your local machine, add an entry like the following in your /etc/hosts file and then run the program again (use sudo to edit the file): 127.0.0.1 home On Mon, Jan 19, 2015 at 3:03 PM, Rapelly Kartheek wrote: > Hi, > > I get the following exception when I ru

Fwd: UnknownhostException : home

2015-01-19 Thread Kartheek.R
-- Forwarded message -- From: Rapelly Kartheek Date: Mon, Jan 19, 2015 at 3:03 PM Subject: UnknownhostException : home To: "user@spark.apache.org" Hi, I get the following exception when I run my application: karthik@karthik:~/spark-1.2.0$ ./bin/spark-submit --class org.apache.

Re: Determine number of running executors

2015-01-19 Thread Tobias Pfeiffer
Hi, On Sat, Jan 17, 2015 at 3:05 AM, Shuai Zheng wrote: > > Can you share more information about how do you do that? I also have > similar question about this. > Not very proud of it ;-), but here you go: // find the number of workers available to us. val _runCmd = scala.util.Properties.prop
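Another hedged option on Spark 1.x, instead of parsing the run command: the executor storage status array has one entry per executor plus one for the driver.

    // Developer API on Spark 1.x; the array includes the driver, hence the -1.
    val numExecutors = sc.getExecutorStorageStatus.length - 1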

UnknownhostException : home

2015-01-19 Thread Rapelly Kartheek
Hi, I get the following exception when I run my application: karthik@karthik:~/spark-1.2.0$ ./bin/spark-submit --class org.apache.spark.examples.SimpleApp001 --deploy-mode client --master spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar >out1.txt log4j:WARN No such propert

Re: Spark SQL & Parquet - data are reading very very slow

2015-01-19 Thread Mick Davies
Added a JIRA to track https://issues.apache.org/jira/browse/SPARK-5309 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Parquet-data-are-reading-very-very-slow-tp21061p21229.html Sent from the Apache Spark User List mailing list archive at Nabble.co

Is there any way to support multiple users executing SQL on thrift server?

2015-01-19 Thread Yi Tian
Is there any way to support multiple users executing SQL on one thrift server? I think there are some problems for spark 1.2.0, for example: 1. Start thrift server with user A 2. Connect to thrift server via beeline with user B 3. Execute "insert into table dest select … from table src" then w

Need some help to create user defined type for ML pipeline

2015-01-19 Thread Jaonary Rabarisoa
Hi all, I'm trying to implement a pipeline for computer vision based on the latest ML package in Spark. The first step of my pipeline is to decode images (jpeg, for instance) stored in a parquet file. For this, I began by creating a UserDefinedType that represents a decoded image stored in an array of

Re: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

2015-01-19 Thread Xiaoyu Wang
Yes it works! But the filter can't be pushed down!!! Should a custom ParquetInputFormat implement only the data source API? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala 2015-01-16 21:51 GMT+08:00 Xiaoyu Wang : > Thanks yana! > I will try i

Re: Newbie Question on How Tasks are Executed

2015-01-19 Thread davidkl
Hello Mixtou, if you want to look at partition ID, I believe you want to use mapPartitionsWithIndex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Newbie-Question-on-How-Tasks-are-Executed-tp21064p21228.html Sent from the Apache Spark User List mailing li
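A small sketch of what that looks like:

    val data = sc.parallelize(1 to 10, numSlices = 3)
    // Tag every element with the ID of the partition it lives in.
    val withPartitionId = data.mapPartitionsWithIndex { (partitionId, iter) =>
      iter.map(x => (partitionId, x))
    }
    withPartitionId.collect().foreach(println)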

Re: Does Spark automatically run different stages concurrently when possible?

2015-01-19 Thread davidkl
Hi Jon, I am looking for an answer to a similar question in the docs now; so far no clue. I would need to know what Spark's behaviour is in a situation like the example you provided, but also taking into account that there are multiple partitions/workers. I could imagine it's possible that differen