Re: Spark-Ec2 launch failed on starting httpd spark 141

2015-08-25 Thread Ted Yu
Looks like it is this PR: https://github.com/mesos/spark-ec2/pull/133 On Tue, Aug 25, 2015 at 9:52 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Yeah that's a known issue and we have a PR out to fix it. Shivaram On Tue, Aug 25, 2015 at 7:39 AM, Garry Chen g...@cornell.edu

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
However I do think it's easier than it seems to write the implicits; it doesn't involve new classes or anything. Yes it's pretty much just what you wrote. There is a class Vector in Spark. This declaration can be in an object; you don't implement your own class. (Also you can use toBreeze to
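
A minimal sketch of the implicits being discussed (assuming Spark 1.4 MLlib with Breeze on the classpath; the object name and the sparsity handling in fromBreeze are illustrative):

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
    import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

    object VectorOps {
      // convert an MLlib vector to Breeze, preserving sparsity
      implicit def toBreeze(v: Vector): BV[Double] = v match {
        case dv: DenseVector  => new BDV[Double](dv.values)
        case sv: SparseVector => new BSV[Double](sv.indices, sv.values, sv.size)
      }
      // convert back (a fuller version would preserve sparsity here too)
      implicit def fromBreeze(bv: BV[Double]): Vector = Vectors.dense(bv.toArray)
    }

    import VectorOps._
    val v1 = Vectors.dense(1.0, 2.0)
    val v2 = Vectors.dense(3.0, 4.0)
    val sum: Vector = fromBreeze(toBreeze(v1) + toBreeze(v2))

Note that Scala will not chain two implicit conversions in one expression (a point raised later in this thread), so v1 + v2 still needs the explicit calls or an enriching wrapper class.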

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Utkarsh Sengar
This worked for me locally: spark-1.4.1-bin-hadoop2.4/bin/spark-submit --conf spark.executor.extraClassPath=/.m2/repository/ch/qos/logback/logback-core/1.1.2/logback-core-1.1.2.jar:/.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar --conf

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Marcelo Vanzin
On Tue, Aug 25, 2015 at 10:48 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Now I am going to try it out on our mesos cluster. I assumed spark.executor.extraClassPath takes a comma-separated list the way --jars does, but it should be ':'-separated like a regular classpath. Ah, yes, those options
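
The same setting expressed programmatically, as a sketch (jar paths are placeholders):

    import org.apache.spark.SparkConf

    // --jars takes a comma-separated list, but extraClassPath is a
    // ':'-separated JVM classpath on the executors
    val conf = new SparkConf()
      .set("spark.executor.extraClassPath",
           "/opt/libs/logback-core-1.1.2.jar:/opt/libs/logback-classic-1.1.2.jar")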

SparkR: exported functions

2015-08-25 Thread Colin Gillespie
Hi, I've just started playing about with SparkR (Spark 1.4.1), and noticed that a number of the functions haven't been exported. For example, the textFile function https://github.com/apache/spark/blob/master/R/pkg/R/context.R isn't exported, i.e. the function isn't in the NAMESPACE file. This

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Steve Loughran
I wouldn't try to play with forwarding tunnelling; always hard to work out what ports get used everywhere, and the services like hostname==URL in paths. Can't you just set up an entry in the windows /etc/hosts file? It's what I do (on Unix) to talk to VMs On 25 Aug 2015, at 04:49, Dino

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
What about declaring a few simple implicit conversions between the MLlib and Breeze Vector classes? if you import them then you should be able to write a lot of the source code just as you imagine it, as if the Breeze methods were available on the Vector object in MLlib. The problem is that

CHAID Decision Trees

2015-08-25 Thread jatinpreet
Hi, I wish to know if MLlib supports CHAID regression and classification trees. If yes, how can I build them in spark? Thanks, Jatin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CHAID-Decision-Trees-tp24449.html Sent from the Apache Spark User List

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Eric Friedman
Well, this is very strange. My only change is to add -X to make-distribution and it succeeds: % git diff (spark/spark) diff --git a/make-distribution.sh b/make-distribution.sh index a2b0c43..351fac2 100755 --- a/make-distribution.sh +++

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-25 Thread Nicholas Chammas
Final chance to fill out the survey! http://goo.gl/forms/erct2s6KRR I'm gonna close it to new responses tonight and send out a summary of the results. Nick On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm planning to close the survey to further responses

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Burak Yavuz
Hmm. I have a lot of code on the local linear algebra operations using Spark's Matrix and Vector representations done for https://issues.apache.org/jira/browse/SPARK-6442. I can make a Spark package with that code if people are interested. Best, Burak On Tue, Aug 25, 2015 at 10:54 AM, Kristina

Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread suchenzang
Hello, I'm using direct spark streaming (from kafka) with checkpointing, and everything works well until a restart. When I shut down (^C) the first streaming job, wait 1 minute, then re-submit, there is somehow a series of 0 event batches that get queued (corresponding to the 1 minute when the

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
YES PLEASE! :))) On Tue, Aug 25, 2015 at 1:57 PM, Burak Yavuz brk...@gmail.com wrote: Hmm. I have a lot of code on the local linear algebra operations using Spark's Matrix and Vector representations done for https://issues.apache.org/jira/browse/SPARK-6442. I can make a Spark package

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Cody Koeninger
Does the first batch after restart contain all the messages received while the job was down? On Tue, Aug 25, 2015 at 12:53 PM, suchenzang suchenz...@gmail.com wrote: Hello, I'm using direct spark streaming (from kafka) with checkpointing, and everything works well until a restart. When I

How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Mike Trienis
Hello, I am using sbt and created a unit test where I create a `HiveContext` and execute some query and then return. Each time I run the unit test the JVM will increase its memory usage until I get the error: Internal error when running tests: java.lang.OutOfMemoryError: PermGen space Exception

DataFrame Parquet Writer doesn't keep schema

2015-08-25 Thread Petr Novak
Hi all, when I read parquet files with required fields aka nullable=false they are read correctly. Then I save them (df.write.parquet) and read again all my fields are saved and read as optional, aka nullable=true. Which means I suddenly have files with incompatible schemas. This happens on
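
One hedged workaround, assuming Spark 1.4's DataFrame API (paths are placeholders): read the files back and re-assert the original non-nullable schema.

    import org.apache.spark.sql.types.StructType

    val df = sqlContext.read.parquet("/data/events")   // fields come back nullable = true
    val strictSchema = StructType(df.schema.map(_.copy(nullable = false)))
    val restored = sqlContext.createDataFrame(df.rdd, strictSchema)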

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Sean Owen
Yes, you're right that it's quite on purpose to leave this API to Breeze, in the main. As you can see the Spark objects have already sprouted a few basic operations anyway; there's a slippery slope problem here. Why not addition, why not dot products, why not determinants, etc. What about

Re: Spark-Ec2 launch failed on starting httpd spark 141

2015-08-25 Thread Ted Yu
Corrected a typo in the subject of your email. What you cited seems to be from worker node startup. Was there any other error you saw? Please list the command you used. Cheers On Tue, Aug 25, 2015 at 7:39 AM, Garry Chen g...@cornell.edu wrote: Hi All, I am trying to lunch a

Spark RDD join with CassandraRDD

2015-08-25 Thread Priya Ch
Hi All, I have the following scenario: There exists a booking table in cassandra, which holds fields like bookingid, passengerName, contact, etc. Now in my spark streaming application, there is one class Booking which acts as a container and holds all the field details - class
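
A minimal sketch of one way to do the join with the DataStax spark-cassandra-connector (keyspace, table and field names are assumptions; incomingBookings stands for an RDD from a streaming batch):

    import com.datastax.spark.connector._

    case class Booking(bookingid: String, passengername: String, contact: String)

    // load the cassandra table as an RDD of the container class, then join by key
    val cassandraBookings = sc.cassandraTable[Booking]("travel", "booking").keyBy(_.bookingid)
    val enriched = incomingBookings.keyBy(_.bookingid).join(cassandraBookings)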

Spark-Ec2 lunch failed on starting httpd spark 141

2015-08-25 Thread Garry Chen
Hi All, I am trying to lunch a spark cluster on ec2 with spark 1.4.1 version. The script finished but I am getting an error at the end, as follows. What should I do to correct this issue? Thank you very much for your input. Starting httpd: httpd: Syntax error on line 199 of

Re: How to access Spark UI through AWS

2015-08-25 Thread Kelly, Jonathan
I'm not sure why the UI appears broken like that either and haven't investigated it myself yet, but if you instead go to the YARN ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for 3.x, I believe), then you should be able to click on the ApplicationMaster link (or the History

Spark (1.2.0) submit fails with exception saying log directory already exists

2015-08-25 Thread Varadhan, Jawahar
Here is the error: yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Log directory hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302 already exists!) I am using cloudera 5.3.2 with Spark 1.2.0. Any help is appreciated.

[SQL/Hive] Trouble with refreshTable

2015-08-25 Thread Yana Kadiyska
I'm having trouble with refreshTable, I suspect because I'm using it incorrectly. I am doing the following: 1. Create DF from parquet path with wildcards, e.g. /foo/bar/*.parquet 2. use registerTempTable to register my dataframe 3. A new file is dropped under /foo/bar/ 4. Call

Error:(46, 66) not found: type SparkFlumeProtocol

2015-08-25 Thread Muler
I'm trying to build Spark using Intellij on Windows. But I'm repeatedly getting this error spark-master\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala Error:(46, 66) not found: type SparkFlumeProtocol val transactionTimeout: Int, val

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Sean Owen
Yes I get all that too and I think there's a legit question about whether moving a little further down the slippery slope is worth it and if so how far. The other catch here is: either you completely mimic another API (in which case why not just use it directly, which has its own problems) or you

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
Well, yes, the hack below works (that's all I have time for), but it is not satisfactory: it is not safe, it is verbose and very cumbersome to use, it does not deal separately with the SparseVector case, and it is not complete either. My question is, out of hundreds of users on this list, someone must have

Re: How to effieciently write sorted neighborhood in pyspark

2015-08-25 Thread shahid qadri
Any resources on this? On Aug 25, 2015, at 3:15 PM, shahid qadri shahidashr...@icloud.com wrote: I would like to implement the sorted neighborhood approach in spark, what is the best way to write that in pyspark.

Re: CHAID Decision Trees

2015-08-25 Thread Jatinpreet Singh
Hi Feynman, Thanks for the information. Is there a way to depict a decision tree as a visualization for large amounts of data, using any other technique/library? Thanks, Jatin On Tue, Aug 25, 2015 at 11:42 PM, Feynman Liang fli...@databricks.com wrote: Nothing is in JIRA

Re: use GraphX with Spark Streaming

2015-08-25 Thread ponkin
Hi, Sure you can. StreamingContext has the property def sparkContext: SparkContext (see docs http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext ). Think of DStream, the main abstraction in Spark Streaming, as a sequence of RDDs. Each DStream can be
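
A small sketch of that idea (edgeStream and its element shape are assumptions): build a graph from each micro-batch inside foreachRDD.

    import org.apache.spark.graphx.{Edge, Graph}

    // edgeStream: DStream[(Long, Long)] of (srcId, dstId) pairs
    edgeStream.foreachRDD { rdd =>
      val graph = Graph.fromEdges(rdd.map { case (src, dst) => Edge(src, dst, 1) },
        defaultValue = 0)
      println(s"vertices in this batch: ${graph.numVertices}")
    }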

Question on take function - Spark Java API

2015-08-25 Thread Pankaj Wahane
Hi community members, Apache Spark is fantastic and very easy to learn. Awesome work!!! Question: I have multiple files in a folder, and the first line in each file is the name of the asset that the file belongs to. The second line is the csv header row, and data starts from the third row. Ex:
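
One way to handle the per-file header lines is wholeTextFiles, which keeps each file intact (a sketch; the path is a placeholder):

    // each element is (filePath, fullContent), so the two header lines can be peeled off
    val rows = sc.wholeTextFiles("/data/assets/*").flatMap { case (path, content) =>
      val lines  = content.split("\n").filter(_.nonEmpty)
      val asset  = lines(0).trim               // line 1: asset name
      val header = lines(1).split(",")         // line 2: csv header
      lines.drop(2).map(l => (asset, header.zip(l.split(",")).toMap))
    }
    rows.take(5).foreach(println)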

Re: How to increase data scale in Spark SQL Perf

2015-08-25 Thread Ted Yu
The error in #1 below was not informative. Are you able to get a more detailed error message? Thanks On Aug 25, 2015, at 6:57 PM, Todd bit1...@163.com wrote: Thanks Ted Yu. Following are the error messages: 1. The exception that is shown on the UI is: Exception in thread Thread-113

Re: Re: How to increase data scale in Spark SQL Perf

2015-08-25 Thread Todd
I think the answer is no. I only see such messages on the console... and #2 is the thread stack trace. My thinking is that Spark SQL Perf forks many dsdgen processes to generate data when the scale factor is increased, which eventually exhausts the JVM. When the thread exception is thrown on the console

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
I figured it all out after this: http://apache-spark-user-list.1001560.n3.nabble.com/WebUI-on-yarn-through-ssh-tunnel-affected-by-AmIpfilter-td21540.html The short of it is that I needed to set SPARK_PUBLIC_DNS (not DNS_HOME) = ec2_publicdns. Then the YARN proxy gets in the way, so I needed to go to:

Re: CHAID Decision Trees

2015-08-25 Thread Feynman Liang
For a single decision tree, the closest I can think of is toDebugString, which gives you a text representation of the decision thresholds and paths down the tree. I don't think there's anything in MLlib for visualizing GBTs or random forests. On Tue, Aug 25, 2015 at 9:20 PM, Jatinpreet Singh
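
For reference, a minimal sketch of that call (assuming MLlib 1.4 and a LIBSVM-format sample file; the path is a placeholder):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "/data/sample_libsvm_data.txt")
    val model = DecisionTree.trainClassifier(data, numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini",
      maxDepth = 5, maxBins = 32)
    println(model.toDebugString)   // text dump of thresholds and paths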

Re: Exception thrown when running SparkPi in IntelliJ IDEA: scala.collection.Seq not found

2015-08-25 Thread Hemant Bhanawat
Go to the module settings of the project and in the dependencies section check the scope of scala jars. It would be either Test or Provided. Change it to compile and it should work. Check the following link to understand more about scope of modules:

Re: Exception thrown when running SparkPi in IntelliJ IDEA: scala.collection.Seq not found

2015-08-25 Thread Jeff Zhang
As I remember, you also need to change the guava and jetty related dependencies to compile if you want to run SparkPi in IntelliJ. On Tue, Aug 25, 2015 at 3:15 PM, Hemant Bhanawat hemant9...@gmail.com wrote: Go to the module settings of the project and in the dependencies section check the scope of

Re: Spark stages very slow to complete

2015-08-25 Thread Olivier Girardot
I have pretty much the same symptoms - the computation itself is pretty fast, but most of the time is spent in JavaToPython steps (~15min). I'm using Spark 1.5.0-rc1 with DataFrame and ML Pipelines. Any insights into what these steps are exactly? 2015-06-02 9:18 GMT+02:00 Karlson

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Cheng, Hao
Ok, I see, thanks for the correction, but this should be optimized. From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: Tuesday, August 25, 2015 2:08 PM To: Cheng, Hao Cc: Jeff Zhang; user@spark.apache.org Subject: Re: DataFrame#show cost 2 Spark Jobs ? That's two jobs. `SparkPlan.executeTake`

Re: Re: Exception thrown when running SparkPi in IntelliJ IDEA: scala.collection.Seq not found

2015-08-25 Thread Todd
Thank you guys. Yes, I have fixed the guava, spark core, scala and jetty dependencies, and I can run Pi now. At 2015-08-25 15:28:51, Jeff Zhang zjf...@gmail.com wrote: As I remember, you also need to change the guava and jetty related dependencies to compile if you want to run SparkPi in IntelliJ.

Re: Loading already existing tables in spark shell

2015-08-25 Thread Jeetendra Gangele
In the spark shell, 'use database' is not working; it says 'use' is not found in the shell. Did you run this with the Scala shell? On 24 August 2015 at 18:26, Ishwardeep Singh ishwardeep.si...@impetus.co.in wrote: Hi Jeetendra, I faced this issue. I did not specify the database where this table exists.

Invalid environment variable name when submitting job from windows

2015-08-25 Thread Yann ROBIN
Hi, We have a spark standalone cluster running on linux. We have a job that we submit to the spark cluster on windows. When submitting this job using windows the execution failed with this error in the Notes java.lang.IllegalArgumentException: Invalid environment variable name: =::. When

Re: What does Attribute and AttributeReference mean in Spark SQL

2015-08-25 Thread Michael Armbrust
Attribute is the Catalyst name for an input column from a child operator. An AttributeReference has been resolved, meaning we know which input column in particular it is referring to. An AttributeReference also has a known DataType. In contrast, before analysis there might still exist

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Roberto Congiu
Port 8020 is not the only port you need tunnelled for HDFS to work. If you only list the contents of a directory, port 8020 is enough... for instance, using something like val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/") val fs = p.getFileSystem(sc.hadoopConfiguration) fs.listStatus(p)

Re: How to set environment of worker applications

2015-08-25 Thread Hemant Bhanawat
Ok, I went in the direction of system vars since the beginning, probably because the question was how to pass variables to a particular job. Anyway, the decision to use either system vars or environment vars would solely depend on whether you want to make them available to all the spark processes on a
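
The two options side by side, as a sketch (variable names and values are placeholders):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // an environment variable visible to every executor process
      .setExecutorEnv("MY_SERVICE_URL", "http://internal-service:8080")
      // a system property, set per JVM via extraJavaOptions
      .set("spark.driver.extraJavaOptions", "-Dmy.app.flag=true")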

Re: Loading already existing tables in spark shell

2015-08-25 Thread Ishwardeep Singh
Hi Jeetendra, Please try the following in the spark shell; it is like executing an sql command. sqlContext.sql("use <database name>") Regards, Ishwardeep From: Jeetendra Gangele gangele...@gmail.com Sent: Tuesday, August 25, 2015 12:57 PM To: Ishwardeep Singh

Re: spark not launching in yarn-cluster mode

2015-08-25 Thread Yanbo Liang
spark-shell and spark-sql cannot be deployed in yarn-cluster mode, because you need to make the spark-shell or spark-sql scripts run on your local machine rather than in a container of the YARN cluster. 2015-08-25 16:19 GMT+08:00 Jeetendra Gangele gangele...@gmail.com: Hi All, I am trying to launch the

Re: org.apache.spark.shuffle.FetchFailedException

2015-08-25 Thread kundan kumar
I have set spark.sql.shuffle.partitions=1000, but it is still failing. On Tue, Aug 25, 2015 at 11:36 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: Did you try increasing sql partitions? On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar iitr.kun...@gmail.com wrote: I am running

Re: RE: Test case for the spark sql catalyst

2015-08-25 Thread Todd
Thanks Chenghao! At 2015-08-25 13:06:40, Cheng, Hao hao.ch...@intel.com wrote: Yes, check the source code under:https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst From: Todd [mailto:bit1...@163.com] Sent: Tuesday, August 25, 2015 1:01

Exception thrown when running SparkPi in IntelliJ IDEA: scala.collection.Seq not found

2015-08-25 Thread Todd
I cloned the code from https://github.com/apache/spark to my machine. It compiles successfully, but when I run SparkPi, it throws the exception below complaining that scala.collection.Seq is not found. I have installed Scala 2.10.4 on my machine, and use the default profiles:

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Cheng, Hao
Oh, sorry, I missed reading your reply! I know the minimum number of tasks will be 2 for scanning, but Jeff is talking about 2 jobs, not 2 tasks. From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: Tuesday, August 25, 2015 1:29 PM To: Cheng, Hao Cc: Jeff Zhang; user@spark.apache.org Subject: Re:

Re: org.apache.spark.shuffle.FetchFailedException

2015-08-25 Thread Raghavendra Pandey
Did you try increasing sql partitions? On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar iitr.kun...@gmail.com wrote: I am running this query on a data size of 4 billion rows and getting org.apache.spark.shuffle.FetchFailedException error. select adid,position,userid,price from ( select

Re: Spark

2015-08-25 Thread Sonal Goyal
Sorry am I missing something? There is a method sortBy on both RDD and PairRDD. def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length
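
A quick usage example of that method:

    val rdd = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
    val byValueDesc = rdd.sortBy(_._2, ascending = false)  // sort by value, descending
    byValueDesc.collect()   // Array((c,3), (b,2), (a,1))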

Re: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Shixiong Zhu
That's two jobs. `SparkPlan.executeTake` will call `runJob` twice in this case. Best Regards, Shixiong Zhu 2015-08-25 14:01 GMT+08:00 Cheng, Hao hao.ch...@intel.com: O, Sorry, I miss reading your reply! I know the minimum tasks will be 2 for scanning, but Jeff is talking about 2 jobs, not

Re: How can I save the RDD result as Orcfile with spark1.3?

2015-08-25 Thread dong.yajun
We plan to upgrade our spark cluster to 1.4, and I just did a test in local mode following the reference here: http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ but an exception occurred when running the example; the stack trace is as below: Exception in thread main

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Cody Koeninger
Are you actually losing messages then? On Tue, Aug 25, 2015 at 1:15 PM, Susan Zhang suchenz...@gmail.com wrote: No; first batch only contains messages received after the second job starts (messages come in at a steady rate of about 400/second). On Tue, Aug 25, 2015 at 11:07 AM, Cody

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Susan Zhang
Yeah. All messages are lost while the streaming job was down. On Tue, Aug 25, 2015 at 11:37 AM, Cody Koeninger c...@koeninger.org wrote: Are you actually losing messages then? On Tue, Aug 25, 2015 at 1:15 PM, Susan Zhang suchenz...@gmail.com wrote: No; first batch only contains messages

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Cody Koeninger
Sounds like something's not set up right... can you post a minimal code example that reproduces the issue? On Tue, Aug 25, 2015 at 1:40 PM, Susan Zhang suchenz...@gmail.com wrote: Yeah. All messages are lost while the streaming job was down. On Tue, Aug 25, 2015 at 11:37 AM, Cody Koeninger

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Susan Zhang
No; first batch only contains messages received after the second job starts (messages come in at a steady rate of about 400/second). On Tue, Aug 25, 2015 at 11:07 AM, Cody Koeninger c...@koeninger.org wrote: Does the first batch after restart contain all the messages received while the job was

Re: CHAID Decision Trees

2015-08-25 Thread Feynman Liang
Nothing is in JIRA https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22CHAID%22 so AFAIK no, only random forests and GBTs using entropy or Gini for information gain are supported. On Tue, Aug 25, 2015 at 9:39 AM, jatinpreet jatinpr...@gmail.com wrote: Hi, I

Fwd: Join with multiple conditions (In reference to SPARK-7197)

2015-08-25 Thread Michal Monselise
Hello All, PySpark currently has two ways of performing a join: specifying a join condition or column names. I would like to perform a join using a list of columns that appear in both the left and right DataFrames. I have created an example in this question on Stack Overflow

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Roberto Congiu
That's what I'd suggest too. Furthermore, if you use vagrant to spin up VMs, there's a module that can do that automatically for you. R. 2015-08-25 10:11 GMT-07:00 Steve Loughran ste...@hortonworks.com: I wouldn't try to play with forwarding tunnelling; always hard to work out what ports get

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread Nick Pentreath
While it's true locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters - if your ES cluster is serving production queries (and in particular using aggregations), you'll run into performance issues on your production ES cluster. ES-hadoop uses ES scan

How to effieciently write sorted neighborhood in pyspark

2015-08-25 Thread shahid qadri
I would like to implement the sorted neighborhood approach in spark; what is the best way to write that in pyspark?

Re: Using unserializable classes in tasks

2015-08-25 Thread Akhil Das
Instead of foreach, try to use foreachPartition; that will initialize the connector per partition rather than per record. Thanks Best Regards On Fri, Aug 14, 2015 at 1:13 PM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: No the connector does not need to be serializable cause it is
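
The suggested pattern, as a sketch (createConnection is a hypothetical, non-serializable factory that runs on the executor side):

    rdd.foreachPartition { records =>
      val conn = createConnection()           // one connection per partition
      try records.foreach(r => conn.send(r))  // reused for every record in the partition
      finally conn.close()
    }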

Re: Exception when S3 path contains colons

2015-08-25 Thread Akhil Das
You can change the names, whatever program that is pushing the record must follow the naming conventions. Try to replace : with _ or something. Thanks Best Regards On Tue, Aug 18, 2015 at 10:20 AM, Brian Stempin brian.stem...@gmail.com wrote: Hi, I'm running Spark on Amazon EMR (Spark 1.4.1,

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread Akhil Das
If the data is local to the machine then obviously it will be faster compared to pulling it through the network and storing it locally (either memory or disk etc). Have a look at the data locality

Re: Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-25 Thread Akhil Das
You hit 'block not found' issues when your processing time exceeds the batch duration (this happens with receiver-oriented streaming). If you are consuming messages from Kafka then try to use the directStream, or you can also set StorageLevel to MEMORY_AND_DISK with a receiver-oriented consumer. (This

Re: Re: What does Attribute and AttributeReference mean in Spark SQL

2015-08-25 Thread Todd
Thank you Michael for the detailed explanation, it makes it clear to me. Thanks! At 2015-08-25 15:37:54, Michael Armbrust mich...@databricks.com wrote: Attribute is the Catalyst name for an input column from a child operator. An AttributeReference has been resolved, meaning we know which input

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Feynman Liang
Kristina, Thanks for the discussion. I followed up on your problem and learned that Scala doesn't support multiple implicit conversions in a single expression http://stackoverflow.com/questions/8068346/can-scala-apply-multiple-implicit-conversions-in-one-expression for complexity reasons. I'm

Re: Protobuf error when streaming from Kafka

2015-08-25 Thread Cassa L
Hi, I am using Spark-1.4 and Kafka-0.8.2.1. As per google suggestions, I rebuilt all the classes with protobuf-2.5 dependencies. My new protobuf is compiled using 2.5. However, now my spark job does not start. It's throwing a different error. Does Spark or any of its dependencies use an old

RE: Protobuf error when streaming from Kafka

2015-08-25 Thread java8964
Did you build your Spark with Hive? I met the same problem before, because the hive-exec jar in maven itself includes protobuf classes, which will be included in the Spark jar. Yong Date: Tue, 25 Aug 2015 12:39:46 -0700 Subject: Re: Protobuf error when streaming from Kafka From: lcas...@gmail.com

Re: Protobuf error when streaming from Kafka

2015-08-25 Thread Cassa L
I downloaded below binary version of spark. spark-1.4.1-bin-cdh4 On Tue, Aug 25, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote: Did your spark build with Hive? I met the same problem before because the hive-exec jar in the maven itself include protobuf class, which will be included in

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Yana Kadiyska
The PermGen space error is controlled with the MaxPermSize parameter. I run with this in my pom, copied pretty literally from Spark's own tests... I don't know what the sbt equivalent is, but you should be able to pass it... possibly via SBT_OPTS? plugin
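
A rough sbt equivalent, as a sketch (assuming sbt 0.13 and a pre-Java-8 JVM, where PermGen still exists):

    // build.sbt: fork the test JVM so these options apply to the tests
    fork in Test := true
    javaOptions in Test ++= Seq("-XX:MaxPermSize=512m", "-Xmx2g")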

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
Thanks. I just tried and still am having trouble. It seems to still be using the private address even if I try going through the resource manager. On Tue, Aug 25, 2015 at 12:34 PM, Kelly, Jonathan jonat...@amazon.com wrote: I'm not sure why the UI appears broken like that either and haven't

SparkSQL problem with IBM BigInsight V3

2015-08-25 Thread java8964
Hi, On our production environment, we have a unique problems related to Spark SQL, and I wonder if anyone can give me some idea what is the best way to handle this. Our production Hadoop cluster is IBM BigInsight Version 3, which comes with Hadoop 2.2.0 and Hive 0.12. Right now, we build spark

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Susan Zhang
Sure thing! The main looks like: -- val kafkaBrokers = conf.getString(s"$varPrefix.metadata.broker.list") val kafkaConf = Map( "zookeeper.connect" -> zookeeper, "group.id" -> options.group,
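
For comparison, the standard restart-safe shape for a direct stream with checkpointing puts all setup inside a factory passed to StreamingContext.getOrCreate (a sketch; broker, topic and checkpoint path are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///checkpoints/myapp"

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("restart-safe"), Seconds(10))
      ssc.checkpoint(checkpointDir)
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))
      stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
      ssc
    }

    // on restart, this rebuilds the context (and unfinished batches) from the checkpoint
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()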

Re: Join with multiple conditions (In reference to SPARK-7197)

2015-08-25 Thread Davies Liu
It's good to support this, could you create a JIRA for it and target for 1.6? On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise michal.monsel...@gmail.com wrote: Hello All, PySpark currently has two ways of performing a join: specifying a join condition or column names. I would like to

Re: Too many files/dirs in hdfs

2015-08-25 Thread Mohit Anchlia
Based on what I've read, it appears that when using spark streaming there is no good way of optimizing the files on HDFS. Spark streaming writes many small files, which is not scalable in apache hadoop. The only other way seems to be to read the files after they have been written and merge them into a bigger

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Utkarsh Sengar
So do I need to manually copy these 2 jars on my spark executors? On Tue, Aug 25, 2015 at 10:51 AM, Marcelo Vanzin van...@cloudera.com wrote: On Tue, Aug 25, 2015 at 10:48 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Now I am going to try it out on our mesos cluster. I assumed

Checkpointing in Iterative Graph Computation

2015-08-25 Thread sachintyagi22
Hi, I have stumbled upon an issue with iterative GraphX computation (using v 1.4.1). It goes thusly -- Setup 1. Construct a graph. 2. Validate that the graph satisfies certain conditions. Here I do some assert(conditions) within graph.triplets.foreach(). [Notice that this materializes the

Select some data from Hive (SparkSQL) directly using NodeJS

2015-08-25 Thread Phakin Cheangkrachange
Hi, I just wonder if there's any way that I can get some sample data (10-20 rows) out of Spark's Hive using NodeJs? Submitting a spark job to show 20 rows of data in a web page is not good for me. I've set up the Spark Thrift Server as shown in the Spark Doc. The server works because I can use beeline

Re: Performance - Python streaming v/s Scala streaming

2015-08-25 Thread Utkarsh Patkar
Thanks for the quick response. I have tried the direct word count python example and it also seems to be slow. A lot of the time it does not fetch the words that are sent by the producer. I am using SPARK version 1.4.1 and KAFKA 2.10-0.8.2.0. On Tue, Aug 25, 2015 at 2:05 AM, Tathagata Das

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread gen tang
Great advice. Thanks a lot Nick. In fact, if we use the rdd.persist(DISK) command at the beginning of the program to avoid hitting the network again and again, the speed is not influenced a lot. In my case, it is just 1 min more compared to the situation where we put the data in local HDFS. Cheers

Re: spark not launching in yarn-cluster mode

2015-08-25 Thread Jeetendra Gangele
When I am launching with yarn-client it also gives me the below error: bin/spark-sql --master yarn-client 15/08/25 13:53:20 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED! Exception in thread Yarn application state monitor org.apache.spark.SparkException:

Re: How to increase data scale in Spark SQL Perf

2015-08-25 Thread Ted Yu
Looks like you were attaching images to your email which didn't go through. Consider using third party site for images - or paste error in text. Cheers On Tue, Aug 25, 2015 at 4:22 AM, Todd bit1...@163.com wrote: Hi, The spark sql perf itself contains benchmark data generation. I am using

Re: Spark Streaming failing on YARN Cluster

2015-08-25 Thread Ramkumar V
Yes, when I see my yarn logs for that particular failed app_id, I get the following error. ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 10 ms. Please check earlier log output for errors. Failing the application For this error, I need to change the

Re: Exception when S3 path contains colons

2015-08-25 Thread Gourav Sengupta
I am not quite sure about this, but should the notation not be s3n://redactedbucketname/* instead of s3a://redactedbucketname/*? The best way is to use s3://bucketname/path/* Regards, Gourav On Tue, Aug 25, 2015 at 10:35 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can change the names,

using Convert function of sql in spark sql

2015-08-25 Thread Rajeshkumar J
Hi All, I want to use the Convert() function of sql in one of my spark sql queries. Can anyone tell me whether it is supported or not?

How to increase data scale in Spark SQL Perf

2015-08-25 Thread Todd
Hi, The spark sql perf itself contains benchmark data generation. I am using the spark shell to run spark sql perf to generate the data, with 10G memory for both driver and executor. When I increase the scale factor to 30 and run the job, I get the following error. When I jstack it to

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Dino Fancellu
Tried adding 50010, 50020 and 50090. Still no difference. I can't imagine I'm the only person on the planet wanting to do this. Anyway, thanks for trying to help. Dino. On 25 August 2015 at 08:22, Roberto Congiu roberto.con...@gmail.com wrote: Port 8020 is not the only port you need tunnelled

Re: Exception when S3 path contains colons

2015-08-25 Thread Romi Kuntsman
Hello, We had the same problem. I've written a blog post with the detailed explanation and workaround: http://labs.totango.com/spark-read-file-with-colon/ Greetings, Romi K. On Tue, Aug 25, 2015 at 2:47 PM Gourav Sengupta gourav.sengu...@gmail.com wrote: I am not quite sure about this but

SparkSQL saveAsParquetFile does not preserve AVRO schema

2015-08-25 Thread storm
Hi, I have serious problems with saving a DataFrame as a parquet file. I read the data from a parquet file like this: val df = sparkSqlCtx.parquetFile(inputFile.toString) and print the schema (you can see both fields are required): root |-- time: long (nullable = false) |-- time_ymdhms: long

Re: Protobuf error when streaming from Kafka

2015-08-25 Thread Cassa L
Do you think this binary would have the issue? Do I need to build spark from source code? On Tue, Aug 25, 2015 at 1:06 PM, Cassa L lcas...@gmail.com wrote: I downloaded the below binary version of spark: spark-1.4.1-bin-cdh4 On Tue, Aug 25, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote: Did

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Rick Moritz
A quick question regarding this: how come the artifacts (spark-core in particular) on Maven Central are built with JDK 1.6 (according to the manifest), if Java 7 is required? On Aug 21, 2015 5:32 PM, Sean Owen so...@cloudera.com wrote: Spark 1.4 requires Java 7. On Fri, Aug 21, 2015, 3:12 PM

Persisting sorted parquet tables for future sort merge joins

2015-08-25 Thread Jason
I want to persist a large _sorted_ table to Parquet on S3 and then read this in and join it using the Sorted Merge Join strategy against another large sorted table. The problem is: even though I sort these tables on the join key beforehand, once I persist them to Parquet, they lose the

Spark thrift server on yarn

2015-08-25 Thread Udit Mehta
Hi, I am trying to start a spark thrift server using the following command on Spark 1.3.1 running on yarn: ./sbin/start-thriftserver.sh --master yarn://resourcemanager.snc1:8032 --executor-memory 512m --hiveconf hive.server2.thrift.bind.host=test-host.sn1 --hiveconf

RE: Spark thrift server on yarn

2015-08-25 Thread Cheng, Hao
Did you register the temp table via beeline or in a new Spark SQL CLI? As far as I know, a temp table cannot cross HiveContexts. Hao From: Udit Mehta [mailto:ume...@groupon.com] Sent: Wednesday, August 26, 2015 8:19 AM To: user Subject: Spark thrift server on yarn Hi, I am trying to start a

Re: Spark thrift server on yarn

2015-08-25 Thread Udit Mehta
I registered it in a new Spark SQL CLI. Yeah, I thought so too about how the temp tables were accessible across different applications without using a job-server. I see that running HiveThriftServer2.startWithContext(hiveContext) within the spark app starts up a thrift server. On Tue, Aug 25,
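
A minimal sketch of that approach (assuming Spark was built with the thrift server; the demo table and data are illustrative):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
    df.registerTempTable("demo")                     // temp table lives in this context
    HiveThriftServer2.startWithContext(hiveContext)  // beeline can now query "demo"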

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Sean Owen
-cdh-user This suggests that Maven is still using Java 6. I think this is indeed controlled by JAVA_HOME. Use 'mvn -X ...' to see a lot more about what is being used and why. I still suspect JAVA_HOME is not visible to the Maven process. Or maybe you have JRE 7 installed but not JDK 7 and it's

Scala: Overload method by its class type

2015-08-25 Thread Saif.A.Ellafi
Hi all, I have SomeClass[TYPE] { def some_method(args: fixed_type_args): TYPE } And at runtime, I create instances of this class with different AnyVal + String types, but the return type of some_method varies. I know I could do this with an implicit object, IF some_method received a type, but
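
One hedged way to get a per-type some_method is a type class resolved implicitly rather than overloading (SomeClass and some_method are from the question; Maker and its instances are illustrative):

    trait Maker[T] { def make(args: String): T }

    object Maker {
      implicit val intMaker: Maker[Int] =
        new Maker[Int] { def make(args: String): Int = args.toInt }
      implicit val stringMaker: Maker[String] =
        new Maker[String] { def make(args: String): String = args }
    }

    class SomeClass[T](implicit m: Maker[T]) {
      def some_method(args: String): T = m.make(args)   // return type follows T
    }

    val c = new SomeClass[Int]
    val n: Int = c.some_method("42")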
