Help connecting to the cluster

2014-03-07 Thread Yana Kadiyska
Hi Spark users, could someone help me out? My company has a fully functioning Spark cluster with Shark running on top of it (as part of the same cluster, on the same LAN). I'm interested in running raw Spark code against it but am running into the following issue -- it seems like the machine

Re: Spark stand alone cluster mode

2014-03-11 Thread Yana Kadiyska
does sbt show full-classpath show spark-core on the classpath? I am still pretty new to Scala but it seems like you have val sparkCore = "org.apache.spark" %% "spark-core" % V.spark % "provided" -- I believe the provided part means it's in your classpath. The spark-shell script sets up
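
For reference, a minimal build.sbt along the lines described might look like the sketch below (project name, Scala version, and Spark version are placeholders, not taken from the thread):

name := "my-spark-app"

scalaVersion := "2.10.3"

// "provided" keeps spark-core out of the packaged jar; the dependency is
// expected to be supplied at runtime (e.g. by spark-shell or the cluster).
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided"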

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
I am able to run standalone apps. I think you are making one mistake that throws you off from there onwards. You don't need to put your app under SPARK_HOME. I would create it in its own folder somewhere; it follows the rules of any standalone Scala program (including the layout). In the guide,

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
, and someone will point out that I'm missing a small step that will make the Spark distribution of sbt work! Diana On Mon, Mar 24, 2014 at 4:52 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8 (since like other folks I had sbt

Re: Writing RDDs to HDFS

2014-03-24 Thread Yana Kadiyska
Ognen, can you comment if you were actually able to run two jobs concurrently with just restricting spark.cores.max? I run Shark on the same cluster and was not able to see a standalone job get in (since Shark is a long running job) until I restricted both spark.cores.max _and_

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Yana Kadiyska
Nan (or anyone who feels they understand the cluster architecture well), can you clarify something for me? From reading this user group and your explanation above, it appears that the cluster master is only involved in this during application startup -- to allocate executors (from what you wrote

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Yana Kadiyska
Nicholas, I'm in Boston and would be interested in a Spark group. Not sure if you know this -- there was a meetup that never got off the ground. Anyway, I'd be +1 for attending. Not sure what is involved in organizing. Seems a shame that a city like Boston doesn't have one. On Mon, Mar 31, 2014

Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Yana Kadiyska
I might be wrong here but I don't believe it's discouraged. Maybe part of the reason there's not a lot of examples is that sql2rdd returns an RDD (TableRDD that is https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala). I haven't done anything too complicated yet but

Re: Urgently need help interpreting duration

2014-04-08 Thread Yana Kadiyska
times, or if it appeared that the launch times were serial with respect to the durations. If for some reason Spark started using only one executor, say, each task would take the same duration but would be executed one after another. On Tue, Apr 8, 2014 at 8:11 AM, Yana Kadiyska yana.kadiy

Re: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Yana Kadiyska
I think what you want to do is set spark.driver.port to a fixed port. On Fri, May 2, 2014 at 1:52 PM, Andrew Lee alee...@hotmail.com wrote: Hi All, I encountered this problem when the firewall is enabled between the spark-shell and the Workers. When I launch spark-shell in yarn-client
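
A minimal sketch of pinning the driver port from application code (the port number is an arbitrary example that the firewall would then need to allow):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("firewall-friendly")
  .set("spark.driver.port", "51000")   // executors connect back to the driver on this port
val sc = new SparkContext(conf)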

Shark resilience to unusable slaves

2014-05-22 Thread Yana Kadiyska
Hi, I am running into a pretty concerning issue with Shark (granted I'm running v. 0.8.1). I have a Spark slave node that has run out of disk space. When I try to start Shark it attempts to deploy the application to a directory on that node, fails and eventually gives up (I see a Master Removed

Re: Running Jars on Spark, program just hanging there

2014-05-27 Thread Yana Kadiyska
Does the spark UI show your program running? (http://spark-masterIP:8118). If the program is listed as running you should be able to see details via the UI. In my experience there are 3 sets of logs -- the log where you're running your program (the driver), the log on the master node, and the log

Master not seeing recovered nodes(Got heartbeat from unregistered worker ....)

2014-06-13 Thread Yana Kadiyska
Hi, I see this has been asked before but has not gotten any satisfactory answer so I'll try again: (here is the original thread I found: http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E ) I have a set of workers dying and coming back

Need some Streaming help

2014-06-16 Thread Yana Kadiyska
Like many people, I'm trying to do hourly counts. The twist is that I don't want to count per hour of streaming, but per hour of the actual occurrence of the event (wall clock, say yyyy-MM-dd HH). My thought is to make the streaming window large enough that a full hour of streaming data would fit
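
A rough sketch of keying by the event's own wall-clock hour rather than by batch time; the events stream, its timestamp field, and the window sizes are all assumptions for illustration:

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.StreamingContext._   // pair DStream functions on older versions

// assume events is a DStream of records that carry their own epoch-millis timestamp
val hourly = events
  .map(e => (new SimpleDateFormat("yyyy-MM-dd HH").format(new Date(e.timestampMs)), 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(75), Minutes(15))
hourly.print()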

Help with object access from mapper (simple question)

2014-06-23 Thread Yana Kadiyska
Hi folks, hoping someone can explain to me what's going on: I have the following code, largely based on RecoverableNetworkWordCount example ( https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala ): I am setting

Re: Help with object access from mapper (simple question)

2014-06-23 Thread Yana Kadiyska
class Foo extends Serializable { ... } val foo = new Foo() foo.field1 = blah lines.map(line => { println(foo) }) // now you should see the field values you set. On Mon, Jun 23, 2014 at 7:44 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, hoping someone can explain to me

Help understanding spark.task.maxFailures

2014-06-30 Thread Yana Kadiyska
Hi community, this one should be an easy one: I have left spark.task.maxFailures at its default (which should be 4). I see a job that shows the following statistics for Tasks: Succeeded/Total 7109/819 (1 failed) So there were 819 tasks to start with. I have 2 executors in that cluster. From

Re: Changing log level of spark

2014-07-01 Thread Yana Kadiyska
Are you looking at the driver log? (e.g. Shark?). I see a ton of information in the INFO category on what query is being started, what stage is starting and which executor stuff is sent to. So I'm not sure if you're saying you see all that and you need more, or that you're not seeing this type of

Re: Spark Streaming question batch size

2014-07-01 Thread Yana Kadiyska
Are you saying that both streams come in at the same rate and you have the same batch interval but the batch size ends up different? i.e. two datapoints both arriving at X seconds after streaming starts end up in two different batches? How do you define real time values for both streams? I am

Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Yana Kadiyska
A lot of things can get funny when you run distributed as opposed to local -- e.g. some jar not making it over. Do you see anything of interest in the log on the executor machines -- I'm guessing 192.168.222.152/192.168.222.164. From here

Re: Help alleviating OOM errors

2014-07-02 Thread Yana Kadiyska
as it gets but still going on 32+ hours. On Wed, Jul 2, 2014 at 2:40 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, Jun 30, 2014 at 8:09 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, our

Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-02 Thread Yana Kadiyska
The scripts that Xiangrui mentions set up the classpath...Can you run ./run-example for the provided example successfully? What you can try is set SPARK_PRINT_LAUNCH_COMMAND=1 and then call run-example -- that will show you the exact java command used to run the example at the start of execution.

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-09 Thread Yana Kadiyska
class java.io.IOException: Cannot run program /Users/aris.vlasakakis/Documents/spark-1.0.0/bin/compute-classpath.sh (in directory .): error=2, No such file or directory By any chance, are your SPARK_HOME directories different on the machine where you're submitting from and the cluster? I'm on an

Re: Map Function does not seem to be executing over RDD

2014-07-09 Thread Yana Kadiyska
Does this line println("Retuning " + string) from the hash function print what you expect? If you're not seeing that output in the executor log, I'd also put some debug statements in the case other branch, since your match in the interesting case is conditioned on if (fieldsList.contains(index)) -- maybe that

Re: incorrect labels being read by MLUtils.loadLabeledData()

2014-07-10 Thread Yana Kadiyska
I do not believe the order of points in a distributed RDD is in any way guaranteed. For a simple test, you can always add a last column which is an id (make it double and throw it in the feature vector). Printing the rdd back will not give you the points in file order. If you don't want to go that

Help using streaming from Spark Shell

2014-07-26 Thread Yana Kadiyska
Hi, I'm starting spark-shell like this: SPARK_MEM=1g SPARK_JAVA_OPTS=-Dspark.cleaner.ttl=3600 /spark/bin/spark-shell -c 3 but when I try to create a streaming context val scc = new StreamingContext(sc, Seconds(10)) I get: org.apache.spark.SparkException: Spark Streaming cannot be used

[Streaming] updateStateByKey trouble

2014-08-06 Thread Yana Kadiyska
Hi folks, hoping someone who works with Streaming can help me out. I have the following snippet: val stateDstream = data.map(x => (x, 1)) .updateStateByKey[State](updateFunc) stateDstream.saveAsTextFiles(checkpointDirectory, "partitions_test") where data is an RDD of case class

Spark working directories

2014-08-14 Thread Yana Kadiyska
Hi all, trying to change defaults of where stuff gets written. I've set -Dspark.local.dir=/spark/tmp and I can see that the setting is used when the executor is started. I do indeed see directories like spark-local-20140815004454-bb3f in this desired location but I also see undesired stuff under

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-19 Thread Yana Kadiyska
Sean, would this work -- rdd.mapPartitions { partition => Iterator(partition) }.foreach( // Some setup code here // save partition to DB // Some cleanup code here ) I tried a pretty simple example ... I can see that the setup and cleanup are executed on the executor node, once per
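
For what it's worth, a minimal sketch of the per-partition setup/cleanup idiom being discussed -- createConnection/close stand in for whatever resource is needed on the executor:

rdd.foreachPartition { partition =>
  val conn = createConnection()      // setup: runs once per partition, on the executor
  try {
    partition.foreach(record => conn.save(record))   // save the partition to the DB
  } finally {
    conn.close()                     // cleanup: also runs on the executor
  }
}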

[Streaming] Cannot get executors to stay alive

2014-08-27 Thread Yana Kadiyska
Hi, I tried a similar question before and didn't get any answers,so I'll try again: I am using updateStateByKey, pretty much exactly as shown in the examples shipping with Spark: def createContext(master:String,dropDir:String, checkpointDirectory:String) = { val updateFunc = (values:

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread Yana Kadiyska
Can you clarify the scenario: val ssc = new StreamingContext(sparkConf, Seconds(10)) ssc.checkpoint(checkpointDirectory) val stream = KafkaUtils.createStream(...) val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1L)) val wordDstream = wordCounts.updateStateByKey[Int](updateFunc)

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-29 Thread Yana Kadiyska
I understand that the DB writes are happening from the workers unless you collect. My confusion is that you believe workers recompute on recovery (node computations which get redone upon recovery). My understanding is that checkpointing dumps the RDD to disk and then cuts the RDD lineage. So I

Re: Spark Java Configuration.

2014-09-02 Thread Yana Kadiyska
JavaSparkContext java_SC = new JavaSparkContext(conf); is the spark context. An application has a single spark context -- you won't be able to keep calling this -- you'll see an error if you try to create a second such object from the same application. Additionally, depending on your

[MLib] How do you normalize features?

2014-09-03 Thread Yana Kadiyska
It seems like the next release will add a nice org.apache.spark.mllib.feature package but what is the recommended way to normalize features in the current release (1.0.2) -- I'm hoping for a general pointer here. At the moment I have a RDD[LabeledPoint] and I can get a

Re: Object serialisation inside closures

2014-09-04 Thread Yana Kadiyska
In the third case the object does not get shipped around. Each executor will create its own instance. I got bitten by this here: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-object-access-from-mapper-simple-question-tt8125.html On Thu, Sep 4, 2014 at 9:29 AM, Andrianasolo

Re: Multiple spark shell sessions

2014-09-04 Thread Yana Kadiyska
These are just warnings from the web server. Normally your application will have a UI page on port 4040. In your case, a little after the warning it should bind just fine to another port (mine picked 4041). I'm running on 0.9.1. Do you actually see the application failing? The main thing when

Re: error: type mismatch while Union

2014-09-05 Thread Yana Kadiyska
Which version are you using -- I can reproduce your issue w/ 0.9.2 but not with 1.0.1...so my guess is that it's a bug and the fix hasn't been backported... No idea on a workaround though.. On Fri, Sep 5, 2014 at 7:58 AM, Dhimant dhimant84.jays...@gmail.com wrote: Hi, I am getting type

Re: Cannot run SimpleApp as regular Java app

2014-09-09 Thread Yana Kadiyska
spark-submit is a script which calls spark-class script. Can you output the command that spark-class runs (say, by putting set -x before the very last line?). You should see the java command that is being run. The scripts do some parameter setting so it's possible you're missing something. It

Re: Problem in running mosek in spark cluster - java.lang.UnsatisfiedLinkError: no mosekjava7_0 in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1738)

2014-09-09 Thread Yana Kadiyska
If that library has native dependencies you'd need to make sure that the native code is on all boxes and in the path with export SPARK_LIBRARY_PATH=... On Tue, Sep 9, 2014 at 10:17 AM, ayandas84 ayanda...@gmail.com wrote: We have a small apache spark cluster of 6 computers. We are trying to

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Yana Kadiyska
Tim, I asked a similar question twice: here http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Cannot-get-executors-to-stay-alive-tt12940.html and here http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Executor-OOM-tt12383.html and have not yet received any responses. I

Need help with ThriftServer/Spark1.1.0

2014-09-15 Thread Yana Kadiyska
Hi ladies and gents, trying to get Thrift server up and running in an effort to replace Shark. My first attempt to run sbin/start-thriftserver resulted in: 14/09/15 17:09:05 ERROR TThreadPoolServer: Error occurred during processing of message. java.lang.RuntimeException:

Re: Serving data

2014-09-16 Thread Yana Kadiyska
If your dashboard is doing ajax/pull requests against say a REST API you can always create a Spark context in your rest service and use SparkSQL to query over the parquet files. The parquet files are already on disk so it seems silly to write both to parquet and to a DB...unless I'm missing

Re: persistent state for spark streaming

2014-10-01 Thread Yana Kadiyska
I don't think persist is meant for end-user usage. You might want to call saveAsTextFiles, for example, if you're saving to the file system as strings. You can also dump the DStream to a DB -- there are samples on this list (you'd have to do a combo of foreachRDD and mapPartitions, likely) On

Re: persistent state for spark streaming

2014-10-02 Thread Yana Kadiyska
, So, user quotas need another data store, which can guarantee persistence and afford frequent data updates/access. Is it correct? Thanks, Chia-Chun 2014-10-01 21:48 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com: I don't think persist is meant for end-user usage. You might want to call

[SparkSQL] Function parity with Shark?

2014-10-02 Thread Yana Kadiyska
Hi, in an effort to migrate off of Shark I recently tried the Thrift JDBC server that comes with Spark 1.1.0. However, I observed that conditional functions do not work (I tried 'case' and 'coalesce'); some string functions like 'concat' also did not work. Is there a list of what's missing or a

Re: [SparkSQL] Function parity with Shark?

2014-10-03 Thread Yana Kadiyska
, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, in an effort to migrate off of Shark I recently tried the Thrift JDBC server that comes with Spark 1.1.0. However I observed that conditional functions do not work (I tried 'case' and 'coalesce') some string functions like 'concat' also did

Re: Akka connection refused when running standalone Scala app on Spark 0.9.2

2014-10-03 Thread Yana Kadiyska
when you're running spark-shell and the example, are you actually specifying --master spark://master:7077 as shown here: http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark because if you're not, your spark-shell is running in local mode and not actually connecting to

Re: Akka connection refused when running standalone Scala app on Spark 0.9.2

2014-10-03 Thread Yana Kadiyska
clue. On 03.10.14 19:37, Yana Kadiyska wrote: when you're running spark-shell and the example, are you actually specifying --master spark://master:7077 as shown here: http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark because if you're not, your spark-shell

Re: [SparkSQL] Function parity with Shark?

2014-10-06 Thread Yana Kadiyska
in! These both look like they should have JIRAs. On Fri, Oct 3, 2014 at 8:14 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Thanks -- it does appear that I misdiagnosed a bit: case works generally but it doesn't seem to like the bit operation, which does not seem to work (type of bit_field

Re: Help with using combineByKey

2014-10-09 Thread Yana Kadiyska
If you just want the ratio of positive to all values per key (if I'm reading right) this works: val reduced = input.groupByKey().map(grp => grp._2.filter(v => v > 0).size.toFloat / grp._2.size) reduced.foreach(println) I don't think you need reduceByKey or combineByKey as you're not doing anything where the
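
An alternative sketch that computes the same per-key ratio without materializing whole groups, assuming input is an RDD of (key, numeric value) pairs:

import org.apache.spark.SparkContext._   // pair RDD functions (needed on older Spark versions)

val ratios = input
  .mapValues(v => (if (v > 0) 1L else 0L, 1L))           // (positive?, total) per record
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))     // sum both counts per key
  .mapValues { case (pos, total) => pos.toDouble / total }
ratios.foreach(println)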

Re: Problem executing Spark via JBoss application

2014-10-15 Thread Yana Kadiyska
From this line: Removing executor app-20141015142644-0125/0 because it is EXITED, I would guess that you need to examine the executor log to see why the executor actually exited. My guess would be that the executor cannot connect back to your driver. But check the log from the executor. It should

Re: Exception Logging

2014-10-16 Thread Yana Kadiyska
you can put a try/catch block in the map function and log the exception. The only tricky part is that the exception log will be located on the executor machine. Even if you don't do any trapping you should see the exception stacktrace in the executors' stderr log, which is visible through the UI
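
A small sketch of what that can look like -- parse is a stand-in for your own per-record logic, and the println output lands in the executors' stderr:

val parsed = lines.flatMap { line =>
  try {
    Some(parse(line))
  } catch {
    case e: Exception =>
      System.err.println("Failed on line '" + line + "': " + e.getMessage)
      None   // drop the record that failed
  }
}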

Re: create a Row Matrix

2014-10-22 Thread Yana Kadiyska
This works for me import org.apache.spark.mllib.linalg._ import org.apache.spark.mllib.linalg.distributed.RowMatrix val v1=Vectors.dense(Array(1d,2d)) val v2=Vectors.dense(Array(3d,4d)) val rows=sc.parallelize(List(v1,v2)) val mat=new RowMatrix(rows) val svd:
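
One plausible continuation of the truncated last line (not part of the original message) -- computing the SVD of that matrix:

val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(2, computeU = true)
println(svd.s)   // the singular values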

Problem packing spark-assembly jar

2014-10-23 Thread Yana Kadiyska
Hi folks, I'm trying to deploy the latest from master branch and having some trouble with the assembly jar. In the spark-1.1 official distribution(I use cdh version), I see the following jars, where spark-assembly-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar contains a ton of stuff:

Re: Problem packing spark-assembly jar

2014-10-24 Thread Yana Kadiyska
related but I don't know. On Fri, Oct 24, 2014 at 2:22 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I'm trying to deploy the latest from master branch and having some trouble with the assembly jar. In the spark-1.1 official distribution(I use cdh version), I see

[Spark SQL] Setting variables

2014-10-24 Thread Yana Kadiyska
Hi all, I'm trying to set a pool for a JDBC session. I'm connecting to the thrift server via JDBC client. My installation appears to be good (queries run fine), I can see the pools in the UI, but any attempt to set a variable (I tried spark.sql.shuffle.partitions and

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-27 Thread Yana Kadiyska
guoxu1231, I struggled with the Idea problem for a full week. Same thing -- clean builds under MVN/Sbt, but no luck with IDEA. What worked for me was the solution posted higher up in this thread -- it's a SO post that basically says to delete all iml files anywhere under the project directory.

Re: problem with start-slaves.sh

2014-10-29 Thread Yana Kadiyska
I see this when I start a worker and then try to start it again forgetting it's already running (I don't use start-slaves, I start the slaves individually with start-slave.sh). All this is telling you is that there is already a running process on that machine. You can see it if you do a ps

Re: CANNOT FIND ADDRESS

2014-10-29 Thread Yana Kadiyska
CANNOT FIND ADDRESS occurs when your executor has crashed. I'd look further down where it shows each task and check whether any tasks failed. Then you can examine the error log of that executor and see why it died. On Wed, Oct 29, 2014 at 9:35 AM, akhandeshi ami.khande...@gmail.com wrote:

Re: problem with start-slaves.sh

2014-10-30 Thread Yana Kadiyska
: hi Yana, in my case I did not start any spark worker. However, shark was definitely running. Do you think that might be a problem? I will take a look Thank you, -- *From:* Yana Kadiyska [yana.kadiy...@gmail.com] *Sent:* Wednesday, October 29, 2014 9:45 AM

Re: overloaded method value updateStateByKey ... cannot be applied to ... when Key is a Tuple2

2014-11-12 Thread Yana Kadiyska
Adrian, do you know if this is documented somewhere? I was also under the impression that setting a key's value to None would cause the key to be discarded (without any explicit filtering on the user's part) but cannot find any official documentation to that effect. On Wed, Nov 12, 2014 at 2:43
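
A small sketch of the behaviour in question, assuming it does work as described -- returning None from the update function drops the key from the state:

import org.apache.spark.streaming.StreamingContext._   // pair DStream functions on older versions

val updateFunc = (newValues: Seq[Int], state: Option[Int]) => {
  val sum = state.getOrElse(0) + newValues.sum
  if (newValues.isEmpty && sum == 0) None    // key would be discarded from the state
  else Some(sum)
}
val stateDstream = pairs.updateStateByKey[Int](updateFunc)   // pairs: DStream[(K, Int)]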

[SQL] HiveThriftServer2 failure detection

2014-11-19 Thread Yana Kadiyska
Hi all, I am running HiveThriftServer2 and noticed that the process stays up even though there is no driver connected to the Spark master. I started the server via sbin/start-thriftserver and my namenodes are currently not operational. I can see from the log that there was an error on startup:

Re: [SQL] HiveThriftServer2 failure detection

2014-11-19 Thread Yana Kadiyska
https://issues.apache.org/jira/browse/SPARK-4497 On Wed, Nov 19, 2014 at 1:48 PM, Michael Armbrust mich...@databricks.com wrote: This is not by design. Can you please file a JIRA? On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi all, I am running

[SQL]Proper use of spark.sql.thriftserver.scheduler.pool

2014-11-19 Thread Yana Kadiyska
Hi sparkers, I'm trying to use spark.sql.thriftserver.scheduler.pool for the first time (earlier I was stuck because of https://issues.apache.org/jira/browse/SPARK-4037) I have two pools setup: [image: Inline image 1] and would like to issue a query against the low priority pool. I am doing

[SQL] Wildcards in SQLContext.parquetFile?

2014-12-03 Thread Yana Kadiyska
Hi folks, I'm wondering if someone has successfully used wildcards with a parquetFile call? I saw this thread and it makes me think no? http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCACA1tWLjcF-NtXj=pqpqm3xk4aj0jitxjhmdqbojj_ojybo...@mail.gmail.com%3E I have a set

Re: Cluster getting a null pointer error

2014-12-10 Thread Yana Kadiyska
does spark-submit with SparkPi and spark-examples.jar work? e.g. ./spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://xx.xx.xx.xx:7077 /path/to/examples.jar On Tue, Dec 9, 2014 at 6:58 PM, Eric Tanner eric.tan...@justenough.com wrote: I have set up a cluster

Trouble with cache() and parquet

2014-12-10 Thread Yana Kadiyska
Hi folks, wondering if anyone has thoughts. Trying to create something akin to a materialized view (sqlContext is a HiveContext connected to an external metastore): val last2HourRdd = sqlContext.sql(s"select * from mytable") //last2HourRdd.first prints out a org.apache.spark.sql.Row = [...] with

Re: Trouble with cache() and parquet

2014-12-11 Thread Yana Kadiyska
in the parquet file? One way to test would be to just use sqlContext.parquetFile(...) which infers the schema from the file instead of using the metastore. On Wed, Dec 10, 2014 at 12:46 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, wondering if anyone has thoughts. Trying to create

Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-13 Thread Yana Kadiyska
Since you mentioned this, I had a related quandary recently -- it also says that the forum archives u...@spark.incubator.apache.org / d...@spark.incubator.apache.org respectively, yet the Community page clearly says to email the

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Yana Kadiyska
Denny, I am not sure what exception you're observing but I've had luck with 2 things: val table = sc.textFile("hdfs://...") You can try calling table.first here and you'll see the first line of the file. You can also do val debug = table.first.split("\t") which would give you an array and you can
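
A short sketch of those two debugging steps plus limiting each row to the first N columns (the path and the column count are just examples):

val table = sc.textFile("hdfs:///some/path/data.tsv")
println(table.first)                                  // eyeball the raw first line
val debug = table.first.split("\t")                   // Array[String] of that line's columns
val firstTen = table.map(_.split("\t").take(10))      // keep only the first 10 columns per row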

Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-23 Thread Yana Kadiyska
if you're running the test via sbt you can examine the classpath that sbt uses for the test (show runtime:full-classpath or last run)-- I find this helps once too many includes and excludes interact. On Thu, Jan 22, 2015 at 3:50 PM, Adrian Mocanu amoc...@verticalscope.com wrote: I use spark

Re: Results never return to driver | Spark Custom Reader

2015-01-23 Thread Yana Kadiyska
It looks to me like your executor actually crashed and didn't just finish properly. Can you check the executor log? It is available in the UI, or on the worker machine, under $SPARK_HOME/work/app-20150123155114-/6/stderr (unless you manually changed the work directory location but in that

Re: spark-shell has syntax error on windows.

2015-01-23 Thread Yana Kadiyska
...@gmail.com wrote: Do you mind filing a JIRA issue for this which includes the actual error message string that you saw? https://issues.apache.org/jira/browse/SPARK On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I am not sure if you get the same exception as I

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-19 Thread Yana Kadiyska
If you're talking about filter pushdowns for parquet files, this also has to be turned on explicitly. Try spark.sql.parquet.filterPushdown=true. It's off by default. On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang wangxy...@gmail.com wrote: Yes it works! But the filter can't pushdown!!! If
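
The flag can also be flipped from code rather than on the command line, e.g.:

// or pass --conf spark.sql.parquet.filterPushdown=true to spark-shell / spark-submit
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")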

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-17 Thread Yana Kadiyska
Just wondering if you've made any progress on this -- I'm having the same issue. My attempts to help myself are documented here http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJ4HpHFVKvdNgKes41DvuFY=+f_nTJ2_RT41+tadhNZx=bc...@mail.gmail.com%3E . I don't believe I have the

Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Yana Kadiyska
scala> sql("SELECT user_id FROM actions where conversion_aciton_id=20141210") From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Wednesday, January 14, 2015 11:12 PM To: Pala M Muthaia Cc: user@spark.apache.org Subject: Re: Issues with constants in Spark HiveQL queries

Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Yana Kadiyska
Just a guess but what is the type of conversion_aciton_id? I do queries over an epoch all the time with no issues(where epoch's type is bigint). You can see the source here https://github.com/apache/spark/blob/v1.2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala -- not sure what

Re: Using Spark SQL with multiple (avro) files

2015-01-14 Thread Yana Kadiyska
If the wildcard path you have doesn't work you should probably open a bug -- I had a similar problem with Parquet and it was a bug which recently got closed. Not sure if sqlContext.avroFile shares a codepath with .parquetFile...you can try running with bits that have the fix for .parquetFile or

Re: spark-shell has syntax error on windows.

2015-01-22 Thread Yana Kadiyska
I am not sure if you get the same exception as I do -- spark-shell2.cmd works fine for me. Windows 7 as well. I've never bothered looking to fix it as it seems spark-shell just calls spark-shell2 anyway... On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko protsenk...@gmail.com wrote: I have a

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Yana Kadiyska
-site.xml, and then re-run the query. I can see significant differences by doing so. I’ll open a JIRA and deliver a fix for this ASAP. Thanks again for reporting all the details! Cheng On 1/13/15 12:56 PM, Yana Kadiyska wrote: Attempting to bump this up in case someone can help out after

Re: Installing Spark Standalone to a Cluster

2015-01-22 Thread Yana Kadiyska
You can do ./sbin/start-slave.sh --master spark://IP:PORT. I believe you're missing --master. In addition, it's a good idea to pass with --master exactly the spark master's endpoint as shown on your UI under http://localhost:8080. But that should do it. If that's not working, you can look at the

Parquet predicate pushdown troubles

2015-01-09 Thread Yana Kadiyska
I am running the following (connecting to an external Hive Metastore) /a/shark/spark/bin/spark-shell --master spark://ip:7077 --conf spark.sql.parquet.filterPushdown=true val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) and then ran two queries: sqlContext.sql(select count(*)

Re: Getting Output From a Cluster

2015-01-08 Thread Yana Kadiyska
are you calling saveAsTextFiles on the DStream -- looks like it? Look at the section called Design Patterns for using foreachRDD in the link you sent -- you want to do dstream.foreachRDD(rdd => rdd.saveAs...) On Thu, Jan 8, 2015 at 5:20 PM, Su She suhsheka...@gmail.com wrote: Hello
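
A minimal sketch of that design pattern -- the output path naming is just an example:

dstream.foreachRDD { (rdd, time) =>
  // one output directory per batch, named after the batch time
  rdd.saveAsTextFile("hdfs:///output/batch-" + time.milliseconds)
}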

textFile partitions

2015-02-09 Thread Yana Kadiyska
Hi folks, puzzled by something pretty simple: I have a standalone cluster with default parallelism of 2, spark-shell running with 2 cores. sc.textFile("README.md").partitions.size returns 2 (this makes sense). sc.textFile("README.md").coalesce(100, true).partitions.size returns 100, also makes sense

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-08 Thread Yana Kadiyska
On Wednesday, January 7, 2015 5:25 PM, Laeeq Ahmed laeeqsp...@yahoo.com.INVALID wrote: Hi Yana, I also think that val top10 = your_stream.mapPartitions(rdd => rdd.take(10)) will give top 10 from each partition. I will try your code. Regards, Laeeq On Wednesday, January 7, 2015 5:19 PM, Yana

Re: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Yana Kadiyska
You might also want to see if the Windows Task Scheduler helps with that. I have not used it with Windows 2008 R2, but it generally does allow you to schedule a bat file to run on startup. On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Thanks for the suggestion.

Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
: Hi yana, I have removed hive-site.xml from spark/conf directory but still getting the same errors. Anyother way to work around. Regards, Sandeep On Fri, Feb 27, 2015 at 9:38 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I think you're mixing two things: the docs say When

Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
I think you're mixing two things: the docs say "When not configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory." AFAIK if you want a local metastore, you don't put hive-site.xml anywhere. You only need the file if you're going to

Re: Running multiple threads with same Spark Context

2015-02-25 Thread Yana Kadiyska
the program after setting the property spark.scheduler.mode to FAIR. But the result is same as previous. Are there any other properties that have to be set? On Tue, Feb 24, 2015 at 10:26 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: It's hard to tell. I have not run this on EC2 but this worked

Executor size and checkpoints

2015-02-21 Thread Yana Kadiyska
Hi all, I had a streaming application and midway through things decided to up the executor memory. I spent a long time launching like this: ~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest --executor-memory 2G --master... and observing that the executor memory was still at the old 512

Re: Executor size and checkpoints

2015-02-24 Thread Yana Kadiyska
config took effect. Maybe. :) TD On Sat, Feb 21, 2015 at 7:30 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi all, I had a streaming application and midway through things decided to up the executor memory. I spent a long time launching like this: ~/spark-1.2.0-bin-cdh4/bin/spark-submit

[SparkSQL] Number of map tasks in SparkSQL

2015-02-24 Thread Yana Kadiyska
Shark used to have shark.map.tasks variable. Is there an equivalent for Spark SQL? We are trying a scenario with heavily partitioned Hive tables. We end up with a UnionRDD with a lot of partitions underneath and hence too many tasks:

Re: Running multiple threads with same Spark Context

2015-02-24 Thread Yana Kadiyska
It's hard to tell. I have not run this on EC2 but this worked for me: The only thing that I can think of is that the scheduling mode is set to Scheduling Mode: FAIR val pool: ExecutorService = Executors.newFixedThreadPool(poolSize) while_loop to get curr_job pool.execute(new
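
A rough sketch of the pattern described, assuming one shared SparkContext and FAIR scheduling; pool names, sizes, and the dummy jobs are made up for illustration:

import java.util.concurrent.Executors
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("multi-job").set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

val pool = Executors.newFixedThreadPool(4)
(1 to 4).foreach { i =>
  pool.execute(new Runnable {
    def run(): Unit = {
      sc.setLocalProperty("spark.scheduler.pool", "pool" + i)   // optional per-thread pool
      val n = sc.parallelize(1 to 1000000).filter(_ % (i + 1) == 0).count()
      println("job " + i + " counted " + n)
    }
  })
}
pool.shutdown()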

[SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yana Kadiyska
Can someone confirm if they can run UDFs in group by in spark1.2? I have two builds running -- one from a custom build from early December (commit 4259ca8dd12) which works fine, and Spark1.2-RC2. On the latter I get: jdbc:hive2://XXX.208:10001 select

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Imran, I have also observed the phenomenon of reducing the cores helping with OOM. I wanted to ask this (hopefully without straying off topic): we can specify the number of cores and the executor memory. But we don't get to specify _how_ the cores are spread among executors. Is it possible that

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the shuffle setting: spark.sql.shuffle.partitions On Thu, Feb 26, 2015 at 5:51 PM, java8964 java8...@hotmail.com wrote: Imran, thanks for your explaining about the parallelism. That is very helpful. In my test case, I am

[SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-13 Thread Yana Kadiyska
. On Fri, Jan 9, 2015 at 2:56 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I am running the following (connecting to an external Hive Metastore) /a/shark/spark/bin/spark-shell --master spark://ip:7077 --conf *spark.sql.parquet.filterPushdown=true* val sqlContext = new

[SQL] Simple DataFrame questions

2015-04-02 Thread Yana Kadiyska
Hi folks, having some seemingly noob issues with the DataFrame API. I have a DF which came from the csv package. 1. What would be an easy way to cast a column to a given type -- my DF columns are all typed as strings coming from a csv. I see a schema getter but not setter on DF. 2. I am trying
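
For the first question, a sketch of one approach that should work on a 1.3-style DataFrame (the column names here are hypothetical):

import org.apache.spark.sql.types.{IntegerType, DoubleType}

// re-select the columns, casting the string columns to the types you want
val typed = df.select(
  df("name"),
  df("age").cast(IntegerType).as("age"),
  df("score").cast(DoubleType).as("score"))
typed.printSchema()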

Escaping user input for Hive queries

2015-05-05 Thread Yana Kadiyska
Hi folks, we have been using the a JDBC connection to Spark's Thrift Server so far and using JDBC prepared statements to escape potentially malicious user input. I am trying to port our code directly to HiveContext now (i.e. eliminate the use of Thrift Server) and I am not quite sure how to

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Yana Kadiyska
Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3928 Looks like for now you'd have to list the full paths...I don't see a comment from an official spark committer so still not sure if this is a bug or design, but it seems to be the current state of affairs. On Thu, May 7, 2015 at
