Deleting temp dir Exception

2014-11-03 Thread Josh
Hi, I've written a short Scala app to perform word counts on a text file and am getting the following exception as the program completes (after it prints out all of the word counts). Exception in thread "delete Spark temp dir C:\Users\Josh\AppData\Local\Temp\spark-0fdd0b79-7329-4690-a093

Re: flatten RDD[RDD[T]]

2014-03-02 Thread Josh Rosen
Nope, nested RDDs aren't supported: https://groups.google.com/d/msg/spark-users/_Efj40upvx4/DbHCixW7W7kJ https://groups.google.com/d/msg/spark-users/KC1UJEmUeg8/N_qkTJ3nnxMJ https://groups.google.com/d/msg/spark-users/rkVPXAiCiBk/CORV5jyeZpAJ On Sun, Mar 2, 2014 at 5:37 PM, Cosmin Radoi
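
Since nested RDDs aren't supported, the usual workaround is to keep the RDDs in a driver-side collection and flatten them with SparkContext.union; a minimal Scala sketch:

    import scala.reflect.ClassTag

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Instead of building an RDD[RDD[T]], keep the RDDs in a driver-side
    // Seq and flatten them with SparkContext.union.
    def flatten[T: ClassTag](sc: SparkContext, rdds: Seq[RDD[T]]): RDD[T] =
      sc.union(rdds)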

Re: Spark and HBase

2014-04-25 Thread Josh Mahonin
or not though, so if anyone else is looking into this, I'd love to hear their thoughts. Josh On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Just took a quick look at the overview here: http://phoenix.incubator.apache.org/ and the quick start guide here: http

advice on maintaining a production spark cluster?

2014-05-16 Thread Josh Marcus
Hey folks, I'm wondering what strategies other folks are using for maintaining and monitoring the stability of stand-alone spark clusters. Our master very regularly loses workers, and they (as expected) never rejoin the cluster. This is the same behavior I've seen using akka cluster (if that's

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
...@gmail.com wrote: Which version is this with? I haven't seen standalone masters lose workers. Is there other stuff on the machines that's killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
, 2014 at 3:28 PM, Josh Marcus jmar...@meetup.com wrote: We're using spark 0.9.0, and we're using it out of the box -- not using Cloudera Manager or anything similar. There are warnings from the master that there continue to be heartbeats from the unregistered workers. I will see

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, but usually after heavy use and after running for a long time. I had figured I'd see if the changes since 0.9.0 addressed it and revisit later. On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-24 Thread Josh Marcus
Jeremy, Just to be clear, are you assembling a jar with that class compiled (with its dependencies) and including the path to that jar on the command line in an environment variable (e.g. SPARK_CLASSPATH=path ./spark-shell)? --j On Saturday, May 24, 2014, Jeremy Lewi jer...@lewi.us wrote: Hi

Spark 1.0.1 EC2 - Launching Applications

2014-07-14 Thread Josh Happoldt
submit jobs to the cluster either. Thanks! Josh

Re: Broadcasting a set in PySpark

2014-07-18 Thread Josh Rosen
You have to use `myBroadcastVariable.value` to access the broadcasted value; see https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables On Fri, Jul 18, 2014 at 2:56 PM, Vedant Dhandhania ved...@retentionscience.com wrote: Hi All, I am trying to broadcast a set in a
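
That thread concerns PySpark, but the rule carries over to Scala; a small sketch of the same pattern, assuming a spark-shell `sc`:

    import org.apache.spark.broadcast.Broadcast

    // broadcast() returns a Broadcast wrapper; call .value inside tasks
    // to get at the underlying collection.
    val lookup: Broadcast[Set[String]] = sc.broadcast(Set("a", "b"))
    val hits = sc.parallelize(Seq("a", "c")).filter(x => lookup.value.contains(x))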

Re: Data from Mysql using JdbcRDD

2014-07-30 Thread Josh Mahonin
, upper bound index, and number of partitions. With that example query and those values, you should end up with an RDD with two partitions, one with the student_info from 1 through 10, and the second with ids 11 through 20. Josh On Wed, Jul 30, 2014 at 6:58 PM, chaitu reddy chaitzre...@gmail.com
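
A minimal sketch of that JdbcRDD setup, assuming a spark-shell `sc`; the connection string and schema are made up:

    import java.sql.{DriverManager, ResultSet}

    import org.apache.spark.rdd.JdbcRDD

    // The SQL must contain exactly two '?' placeholders; Spark binds them to
    // the bounds of each partition's id range. With bounds 1..20 and 2
    // partitions, partition 0 covers ids 1-10 and partition 1 covers 11-20.
    val students = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://localhost/school", "user", "pass"),
      "SELECT id, name FROM student_info WHERE id >= ? AND id <= ?",
      lowerBound = 1, upperBound = 20, numPartitions = 2,
      mapRow = (r: ResultSet) => r.getString("name"))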

Re: Question on mappartitionwithsplit

2014-08-17 Thread Josh Rosen
Has anyone tried using functools.partial ( https://docs.python.org/2/library/functools.html#functools.partial) with PySpark? If it works, it might be a nice way to address this use-case. On Sun, Aug 17, 2014 at 7:35 PM, Davies Liu dav...@databricks.com wrote: On Sun, Aug 17, 2014 at 11:21 AM,

Difference between amplab docker and spark docker?

2014-08-20 Thread Josh J
Hi, What's the difference between amplab docker https://github.com/amplab/docker-scripts and spark docker https://github.com/apache/spark/tree/master/docker? Thanks, Josh

multiple windows from the same DStream ?

2014-08-21 Thread Josh J
windowMessages1 = messages.window(windowLength, slideInterval); JavaPairDStream<String,String> windowMessages2 = messages.window(windowLength, slideInterval); Thanks, Josh
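
Opening several windows over one DStream is supported; each window() call returns an independent windowed stream. A Scala sketch with assumed durations:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Two independent windowed views over the same source stream; each
    // window() call returns a new DStream and does not consume the original.
    def twoWindows(messages: DStream[(String, String)]) = {
      val windowMessages1 = messages.window(Seconds(30), Seconds(10))
      val windowMessages2 = messages.window(Seconds(60), Seconds(10))
      (windowMessages1, windowMessages2)
    }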

DStream start a separate DStream

2014-08-21 Thread Josh J
DStream. How can I accomplish this with spark? Sincerely, Josh

countByWindow save the count ?

2014-08-22 Thread Josh J
Hi, Hopefully a simple question. Is there an example of where to save the output of countByWindow? I would like to save the results to external storage (Kafka or Redis). The examples show only stream.print(). Thanks, Josh

Re: countByWindow save the count ?

2014-08-26 Thread Josh J
of countByWindow with a function that performs the save operation. On Fri, Aug 22, 2014 at 1:58 AM, Josh J joshjd...@gmail.com wrote: Hi, Hopefully a simple question. Is there an example of where to save the output of countByWindow? I would like to save the results to external storage (kafka
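
A sketch of that suggestion: countByWindow produces a DStream[Long], and foreachRDD is where side effects such as saving belong. The store call below is a placeholder, and the inverse reduce inside countByWindow requires checkpointing to be enabled:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Placeholder for the real Kafka/Redis write.
    def writeToStore(count: Long): Unit = println(count)

    // countByWindow emits one Long per window; foreachRDD runs an arbitrary
    // action per batch, which is where the save belongs.
    def saveCounts(lines: DStream[String]): Unit =
      lines.countByWindow(Seconds(30), Seconds(10)).foreachRDD { rdd =>
        rdd.collect().foreach(writeToStore)
      }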

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Josh Rosen
If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh. On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess

Re: pyspark on python 3

2014-10-03 Thread Josh Rosen
/2144 - Josh On Fri, Oct 3, 2014 at 6:44 PM, tomo cocoa cocoatom...@gmail.com wrote: Hi, I prefer that PySpark can also be executed on Python 3. Do you have some reason or demand to use PySpark through Python3? If you create an issue on JIRA, I would try to resolve it. On 4 October

Re: small bug in pyspark

2014-10-12 Thread Josh Rosen
drivers and workers. - Josh On Fri, Oct 10, 2014 at 5:24 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I am running spark on an ec2 cluster. I need to update python to 2.7. I have been following the directions on http://nbviewer.ipython.org/gist/JoshRosen/6856670 https

Re: What if I port Spark from TCP/IP to RDMA?

2014-10-12 Thread Josh Rosen
Hi Theo, Check out *spark-perf*, a suite of performance benchmarks for Spark: https://github.com/databricks/spark-perf. - Josh On Fri, Oct 10, 2014 at 7:27 PM, Theodore Si sjyz...@gmail.com wrote: Hi, Let's say that I managed to port Spark from TCP/IP to RDMA. What tool or benchmark can I

streaming join sliding windows

2014-10-22 Thread Josh J
Hi, How can I join neighbor sliding windows in spark streaming? Thanks, Josh

docker spark 1.1.0 cluster

2014-10-24 Thread Josh J
Hi, Are there Dockerfiles available which allow setting up a Docker Spark 1.1.0 cluster? Thanks, Josh

exact count using rdd.count()?

2014-10-27 Thread Josh J
than once in the event of a worker failure. http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node Thanks, Josh

combine rdds?

2014-10-27 Thread Josh J
Hi, How could I combine rdds? I would like to combine two RDDs if the count in an RDD is not above some threshold. Thanks, Josh
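
A sketch of the threshold check described above:

    import org.apache.spark.rdd.RDD

    // Combine b into a only when a's record count is below the threshold.
    // count() launches a job, so cache a first if it is reused afterwards.
    def combineIfSmall(a: RDD[String], b: RDD[String], threshold: Long): RDD[String] =
      if (a.count() < threshold) a.union(b) else a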

run multiple spark applications in parallel

2014-10-28 Thread Josh J
Hi, How do I run multiple spark applications in parallel? I tried to run on yarn cluster, though the second application submitted does not run. Thanks, Josh

Re: run multiple spark applications in parallel

2014-10-28 Thread Josh J
, Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz 32 GB RAM Thanks, Josh On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Try reducing the resources (cores and memory) of each application. On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote: Hi, How

random shuffle streaming RDDs?

2014-11-03 Thread Josh J
Hi, Is there a nice or optimal method to randomly shuffle spark streaming RDDs? Thanks, Josh

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
? in general RDDs don't have ordering at all -- excepting when you sort for example -- so a permutation doesn't make sense. Do you just want a well-defined but random ordering of the data? Do you just want to (re-)assign elements randomly to partitions? On Mon, Nov 3, 2014 at 4:33 PM, Josh J joshjd

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
is guaranteed about that. If you want to permute an RDD, how about a sortBy() on a good hash function of each value plus some salt? (Haven't thought this through much but sounds about right.) On Mon, Nov 3, 2014 at 4:59 PM, Josh J joshjd...@gmail.com wrote: When I'm outputting the RDDs
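
A sketch of that suggestion, using a salted MurmurHash as the sort key:

    import scala.util.hashing.MurmurHash3

    import org.apache.spark.rdd.RDD

    // Sorting by a salted hash of each element yields a repeatable
    // pseudo-random permutation; vary the salt for a different order.
    def pseudoShuffle(rdd: RDD[String], salt: Int): RDD[String] =
      rdd.sortBy(x => MurmurHash3.stringHash(x, salt))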

Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
: Ordering[K], implicit ctag: scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String]. Unspecified value parameter f. On Tue, Nov 4, 2014 at 11:28 AM, Josh J joshjd...@gmail.com wrote: Hi, Does anyone have any good examples of using sortby for RDDs and scala? I'm receiving not enough
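
The "Unspecified value parameter f" error means sortBy was called without its key function. A minimal example, assuming a spark-shell `sc`:

    // sortBy's first parameter f (the key function) is required;
    // the Ordering and ClassTag parameters are filled in implicitly.
    val fruit = sc.parallelize(Seq("banana", "apple", "cherry"))
    val alphabetical = fruit.sortBy(s => s)
    val byLength = fruit.sortBy(_.length, ascending = false)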

convert List<String> to DStream

2014-11-10 Thread Josh J
) and found : java.util.LinkedList[org.apache.spark.rdd.RDD[String]] required: scala.collection.mutable.Queue[org.apache.spark.rdd.RDD[?]] Thanks, Josh
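
The type error comes from passing a java.util.LinkedList where queueStream expects a Scala mutable Queue; a sketch of the conversion:

    import scala.collection.mutable.Queue

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    // queueStream requires a scala.collection.mutable.Queue of RDDs,
    // not a java.util.LinkedList.
    def listToDStream(ssc: StreamingContext, data: List[String]): DStream[String] = {
      val queue = Queue[RDD[String]]()
      queue += ssc.sparkContext.makeRDD(data)
      ssc.queueStream(queue)
    }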

concat two DStreams

2014-11-11 Thread Josh J
Hi, Is it possible to concatenate or append two DStreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined DStream. Thanks, Josh

Re: concat two DStreams

2014-11-11 Thread Josh J
I think it's just called union. On Tue, Nov 11, 2014 at 2:41 PM, Josh J joshjd...@gmail.com wrote: Hi, Is it possible to concatenate or append two DStreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined
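
A minimal sketch of that:

    import org.apache.spark.streaming.dstream.DStream

    // union interleaves the batches of both streams; element types must match.
    def concat(incoming: DStream[String], generated: DStream[String]): DStream[String] =
      incoming.union(generated)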

Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Hi, I was wondering if adaptive stream processing and dynamic batch sizing are available to use in Spark Streaming? Could someone help point me in the right direction? Thanks, Josh

Re: Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Referring to this paper http://dl.acm.org/citation.cfm?id=2670995. On Fri, Nov 14, 2014 at 10:42 AM, Josh J joshjd...@gmail.com wrote: Hi, I was wondering if adaptive stream processing and dynamic batch sizing are available to use in Spark Streaming? Could someone help point me

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-21 Thread Josh Mahonin
: https://github.com/simplymeasured/phoenix-spark Josh On Fri, Nov 21, 2014 at 4:14 PM, Alaa Ali contact.a...@gmail.com wrote: I want to run queries on Apache Phoenix which has a JDBC driver. The query that I want to run is: select ts,ename from random_data_date limit 10 But I'm having issues

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-24 Thread Josh Mahonin
also do a lot more with it than just the Phoenix functions provide. I don't know if this works with PySpark or not, but assuming the 'newHadoopRDD' functionality works for other input formats, it should work for Phoenix as well. Josh On Fri, Nov 21, 2014 at 5:12 PM, Alaa Ali contact.a...@gmail.com

kafka pipeline exactly once semantics

2014-11-30 Thread Josh J
can maintain exactly once semantics when writing to topic 2? Thanks, Josh

Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread Josh J
Is there a way to do this that preserves exactly once semantics for the write to Kafka? On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith secs...@gmail.com wrote: I'd be interested in finding the answer too. Right now, I do: val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))

Re: java.io.InvalidClassException: org.apache.spark.api.java.JavaUtils$SerializableMapWrapper; no valid constructor

2014-12-01 Thread Josh Rosen
SerializableMapWrapper was added in https://issues.apache.org/jira/browse/SPARK-3926; do you mind opening a new JIRA and linking it to that one? On Mon, Dec 1, 2014 at 12:17 AM, lokeshkumar lok...@dataken.net wrote: The workaround was to wrap the map returned by spark libraries into HashMap

Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-13 Thread Josh Rosen
by the mailing list. I wanted to mention this issue to the Spark community to see whether there are any good solutions to address this. I have spoken to users who think that our mailing list is unresponsive / inactive because their un-posted messages haven't received any replies. - Josh

Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-17 Thread Josh Rosen
will be sent to both spark.incubator.apache.org and spark.apache.org (if that is the case, I'm not sure which alias Nabble posts get sent to) would make things a lot more clear. On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen rosenvi...@gmail.com wrote: I've noticed that several users are attempting to post

Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread Josh Rosen
a bit of additional context in the meantime. - Josh On Thu, Dec 25, 2014 at 5:36 PM, Tobias Pfeiffer t...@preferred.jp wrote: Nick, uh, I would have expected a rather heated discussion, but the opposite seems to be the case ;-) Independent of my personal preferences w.r.t. usability, habits etc

Re: action progress in ipython notebook?

2014-12-27 Thread Josh Rosen
/ stage / task progress information, as well as expanding the types of information exposed through the stable status API interface. - Josh On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman eric.d.fried...@gmail.com wrote: Spark 1.2.0 is SO much more usable than previous releases -- many thanks

sample is not a member of org.apache.spark.streaming.dstream.DStream

2014-12-28 Thread Josh J
).map(_._2) streamtoread.sample(withReplacement = true, fraction = fraction) How do I use the sample() method (http://spark.apache.org/docs/latest/programming-guide.html#transformations) with Spark Streaming? Thanks, Josh
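
DStream has no sample() method; transform() applies an RDD-to-RDD function to each batch, so it can wrap the sampling. A sketch:

    import org.apache.spark.streaming.dstream.DStream

    // sample() is defined on RDD, not DStream; transform() lets you call
    // the RDD method on every batch.
    def sampleStream(stream: DStream[String], fraction: Double): DStream[String] =
      stream.transform(rdd => rdd.sample(withReplacement = true, fraction = fraction))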

Re: action progress in ipython notebook?

2014-12-29 Thread Josh Rosen
Josh Is there documentation available for status API? I would like to use it. Thanks, Aniket On Sun Dec 28 2014 at 02:37:32 Josh Rosen rosenvi...@gmail.com wrote: The console progress bars are implemented on top of a new stable status API that was added in Spark 1.2. It's possible

Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Josh Rosen
Hi Sven, Do you have a small example program that you can share which will allow me to reproduce this issue? If you have a workload that runs into this, you should be able to keep iteratively simplifying the job and reducing the data set size until you hit a fairly minimal reproduction (assuming

Re: SparkContext with error from PySpark

2014-12-30 Thread Josh Rosen
To configure the Python executable used by PySpark, see the Using the Shell Python section in the Spark Programming Guide: https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell You can set the PYSPARK_PYTHON environment variable to choose the Python executable that will be

Re: NullPointerException

2014-12-31 Thread Josh Rosen
Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I get this following Exception when I submit spark application that calculates the frequency of characters in a file. Especially, when I increase the size of data, I

Re: NullPointerException

2014-12-31 Thread Josh Rosen
:04 PM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I get this following Exception when I submit spark application that calculates the frequency of characters in a file

Re: DAG info

2015-01-01 Thread Josh Rosen
This log message is normal; in this case, this message is saying that the final stage needed to compute your job does not have any dependencies / parent stages and that there are no parent stages that need to be computed. On Thu, Jan 1, 2015 at 11:02 PM, shahid sha...@trialx.com wrote: hi guys

Re: spark.akka.frameSize limit error

2015-01-03 Thread Josh Rosen
Which version of Spark are you using? It seems like the issue here is that the map output statuses are too large to fit in the Akka frame size. This issue has been fixed in Spark 1.2 by using a different encoding for map outputs for jobs with many reducers (

Re: spark-shell has syntax error on windows.

2015-01-23 Thread Josh Rosen
Do you mind filing a JIRA issue for this which includes the actual error message string that you saw? https://issues.apache.org/jira/browse/SPARK On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I am not sure if you get the same exception as I do --

Re: how to run python app in yarn?

2015-01-14 Thread Josh Rosen
There's an open PR for supporting yarn-cluster mode in PySpark: https://github.com/apache/spark/pull/3976 (currently blocked on reviewer attention / time) On Wed, Jan 14, 2015 at 3:16 PM, Marcelo Vanzin van...@cloudera.com wrote: As the error message says... On Wed, Jan 14, 2015 at 3:14 PM,

Re: Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.

2015-01-18 Thread Josh Rosen
This looks like a bug in the master branch of Spark, related to some recent changes to EventLoggingListener. You can reproduce this bug on a fresh Spark checkout by running ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/nonexistent-dir where

Re: dockerized spark executor on mesos?

2015-01-14 Thread Josh J
We have dockerized Spark Master and worker(s) separately and are using it in our dev environment. Is this setup available on github or dockerhub? On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian vsubr...@gmail.com wrote: We have dockerized Spark Master and worker(s) separately and are

measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Josh J
Hi, I have a stream pipeline which invokes map, reduceByKey, filter, and flatMap. How can I measure the time taken in each stage? Thanks, Josh

Re: performance of saveAsTextFile moving files from _temporary

2015-01-27 Thread Josh Walton
I'm not sure how to confirm how the moving is happening, however, one of the jobs just completed that I was talking about with 9k files of 4mb each. Spark UI showed the job being complete after ~2 hours. The last four hours of the job was just moving the files from _temporary to their final

Re: Mesos resource allocation

2015-01-05 Thread Josh Devins
thoughts and actually very curious about how others are running Spark on Mesos with large heaps (as a result of large memory machines). Perhaps this is a non-issue when we have more multi-tenancy in the cluster, but for now, this is not the case. Thanks, Josh On 24 December 2014 at 06:22, Tim Chen

train many decision tress with a single spark job

2015-01-10 Thread Josh Buffum
I've got a data set of activity by user. For each user, I'd like to train a decision tree model. I currently have the feature creation step implemented in Spark and would naturally like to use mllib's decision tree model. However, it looks like the decision tree model expects the whole RDD and

Re: train many decision tress with a single spark job

2015-01-12 Thread Josh Buffum
(data) but just to deal with it on whatever spark worker is handling kvp? Does that question make sense? Thanks! Josh On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen so...@cloudera.com wrote: You just mean you want to divide the data set into N subsets, and do that dividing by user, not make one

Re: train many decision tress with a single spark job

2015-01-12 Thread Josh Buffum
are using RDDs inside RDDs. But I am also not sure you should do what it looks like you are trying to do. On Jan 13, 2015 12:32 AM, Josh Buffum jbuf...@gmail.com wrote: Sean, Thanks for the response. Is there some subtle difference between one model partitioned by N users or N models per each 1 user
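
A sketch of the partition-by-user idea from this thread; trainLocal is a hypothetical single-machine learner, since MLlib's DecisionTree expects a whole RDD:

    import org.apache.spark.rdd.RDD

    // Group each user's feature vectors and fit a single-machine model
    // inside the task, avoiding nested RDDs entirely.
    def trainPerUser[M](data: RDD[(String, Array[Double])])
                       (trainLocal: Seq[Array[Double]] => M): RDD[(String, M)] =
      data.groupByKey().mapValues(samples => trainLocal(samples.toSeq))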

spark standalone master with workers on two nodes

2015-01-13 Thread Josh J
Hi, I'm trying to run Spark Streaming standalone on two nodes. I'm able to run on a single node fine. I start both workers and it registers in the Spark UI. However, the application says SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 Any ideas? Thanks, Josh

Re: Shuffle Problems in 1.2.0

2015-01-04 Thread Josh Rosen
hard to say from this error trace alone. On December 30, 2014 at 5:17:08 PM, Sven Krasser (kras...@gmail.com) wrote: Hey Josh, I am still trying to prune this to a minimal example, but it has been tricky since scale seems to be a factor. The job runs over ~720GB of data (the cluster's total RAM

Re: spark.akka.frameSize limit error

2015-01-04 Thread Josh Rosen
fix. In the meantime, I recommend that you increase your Akka frame size. On Sat, Jan 3, 2015 at 8:51 PM, Saeed Shahrivari saeed.shahriv...@gmail.com wrote: I use the 1.2 version. On Sun, Jan 4, 2015 at 3:01 AM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using

Re: Repartition Memory Leak

2015-01-04 Thread Josh Rosen
@Brad, I'm guessing that the additional memory usage is coming from the shuffle performed by coalesce, so that at least explains the memory blowup. On Sun, Jan 4, 2015 at 10:16 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can try: - Using KryoSerializer - Enabling RDD Compression -
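
For reference, a sketch of the two coalesce modes:

    import org.apache.spark.rdd.RDD

    // coalesce(n) merges partitions locally without a shuffle; with
    // shuffle = true it redistributes data like repartition, which is what
    // costs the extra memory.
    def shrink(rdd: RDD[String], n: Int): (RDD[String], RDD[String]) =
      (rdd.coalesce(n), rdd.coalesce(n, shuffle = true))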

Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread Josh Rosen
Can you please file a JIRA issue for this? This will make it easier to triage this issue. https://issues.apache.org/jira/browse/SPARK Thanks, Josh On Thu, Jan 8, 2015 at 2:34 AM, frodo777 roberto.vaquer...@bitmonlab.com wrote: Hello everyone. With respect to the configuration problem

Re: Streaming scheduling delay

2015-03-01 Thread Josh J
On Fri, Feb 13, 2015 at 2:21 AM, Gerard Maas gerard.m...@gmail.com wrote: KafkaOutputServicePool Could you please give an example code of how KafkaOutputServicePool would look like? When I tried object pooling I end up with various not serializable exceptions. Thanks! Josh
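
One widely used way around those serialization errors is a lazily initialized producer per executor JVM instead of a pooled instance captured in the closure; a sketch with assumed broker settings:

    import java.util.Properties

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    // A lazy singleton: the producer is created on each executor JVM the
    // first time it is used, so nothing non-serializable travels in the
    // closure.
    object ProducerHolder {
      lazy val producer: KafkaProducer[String, String] = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        new KafkaProducer[String, String](props)
      }
    }

    def writeToKafka(stream: DStream[String], topic: String): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          partition.foreach(msg => ProducerHolder.producer.send(new ProducerRecord(topic, msg)))
        }
      }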

Re: throughput in the web console?

2015-02-25 Thread Josh J
at 10:29 PM, Josh J joshjd...@gmail.com wrote: Hi, I plan to run a parameter search varying the number of cores, epoch, and parallelism. The web console provides a way to archive the previous runs, though is there a way to view in the console the throughput? Rather than logging

Re: throughput in the web console?

2015-02-25 Thread Josh J
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com wrote: For SparkStreaming applications, there is already a tab called Streaming which displays the basic statistics. Would I just need to extend this tab to add the throughput?

Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Josh Rosen
We (Databricks) use our own DirectOutputCommitter implementation, which is a couple tens of lines of Scala code. The class would almost entirely be a no-op except we took some care to properly handle the _SUCCESS file. On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim m...@palantir.com wrote: I
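
That class isn't public, but a minimal no-op committer along those lines might look like the sketch below (the _SUCCESS handling mentioned above is omitted):

    import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

    // A no-op committer: tasks write straight to the final location, so there
    // is no _temporary directory and no slow S3 rename at job completion.
    class DirectOutputCommitter extends OutputCommitter {
      override def setupJob(jobContext: JobContext): Unit = ()
      override def setupTask(taskContext: TaskAttemptContext): Unit = ()
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = ()
      override def abortTask(taskContext: TaskAttemptContext): Unit = ()
    }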

throughput in the web console?

2015-02-24 Thread Josh J
the log files to the web console processing times? Thanks, Josh

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com wrote: +dev On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote:

Re: Does long-lived SparkContext hold on to executor resources?

2015-05-12 Thread Josh Rosen
I would be cautious regarding use of spark.cleaner.ttl, as it can lead to confusing error messages if time-based cleaning deletes resources that are still needed. See my comment at

Python 3 support for PySpark has been merged into master

2015-04-16 Thread Josh Rosen
the PySpark unit tests locally to make sure that the change still works correctly in older branches. I can also help with backports / fixing conflicts. Thanks to Davies Liu, Shane Knapp, Thom Neale, Xiangrui Meng, and everyone else who helped with this patch. - Josh

Re: A problem with Spark 1.3 artifacts

2015-04-06 Thread Josh Rosen
to continue debugging this issue, I think we should move this discussion over to JIRA so it's easier to track and reference. Hope this helps, Josh On Thu, Apr 2, 2015 at 7:34 AM, Jacek Lewandowski jacek.lewandow...@datastax.com wrote: A very simple example which works well with Spark 1.2

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-09 Thread Josh Mahonin
suspect that keeping all of the spark and phoenix dependencies marked as 'provided', and including the Phoenix client JAR in the Spark classpath would work as well. Good luck, Josh On Tue, Jun 9, 2015 at 4:40 AM, Jeroen Vlek j.v...@anchormen.nl wrote: Hi, I posted a question with regards

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Josh Mahonin
Josh On Wed, Jun 10, 2015 at 4:11 AM, Jeroen Vlek j.v...@anchormen.nl wrote: Hi Josh, Thank you for your effort. Looking at your code, I feel that mine is semantically the same, except written in Java. The dependencies in the pom.xml all have the scope provided. The job is submitted

Re: Fully in-memory shuffles

2015-06-10 Thread Josh Rosen
There's a discussion of this at https://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote: Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Josh Rosen
Mind filing a JIRA? On Tue, Jun 23, 2015 at 9:34 AM, Koert Kuipers ko...@tresata.com wrote: just a heads up, i was doing some basic coding using DataFrame, Row, StructType, etc. and i ended up with deadlocks in my sbt tests due to the usage of ScalaReflectionLock.synchronized in the spark

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to read chunk

2015-06-25 Thread Josh Rosen
Which Spark version are you using? AFAIK the corruption bugs in sort-based shuffle should have been fixed in newer Spark releases. On Wed, Jun 24, 2015 at 12:25 PM, Piero Cinquegrana pcinquegr...@marketshare.com wrote: Switching spark.shuffle.manager from sort to hash fixed this issue as

Re: Serializer not switching

2015-06-22 Thread Josh Rosen
My hunch is that you changed spark.serializer to Kryo but left spark.closureSerializer unmodified, so it's still using Java for closure serialization. Kryo doesn't really work as a closure serializer but there's an open pull request to fix this: https://github.com/apache/spark/pull/6361 On Mon,
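
For reference, the two settings are independent; a sketch that switches only the data serializer:

    import org.apache.spark.SparkConf

    // spark.serializer controls data serialization only; closures are
    // governed by the separate spark.closureSerializer setting, which still
    // defaults to Java.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")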

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-11 Thread Josh Mahonin
with a GMane link to the thread? Good luck, Josh On Thu, Jun 11, 2015 at 2:38 AM, Jeroen Vlek j.v...@anchormen.nl wrote: Hi Josh, That worked! Thank you so much! (I can't believe it was something so obvious ;) ) If you care about such a thing you could answer my question here for bounty

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-13 Thread Josh Rosen
-- From: Josh Rosen rosenvi...@gmail.com To: Sanjay Subramanian sanjaysubraman...@yahoo.com Cc: user@spark.apache.org Sent: Friday, June 12, 2015 7:15 AM Subject: Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Josh Rosen
If your job is dying due to out of memory errors in the post-shuffle stage, I'd consider the following approach for implementing de-duplication / distinct(): - Use sortByKey() to perform a full sort of your dataset. - Use mapPartitions() to iterate through each partition of the sorted dataset,
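
A sketch of that recipe: sortByKey range-partitions the data, so equal keys end up adjacent in the same partition, and mapPartitions can then stream each partition, dropping consecutive duplicates without building large hash sets:

    import org.apache.spark.rdd.RDD

    def sortedDistinct(rdd: RDD[String]): RDD[String] =
      rdd.map(x => (x, ()))
        .sortByKey() // full sort: equal keys are adjacent and co-partitioned
        .mapPartitions { iter =>
          var prev: String = null
          var first = true
          iter.filter { case (k, _) =>
            // keep only the first element of every run of duplicates
            val keep = first || k != prev
            prev = k
            first = false
            keep
          }
        }
        .map(_._1)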

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen
Sent from my phone On Jun 11, 2015, at 8:43 AM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey guys Using Hive and Impala daily intensively. Want to transition to spark-sql in CLI mode Currently in my sandbox I am using the Spark (standalone mode) in the CDH

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen
It sounds like this might be caused by a memory configuration problem. In addition to looking at the executor memory, I'd also bump up the driver memory, since it appears that your shell is running out of memory when collecting a large query result. Sent from my phone On Jun 11, 2015, at

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Josh Rosen
...@gmail.com wrote: Hi We are using spark 1.3.1 Avro-chill (tomorrow will check if its important) we register avro classes from java Avro 1.7.6 On May 31, 2015 22:37, Josh Rosen rosenvi...@gmail.com wrote: Which Spark version are you using? I'd like to understand whether this change could

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Josh Rosen
If you can't run a patched Spark version, then you could also consider using LZF compression instead, since that codec isn't affected by this bug. On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or and...@databricks.com wrote: Hi Deepak, This is a notorious bug that is being tracked at

Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Josh Rosen
enough to split data into disk. We will work on it to understand and reproduce the problem(not first priority though...) On 1 June 2015 at 23:02, Josh Rosen rosenvi...@gmail.com wrote: How much work is to produce a small standalone reproduction? Can you create an Avro file with some mock

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-02 Thread Josh Rosen
My suggestion is that you change the Spark setting which controls the compression codec that Spark uses for internal data transfers. Set spark.io.compression.codec to lzf in your SparkConf. On Mon, Jun 1, 2015 at 8:46 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Hello Josh, Are you suggesting
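
A sketch of that setting:

    import org.apache.spark.SparkConf

    // Switch internal transfer compression from the default snappy codec to
    // LZF, which is not affected by the FAILED_TO_UNCOMPRESS bug.
    val conf = new SparkConf().set("spark.io.compression.codec", "lzf")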

Re: Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread Josh Rosen
I don't think that 0.9.3 has been released, so I'm assuming that you're running on branch-0.9. There have been over 4,000 commits between 0.9.3 and 1.3.1, so I'm afraid that this question doesn't have a concise answer: https://github.com/apache/spark/compare/branch-0.9...v1.3.1 To narrow down the

Re: Exception in spark

2015-08-11 Thread Josh Rosen
Can you share a query or stack trace? More information would make this question easier to answer. On Tue, Aug 11, 2015 at 8:50 PM, Ravisankar Mani rrav...@gmail.com wrote: Hi all, We got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to

Re: master compile broken for scala 2.11

2015-07-14 Thread Josh Rosen
I've opened a PR to fix this; please take a look: https://github.com/apache/spark/pull/7405 On Tue, Jul 14, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote: it works for scala 2.10, but for 2.11 i get: [ERROR]

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Josh Rosen
Hi Jerry, Do you have speculation enabled? A write which produces one million files / output partitions might be using tons of driver memory via the OutputCommitCoordinator's bookkeeping data structures. On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam wrote: > Hi spark guys, >

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Josh Rosen
Hi Sjoerd, Did your job actually *fail* or did it just generate many spurious exceptions? While the stacktrace that you posted does indicate a bug, I don't think that it should have stopped query execution because Spark should have fallen back to an interpreted code path (note the "Failed to

Re: java.util.NoSuchElementException: key not found error

2015-10-21 Thread Josh Rosen
This is https://issues.apache.org/jira/browse/SPARK-10422, which has been fixed in Spark 1.5.1. On Wed, Oct 21, 2015 at 4:40 PM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > In 1.5.0 if I use randomSplit on a data frame I get this error. > > Here is teh code snippet - > > val

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Josh Rosen
When we remove this, we should add a style-checker rule to ban the import so that it doesn't get added back by accident. On Mon, Nov 9, 2015 at 6:13 PM, Michael Armbrust wrote: > Yeah, we should probably remove that. > > On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu

Re: out of memory error with Parquet

2015-11-13 Thread Josh Rosen
Tip: jump straight to 1.5.2; it has some key bug fixes. Sent from my phone > On Nov 13, 2015, at 10:02 PM, AlexG wrote: > > Never mind; when I switched to Spark 1.5.0, my code works as written and is > pretty fast! Looking at some Parquet related Spark jiras, it seems that
