unsubscribe

2023-07-19 Thread Josh Patterson
unsubscribe

Error Message Suggestion

2021-03-29 Thread Josh Herzberg
, the columns returned can be difficult to identify. The error would be more helpful and clear if the columns returned were included in the error message like so, [image: image.png] Happy to help make this happen if I can. Thanks! Josh

[Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Josh Goldsborough
sked a couple years ago: https://stackoverflow.com/questions/39790820/mixed-effects-models-in-spark-or-other-technology But I wanted to ask again, in case anyone had an answer now. Thanks, Josh Goldsborough

Re: Rest API for Spark 2.3 submit on Kubernetes (version 1.8.*) cluster

2018-03-21 Thread Josh Goldsborough
Purna, It's a bit tangential to your original question, but heads up that Amazon EKS is in Preview right now: https://aws.amazon.com/eks/ I don't know if it actually allows a nice interface between k8s-hosted Spark & Lambda functions (my suspicion is it won't fix your problem), but might be

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Josh Rosen
My current best guess is that Spark does *not* fully support Hadoop 3.x because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for Hadoop 3.x) has not been resolved. There are also likely to be transitive dependency conflicts which will need to be resolved. On Mon, Jan

Re: [Spark Core] unhashable type: 'dict' during shuffle step

2017-07-18 Thread Josh Holbrook
not keeping my hopes up... Thanks, --Josh On Tue, Jul 18, 2017 at 3:17 PM, Josh Holbrook <josh.holbr...@fusion.net> wrote: > Hello! > > I'm running into a very strange issue with pretty much no hits on the > internet, and I'm hoping someone here can give me some protips! At

[Spark Core] unhashable type: 'dict' during shuffle step

2017-07-18 Thread Josh Holbrook
f it seems unrelated). They also pull from the same dataset. This is the last job I have to port over before we can sunset the old jobs and I'm at my wits' end, so any suggestions are highly appreciated! Thanks, --Josh

Re: Running Spark und YARN on AWS EMR

2017-07-17 Thread Josh Holbrook
0 partitions. This is pretty low, so you'll likely want to adjust this--I'm currently using the following because spark chokes on datasets that are bigger than about 2g per partition: { "Classification": "spark-defaults", "Properties": { "spark.d
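The EMR classification JSON in the snippet above is cut off. As a rough sketch of the same idea expressed through SparkConf in Scala (the key names are standard Spark settings; the values are purely illustrative), the goal is simply to keep the partition count high enough that no single partition approaches the ~2 GB limit:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: raise default parallelism so partitions stay well
    // below ~2 GB each; pick values based on your actual data volume.
    val conf = new SparkConf()
      .set("spark.default.parallelism", "2000")
      .set("spark.sql.shuffle.partitions", "2000")
    val sc = new SparkContext(conf)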

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Josh Rosen
Spark SQL / Tungsten's explicitly-managed off-heap memory will be capped at spark.memory.offHeap.size bytes. This is purposely specified as an absolute size rather than as a percentage of the heap size in order to allow end users to tune Spark so that its overall memory consumption stays within
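A minimal sketch of how that absolute cap is configured (values illustrative; off-heap allocation also has to be enabled explicitly):

    import org.apache.spark.SparkConf

    // spark.memory.offHeap.size is an absolute byte count, not a fraction of the heap.
    val conf = new SparkConf()
      .set("spark.memory.offHeap.enabled", "true")
      .set("spark.memory.offHeap.size", (4L * 1024 * 1024 * 1024).toString) // 4 GB, illustrative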

Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-10 Thread Josh Rosen
planning to add more tests to that patch). On Fri, Sep 9, 2016 at 10:37 AM Josh Rosen <joshro...@databricks.com> wrote: > cache() / persist() is definitely *not* supposed to affect the result of > a program, so the behavior that you're seeing is unexpected. > > I'll try to rep

Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Josh Rosen
cache() / persist() is definitely *not* supposed to affect the result of a program, so the behavior that you're seeing is unexpected. I'll try to reproduce this myself by caching in PySpark under heavy memory pressure, but in the meantime the following questions will help me to debug: - Does
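For reference, the two storage levels under discussion look like this (shown in Scala here to match the other sketches; `sc` is assumed to be an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    val data = sc.parallelize(1 to 1000000)
    // MEMORY_ONLY drops partitions that don't fit and recomputes them on demand;
    // MEMORY_AND_DISK spills them to disk instead. Neither should change results.
    data.persist(StorageLevel.MEMORY_AND_DISK)   // or StorageLevel.MEMORY_ONLY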

Re: Execute function once on each node

2016-07-19 Thread Josh Asplund
tresata.com> wrote: > >> If you run hdfs on those ssds (with low replication factor) wouldn't it >> also effectively write to local disk with low latency? >> >> On Jul 18, 2016 21:54, "Josh Asplund" <joshaspl...@gmail.com> wrote: >> >> The s

Re: Execute function once on each node

2016-07-19 Thread Josh Asplund
are in right ip/hostname (or fail) and read the content of > the file. > > Not a 100% sure it will work though. > > On Tue, Jul 19, 2016, 2:54 AM Josh Asplund <joshaspl...@gmail.com> wrote: > >> The spark workers are running side-by-side with scientific simulation >> co

Re: Execute function once on each node

2016-07-18 Thread Josh Asplund
The spark workers are running side-by-side with scientific simulation code. The code writes output to local SSDs to keep latency low. Due to the volume of data being moved (10's of terabytes +), it isn't really feasible to copy the data to a global filesystem. Executing a function on each node

Re: A number of issues when running spark-ec2

2016-04-16 Thread Josh Rosen
Using a different machine / toolchain, I've downloaded and re-uploaded all of the 1.6.1 artifacts to that S3 bucket, so hopefully everything should be working now. Let me know if you still encounter any problems with unarchiving. On Sat, Apr 16, 2016 at 3:10 PM Ted Yu wrote:

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Josh Rosen
AFAIK this is not being pushed down because it involves an implicit cast and we currently don't push casts into data sources or scans; see https://github.com/databricks/spark-redshift/issues/155 for a possibly-related discussion. On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh

Re: [HELP:]Save Spark Dataframe in Phoenix Table

2016-04-08 Thread Josh Mahonin
-check that your Spark configuration is setup with the right worker/driver classpath settings. and that the phoenix JARs contain the necessary phoenix-spark classes (e.g. org.apache.phoenix.spark.PhoenixRelation). If not, I suggest following up with Hortonworks. Josh On Fri, Apr 8, 2016 at 1:22 AM

Re: Kryo serialization mismatch in spark sql windowing function

2016-04-06 Thread Josh Rosen
Spark is compiled against a custom fork of Hive 1.2.1 which added shading of Protobuf and removed shading of Kryo. What I think that what's happening here is that stock Hive 1.2.1 is taking precedence so the Kryo instance that it's returning is an instance of shaded/relocated Hive version rather

Re: Spark master keeps running out of RAM

2016-03-31 Thread Josh Rosen
One possible cause of a standalone master OOMing is https://issues.apache.org/jira/browse/SPARK-6270. In 2.x, this will be fixed by https://issues.apache.org/jira/browse/SPARK-12299. In 1.x, one mitigation is to disable event logging. Another workaround would be to produce a patch which disables
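A minimal sketch of the event-logging mitigation mentioned above (the master OOM in SPARK-6270 comes from rebuilding application UIs from event logs):

    import org.apache.spark.SparkConf

    // Workaround, not a fix: stop writing event logs so the standalone master
    // has nothing to replay when reconstructing finished-application UIs.
    val conf = new SparkConf().set("spark.eventLog.enabled", "false")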

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
See the instructions in the Spark documentation: https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna wrote: > > > Hi, > > Scala version:2.11.7(had to upgrade the scala verison to enable case

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
Err, whoops, looks like this is a user app and not building Spark itself, so you'll have to change your deps to use the 2.11 versions of Spark. e.g. spark-streaming_2.10 -> spark-streaming_2.11. On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen <joshro...@databricks.com> wrote: > See the
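A build.sbt sketch of the dependency change being described (version numbers are illustrative):

    scalaVersion := "2.11.7"

    libraryDependencies ++= Seq(
      // explicit Scala suffix, as in the spark-streaming_2.10 -> _2.11 rename above
      "org.apache.spark" % "spark-streaming_2.11" % "1.6.1" % "provided",
      // or let %% append the suffix from scalaVersion automatically
      "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
    )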

Re: Python unit tests - Unable to run it with Python 2.6 or 2.7

2016-03-11 Thread Josh Rosen
AFAIK we haven't actually broken 2.6 compatibility yet for PySpark itself, since Jenkins is still testing that configuration. I think the problem that you're seeing is that dev/run-tests / dev/run-tests-jenkins only work against Python 2.7+ right now. However, ./python/run-tests should be able to

Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Josh Rosen
that the only reason it was a DeveloperAPI was Shark, but I'd like to confirm this by asking the community. Thanks, Josh

Re: Unresolved dep when building project with spark 1.6

2016-02-29 Thread Josh Rosen
Have you tried removing the leveldbjni files from your local ivy cache? My hunch is that this is a problem with some local cache state rather than the dependency simply being unavailable / not existing (note that the error message was "origin location must be absolute:[...]", not that the files

Re: bug for large textfiles on windows

2016-01-25 Thread Josh Rosen
Hi Christopher, What would be super helpful here is a standalone reproduction. Ideally this would be a single Scala file or set of commands that I can run in `spark-shell` in order to reproduce this. Ideally, this code would generate a giant file, then try to read it in a way that demonstrates

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Josh Rosen
Is speculation enabled? This TaskCommitDenied by driver error is thrown by writers who lost the race to commit an output partition. I don't think this had anything to do with key skew etc. Replacing the groupbykey with a count will mask this exception because the coordination does not get

Re: libsnappyjava.so: failed to map segment from shared object

2016-01-11 Thread Josh Rosen
This is due to the snappy-java library; I think that you'll have to configure either java.io.tmpdir or org.xerial.snappy.tempdir; see https://github.com/xerial/snappy-java/blob/1198363176ad671d933fdaf0938b8b9e609c0d8a/src/main/java/org/xerial/snappy/SnappyLoader.java#L335 On Mon, Jan 11, 2016
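A sketch of how those properties might be passed through Spark (the temp-directory path is a placeholder; "failed to map segment" is typically caused by a noexec /tmp mount):

    import org.apache.spark.SparkConf

    // Point snappy-java at a temp directory that is not mounted noexec.
    // org.xerial.snappy.tempdir falls back to java.io.tmpdir when unset.
    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions",   "-Dorg.xerial.snappy.tempdir=/path/with/exec")
      .set("spark.executor.extraJavaOptions", "-Dorg.xerial.snappy.tempdir=/path/with/exec")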

Re: how garbage collection works on parallelize

2016-01-08 Thread Josh Rosen
It won't be GC'd as long as the RDD which results from `parallelize()` is kept around; that RDD keeps strong references to the parallelized collection's elements in order to enable fault-tolerance. On Fri, Jan 8, 2016 at 6:50 PM, jluan wrote: > Hi, > > I am curious about

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I imagine that they're also capable of installing a standalone Python alongside that Spark version (without changing Python systemwide). For instance, Anaconda/Miniconda make it really easy to install Python 2.7.x/3.x without

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
PATH. I understand that >> other administrators may not be so compliant. >> >> Saw a small bit about the java version in there; does Spark currently >> prefer Java 1.8.x? >> >> —Ken >> >> On Jan 5, 2016, at 6:08 PM, Josh Rosen <joshro...@databricks.com&

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
5, 2016 at 3:07 PM, Josh Rosen <joshro...@databricks.com> wrote: > Yep, the driver and executors need to have compatible Python versions. I > think that there are some bytecode-level incompatibilities between 2.6 and > 2.7 which would impact the deserialization of Python closures, s

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
;> even if python 2.7 was needed only on this one machine that launches >>>> the app we can not ship it with our software because its gpl licensed, so >>>> the client would have to download it and install it themselves, and this >>>> would mean its an i

Re: Application Detail UI change

2015-12-21 Thread Josh Rosen
In the script / environment which launches your Spark driver, try setting the SPARK_PUBLIC_DNS environment variable to point to a publicly-accessible hostname. See https://spark.apache.org/docs/latest/configuration.html#environment-variables for more details. This environment variable also

Re: fishing for help!

2015-12-21 Thread Josh Rosen
@Eran, are Server 1 and Server 2 both part of the same cluster / do they have similar positions in the network topology w.r.t. the Spark executors? If Server 1 had fast network access to the executors but Server 2 was across a WAN then I'd expect the job to run slower from Server 2 due to the

Re: About Spark On Hbase

2015-12-15 Thread Josh Mahonin
And as yet another option, there is https://phoenix.apache.org/phoenix_spark.html It however requires that you are also using Phoenix in conjunction with HBase. On Tue, Dec 15, 2015 at 4:16 PM, Ted Yu wrote: > There is also >

Re: Spark 1.3.1 - Does SparkContext in multi-threaded env require SparkEnv.set(env) anymore

2015-12-10 Thread Josh Rosen
Nope, you shouldn't have to do that anymore. As of https://github.com/apache/spark/pull/2624, which is in Spark 1.2.0+, SparkEnv's thread-local stuff was removed and replaced by a simple global variable (since it was used in an *effectively* global way before (see my comments on that PR)). As a

Re: Spark UI - Streaming Tab

2015-12-04 Thread Josh Rosen
The Streaming tab is only supported in the live UI, not in the History Server. On Fri, Dec 4, 2015 at 9:31 AM, patcharee wrote: > I ran streaming jobs, but no streaming tab appeared for those jobs. > > Patcharee > > > > On 04. des. 2015 18:12, PhuDuc Nguyen wrote: > >

Re: Problem with RDD of (Long, Byte[Array])

2015-12-03 Thread Josh Rosen
Are the keys that you're joining on the byte arrays themselves? If so, that's not likely to work because of how Java computes arrays' hashCodes; see https://issues.apache.org/jira/browse/SPARK-597. If this turns out to be the problem, we should look into strengthening the checks for array-type
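One simple workaround, sketched here with toy data (not necessarily the fix adopted in this thread): give the keys content-based equality before joining, for example by converting Array[Byte] to Seq[Byte].

    val left  = sc.parallelize(Seq((Array[Byte](1, 2), "a")))   // assumes an existing SparkContext `sc`
    val right = sc.parallelize(Seq((Array[Byte](1, 2), "b")))

    // Java arrays use identity-based hashCode/equals, so joining on Array[Byte]
    // directly misses matches; Seq[Byte] keys compare by content.
    val joined = left.map { case (k, v) => (k.toSeq, v) }
      .join(right.map { case (k, v) => (k.toSeq, v) })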

Re: Low Latency SQL query

2015-12-01 Thread Josh Rosen
Use a long-lived SparkContext rather than creating a new one for each query. On Tue, Dec 1, 2015 at 11:52 AM Andrés Ivaldi wrote: > Hi, > > I'd like to use spark to perform some transformations over data stored > inSQL, but I need low Latency, I'm doing some test and I run

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Josh Rosen
Yep, you shouldn't enable *spark.driver.allowMultipleContexts* since it has the potential to cause extremely difficult-to-debug task failures; it was originally introduced as an escape-hatch to allow users whose workloads happened to work "by accident" to continue using multiple active contexts,

Re: spark.cleaner.ttl for 1.4.1

2015-11-30 Thread Josh Rosen
AFAIK the ContextCleaner should perform all of the cleaning *as long as garbage collection is performed frequently enough on the driver*. See https://issues.apache.org/jira/browse/SPARK-7689 and https://github.com/apache/spark/pull/6220#issuecomment-102950055 for discussion of this technicality.

Re: out of memory error with Parquet

2015-11-13 Thread Josh Rosen
Tip: jump straight to 1.5.2; it has some key bug fixes. Sent from my phone > On Nov 13, 2015, at 10:02 PM, AlexG wrote: > > Never mind; when I switched to Spark 1.5.0, my code works as written and is > pretty fast! Looking at some Parquet related Spark jiras, it seems that

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Josh Rosen
When we remove this, we should add a style-checker rule to ban the import so that it doesn't get added back by accident. On Mon, Nov 9, 2015 at 6:13 PM, Michael Armbrust wrote: > Yeah, we should probably remove that. > > On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Josh Rosen
Hi Sjoerd, Did your job actually *fail* or did it just generate many spurious exceptions? While the stacktrace that you posted does indicate a bug, I don't think that it should have stopped query execution because Spark should have fallen back to an interpreted code path (note the "Failed to

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Josh Rosen
Hi Jerry, Do you have speculation enabled? A write which produces one million files / output partitions might be using tons of driver memory via the OutputCommitCoordinator's bookkeeping data structures. On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam wrote: > Hi spark guys, >

Re: java.util.NoSuchElementException: key not found error

2015-10-21 Thread Josh Rosen
This is https://issues.apache.org/jira/browse/SPARK-10422, which has been fixed in Spark 1.5.1. On Wed, Oct 21, 2015 at 4:40 PM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > In 1.5.0 if I use randomSplit on a data frame I get this error. > > Here is the code snippet - > > val

Re: spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Josh Rosen
Can you report this as an issue at https://github.com/databricks/spark-avro/issues so that it's easier to track? Thanks! On Wed, Oct 14, 2015 at 1:38 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > I save my dataframe to avro with spark-avro 1.0.0 and it looks like this > (using

Fwd: Add row IDs column to data frame

2015-10-02 Thread Josh Levy-Kramer
index", index_col) The error I get is: org.apache.spark.sql.AnalysisException: resolved attribute(s) id#76L missing from col1#69,col2#70 in operator !Project [col1#69,col2#70,id#76L AS index#77L]; Is this the right to add an ID column or is this a bug? Many thanks. Josh

Re: Potential racing condition in DAGScheduler when Spark 1.5 caching

2015-09-24 Thread Josh Rosen
I believe that this is an instance of https://issues.apache.org/jira/browse/SPARK-10422, which should be fixed in upcoming 1.5.1 release. On Thu, Sep 24, 2015 at 12:52 PM, Mark Hamstra wrote: > Where do you see a race in the DAGScheduler? On a quick look at your >

Does anyone use ShuffleDependency directly?

2015-09-18 Thread Josh Rosen
Does anyone use ShuffleDependency directly in their Spark code or libraries? If so, how do you use it? Similarly, does anyone use ShuffleHandle

Re: Re: Table is modified by DataFrameWriter

2015-09-16 Thread Josh Rosen
What are your JDBC properties configured to? Do you have overwrite mode enabled? On Wed, Sep 16, 2015 at 7:39 PM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Spark-1.4.1 > > > *From:* Ted Yu > *Date:* 2015-09-17 10:29 > *To:* guoqing0...@yahoo.com.hk >
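For context, the "overwrite mode" being asked about is the DataFrameWriter save mode; a minimal sketch with placeholder connection details and an assumed DataFrame `df`:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "dbuser")       // placeholder credentials
    props.setProperty("password", "secret")

    // SaveMode.Overwrite drops and recreates the target table (one way an
    // existing table gets "modified"); SaveMode.Append only adds rows.
    df.write.mode(SaveMode.Append).jdbc("jdbc:postgresql://host/db", "my_table", props)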

Re: Exception in spark

2015-08-11 Thread Josh Rosen
Can you share a query or stack trace? More information would make this question easier to answer. On Tue, Aug 11, 2015 at 8:50 PM, Ravisankar Mani rrav...@gmail.com wrote: Hi all, We got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to

Re: master compile broken for scala 2.11

2015-07-14 Thread Josh Rosen
I've opened a PR to fix this; please take a look: https://github.com/apache/spark/pull/7405 On Tue, Jul 14, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote: it works for scala 2.10, but for 2.11 i get: [ERROR]

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to read chunk

2015-06-25 Thread Josh Rosen
Which Spark version are you using? AFAIK the corruption bugs in sort-based shuffle should have been fixed in newer Spark releases. On Wed, Jun 24, 2015 at 12:25 PM, Piero Cinquegrana pcinquegr...@marketshare.com wrote: Switching spark.shuffle.manager from sort to hash fixed this issue as

Re: org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Josh Rosen
Mind filing a JIRA? On Tue, Jun 23, 2015 at 9:34 AM, Koert Kuipers ko...@tresata.com wrote: just a heads up, i was doing some basic coding using DataFrame, Row, StructType, etc. and i ended up with deadlocks in my sbt tests due to the usage of ScalaReflectionLock.synchronized in the spark

Re: Serializer not switching

2015-06-22 Thread Josh Rosen
My hunch is that you changed spark.serializer to Kryo but left spark.closureSerializer unmodified, so it's still using Java for closure serialization. Kryo doesn't really work as a closure serializer but there's an open pull request to fix this: https://github.com/apache/spark/pull/6361 On Mon,
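A minimal sketch of the distinction being drawn (the Kryo setting below is the standard one; closure serialization is controlled separately by spark.closureSerializer, which at the time did not really work with Kryo):

    import org.apache.spark.SparkConf

    // Controls data serialization (shuffle, caching, broadcast) only;
    // closures are serialized by a separate, Java-based serializer.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")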

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Josh Rosen
If your job is dying due to out of memory errors in the post-shuffle stage, I'd consider the following approach for implementing de-duplication / distinct(): - Use sortByKey() to perform a full sort of your dataset. - Use mapPartitions() to iterate through each partition of the sorted dataset,
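A rough sketch of that approach, assuming a placeholder input path: a full sort brings equal records into the same range partition and next to each other, so each partition can be de-duplicated in a single streaming pass without a large in-memory hash map.

    val records = sc.textFile("hdfs:///path/to/unioned/data")   // placeholder input, assumes `sc`

    val deduped = records
      .map(r => (r, ()))          // sort on the whole record
      .sortByKey()
      .mapPartitions { iter =>
        // keep only the first occurrence of each run of equal keys
        var prev: Option[String] = None
        iter.flatMap { case (k, _) =>
          if (prev == Some(k)) None
          else { prev = Some(k); Some(k) }
        }
      }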

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-13 Thread Josh Rosen
-- *From:* Josh Rosen rosenvi...@gmail.com *To:* Sanjay Subramanian sanjaysubraman...@yahoo.com *Cc:* user@spark.apache.org user@spark.apache.org *Sent:* Friday, June 12, 2015 7:15 AM *Subject:* Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen
Sent from my phone On Jun 11, 2015, at 8:43 AM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey guys Using Hive and Impala daily intensively. Want to transition to spark-sql in CLI mode Currently in my sandbox I am using the Spark (standalone mode) in the CDH

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen
It sounds like this might be caused by a memory configuration problem. In addition to looking at the executor memory, I'd also bump up the driver memory, since it appears that your shell is running out of memory when collecting a large query result. Sent from my phone On Jun 11, 2015, at

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-11 Thread Josh Mahonin
with a GMane link to the thread? Good luck, Josh On Thu, Jun 11, 2015 at 2:38 AM, Jeroen Vlek j.v...@anchormen.nl wrote: Hi Josh, That worked! Thank you so much! (I can't believe it was something so obvious ;) ) If you care about such a thing you could answer my question here for bounty

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Josh Mahonin
Josh On Wed, Jun 10, 2015 at 4:11 AM, Jeroen Vlek j.v...@anchormen.nl wrote: Hi Josh, Thank you for your effort. Looking at your code, I feel that mine is semantically the same, except written in Java. The dependencies in the pom.xml all have the scope provided. The job is submitted

Re: Fully in-memory shuffles

2015-06-10 Thread Josh Rosen
There's a discussion of this at https://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote: Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-09 Thread Josh Mahonin
suspect that keeping all of the spark and phoenix dependencies marked as 'provided', and including the Phoenix client JAR in the Spark classpath would work as well. Good luck, Josh On Tue, Jun 9, 2015 at 4:40 AM, Jeroen Vlek j.v...@anchormen.nl wrote: Hi, I posted a question with regards

Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Josh Rosen
enough to split data into disk. We will work on it to understand and reproduce the problem (not first priority though...) On 1 June 2015 at 23:02, Josh Rosen rosenvi...@gmail.com wrote: How much work is it to produce a small standalone reproduction? Can you create an Avro file with some mock

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-02 Thread Josh Rosen
My suggestion is that you change the Spark setting which controls the compression codec that Spark uses for internal data transfers. Set spark.io.compression.codec to lzf in your SparkConf. On Mon, Jun 1, 2015 at 8:46 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Hello Josh, Are you suggesting
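A minimal sketch of that setting (a workaround for the corruption bug, not a fix):

    import org.apache.spark.SparkConf

    // Switch Spark's internal block/shuffle compression codec from snappy to lzf.
    val conf = new SparkConf().set("spark.io.compression.codec", "lzf")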

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Josh Rosen
...@gmail.com wrote: Hi We are using spark 1.3.1 Avro-chill (tomorrow will check if its important) we register avro classes from java Avro 1.7.6 On May 31, 2015 22:37, Josh Rosen rosenvi...@gmail.com wrote: Which Spark version are you using? I'd like to understand whether this change could

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Josh Rosen
If you can't run a patched Spark version, then you could also consider using LZF compression instead, since that codec isn't affected by this bug. On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or and...@databricks.com wrote: Hi Deepak, This is a notorious bug that is being tracked at

Re: Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread Josh Rosen
I don't think that 0.9.3 has been released, so I'm assuming that you're running on branch-0.9. There's been over 4000 commits between 0.9.3 and 1.3.1, so I'm afraid that this question doesn't have a concise answer: https://github.com/apache/spark/compare/branch-0.9...v1.3.1 To narrow down the

Re: Does long-lived SparkContext hold on to executor resources?

2015-05-12 Thread Josh Rosen
I would be cautious regarding use of spark.cleaner.ttl, as it can lead to confusing error messages if time-based cleaning deletes resources that are still needed. See my comment at

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com wrote: +dev On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote:

Python 3 support for PySpark has been merged into master

2015-04-16 Thread Josh Rosen
the PySpark unit tests locally to make sure that the change still work correctly in older branches. I can also help with backports / fixing conflicts. Thanks to Davies Liu, Shane Knapp, Thom Neale, Xiangrui Meng, and everyone else who helped with this patch. - Josh

Re: A problem with Spark 1.3 artifacts

2015-04-06 Thread Josh Rosen
to continue debugging this issue, I think we should move this discussion over to JIRA so it's easier to track and reference. Hope this helps, Josh On Thu, Apr 2, 2015 at 7:34 AM, Jacek Lewandowski jacek.lewandow...@datastax.com wrote: A very simple example which works well with Spark 1.2

Re: Streaming scheduling delay

2015-03-01 Thread Josh J
On Fri, Feb 13, 2015 at 2:21 AM, Gerard Maas gerard.m...@gmail.com wrote: KafkaOutputServicePool Could you please give example code of what KafkaOutputServicePool would look like? When I tried object pooling, I ended up with various not-serializable exceptions. Thanks! Josh

Re: throughput in the web console?

2015-02-25 Thread Josh J
at 10:29 PM, Josh J joshjd...@gmail.com wrote: Hi, I plan to run a parameter search varying the number of cores, epoch, and parallelism. The web console provides a way to archive the previous runs, though is there a way to view in the console the throughput? Rather than logging

Re: throughput in the web console?

2015-02-25 Thread Josh J
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com wrote: For SparkStreaming applications, there is already a tab called Streaming which displays the basic statistics. Would I just need to extend this tab to add the throughput?

throughput in the web console?

2015-02-24 Thread Josh J
the logs files to the web console processing times? Thanks, Josh

Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Josh Rosen
We (Databricks) use our own DirectOutputCommitter implementation, which is a couple tens of lines of Scala code. The class would almost entirely be a no-op except we took some care to properly handle the _SUCCESS file. On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim m...@palantir.com wrote: I
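The Databricks class itself isn't shown in the thread; a minimal sketch of what a no-op direct committer might look like (this omits the _SUCCESS handling mentioned above, and assumes tasks write straight to the final output location):

    import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

    class DirectOutputCommitter extends OutputCommitter {
      // Nothing to set up or move: tasks write directly to the final output path.
      override def setupJob(jobContext: JobContext): Unit = {}
      override def setupTask(taskContext: TaskAttemptContext): Unit = {}
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = {}
      override def abortTask(taskContext: TaskAttemptContext): Unit = {}
    }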

measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Josh J
Hi, I have a stream pipeline which invokes map, reduceByKey, filter, and flatMap. How can I measure the time taken in each stage? Thanks, Josh

Re: performance of saveAsTextFile moving files from _temporary

2015-01-27 Thread Josh Walton
I'm not sure how to confirm how the moving is happening, however, one of the jobs just completed that I was talking about with 9k files of 4mb each. Spark UI showed the job being complete after ~2 hours. The last four hours of the job was just moving the files from _temporary to their final

Re: spark-shell has syntax error on windows.

2015-01-23 Thread Josh Rosen
Do you mind filing a JIRA issue for this which includes the actual error message string that you saw? https://issues.apache.org/jira/browse/SPARK On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I am not sure if you get the same exception as I do --

Re: Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.

2015-01-18 Thread Josh Rosen
This looks like a bug in the master branch of Spark, related to some recent changes to EventLoggingListener. You can reproduce this bug on a fresh Spark checkout by running ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/nonexistent-dir where

Re: how to run python app in yarn?

2015-01-14 Thread Josh Rosen
There's an open PR for supporting yarn-cluster mode in PySpark: https://github.com/apache/spark/pull/3976 (currently blocked on reviewer attention / time) On Wed, Jan 14, 2015 at 3:16 PM, Marcelo Vanzin van...@cloudera.com wrote: As the error message says... On Wed, Jan 14, 2015 at 3:14 PM,

Re: dockerized spark executor on mesos?

2015-01-14 Thread Josh J
We have dockerized Spark Master and worker(s) separately and are using it in our dev environment. Is this setup available on github or dockerhub? On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian vsubr...@gmail.com wrote: We have dockerized Spark Master and worker(s) separately and are

spark standalone master with workers on two nodes

2015-01-13 Thread Josh J
Hi, I'm trying to run Spark Streaming standalone on two nodes. I'm able to run on a single node fine. I start both workers and it registers in the Spark UI. However, the application says SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 Any ideas? Thanks, Josh

Re: train many decision tress with a single spark job

2015-01-12 Thread Josh Buffum
(data) but just to deal with it on whatever spark worker is handling kvp? Does that question make sense? Thanks! Josh On Sun, Jan 11, 2015 at 4:12 AM, Sean Owen so...@cloudera.com wrote: You just mean you want to divide the data set into N subsets, and do that dividing by user, not make one

Re: train many decision tress with a single spark job

2015-01-12 Thread Josh Buffum
are using RDDs inside RDDs. But I am also not sure you should do what it looks like you are trying to do. On Jan 13, 2015 12:32 AM, Josh Buffum jbuf...@gmail.com wrote: Sean, Thanks for the response. Is there some subtle difference between one model partitioned by N users or N models per each 1 user

train many decision tress with a single spark job

2015-01-10 Thread Josh Buffum
I've got a data set of activity by user. For each user, I'd like to train a decision tree model. I currently have the feature creation step implemented in Spark and would naturally like to use mllib's decision tree model. However, it looks like the decision tree model expects the whole RDD and

Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread Josh Rosen
Can you please file a JIRA issue for this? This will make it easier to triage this issue. https://issues.apache.org/jira/browse/SPARK Thanks, Josh On Thu, Jan 8, 2015 at 2:34 AM, frodo777 roberto.vaquer...@bitmonlab.com wrote: Hello everyone. With respect to the configuration problem

Re: Mesos resource allocation

2015-01-05 Thread Josh Devins
thoughts and actually very curious about how others are running Spark on Mesos with large heaps (as a result of large memory machines). Perhaps this is a non-issue when we have more multi-tenancy in the cluster, but for now, this is not the case. Thanks, Josh On 24 December 2014 at 06:22, Tim Chen

Re: Shuffle Problems in 1.2.0

2015-01-04 Thread Josh Rosen
hard to say from this error trace alone. On December 30, 2014 at 5:17:08 PM, Sven Krasser (kras...@gmail.com) wrote: Hey Josh, I am still trying to prune this to a minimal example, but it has been tricky since scale seems to be a factor. The job runs over ~720GB of data (the cluster's total RAM

Re: spark.akka.frameSize limit error

2015-01-04 Thread Josh Rosen
fix. In the meantime, I recommend that you increase your Akka frame size. On Sat, Jan 3, 2015 at 8:51 PM, Saeed Shahrivari saeed.shahriv...@gmail.com wrote: I use the 1.2 version. On Sun, Jan 4, 2015 at 3:01 AM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using
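A minimal sketch of the recommended workaround (the value is in megabytes and purely illustrative):

    import org.apache.spark.SparkConf

    // Allow larger Akka messages, e.g. big map-output status payloads.
    val conf = new SparkConf().set("spark.akka.frameSize", "128")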

Re: Repartition Memory Leak

2015-01-04 Thread Josh Rosen
@Brad, I'm guessing that the additional memory usage is coming from the shuffle performed by coalesce, so that at least explains the memory blowup. On Sun, Jan 4, 2015 at 10:16 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can try: - Using KryoSerializer - Enabling RDD Compression -

Re: spark.akka.frameSize limit error

2015-01-03 Thread Josh Rosen
Which version of Spark are you using? It seems like the issue here is that the map output statuses are too large to fit in the Akka frame size. This issue has been fixed in Spark 1.2 by using a different encoding for map outputs for jobs with many reducers (

Re: DAG info

2015-01-01 Thread Josh Rosen
This log message is normal; in this case, this message is saying that the final stage needed to compute your job does not have any dependencies / parent stages and that there are no parent stages that need to be computed. On Thu, Jan 1, 2015 at 11:02 PM, shahid sha...@trialx.com wrote: hi guys

Re: NullPointerException

2014-12-31 Thread Josh Rosen
Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I get this following Exception when I submit spark application that calculates the frequency of characters in a file. Especially, when I increase the size of data, I

Re: NullPointerException

2014-12-31 Thread Josh Rosen
:04 PM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I get this following Exception when I submit spark application that calculates the frequency of characters in a file

Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Josh Rosen
Hi Sven, Do you have a small example program that you can share which will allow me to reproduce this issue? If you have a workload that runs into this, you should be able to keep iteratively simplifying the job and reducing the data set size until you hit a fairly minimal reproduction (assuming

Re: SparkContext with error from PySpark

2014-12-30 Thread Josh Rosen
To configure the Python executable used by PySpark, see the Using the Shell Python section in the Spark Programming Guide: https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell You can set the PYSPARK_PYTHON environment variable to choose the Python executable that will be

Re: action progress in ipython notebook?

2014-12-29 Thread Josh Rosen
Josh Is there documentation available for status API? I would like to use it. Thanks, Aniket On Sun Dec 28 2014 at 02:37:32 Josh Rosen rosenvi...@gmail.com wrote: The console progress bars are implemented on top of a new stable status API that was added in Spark 1.2. It's possible
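The status API referenced here is presumably SparkStatusTracker (the stable API added in Spark 1.2 that the console progress bars sit on top of); a small polling sketch, assuming an existing SparkContext `sc`:

    val tracker = sc.statusTracker
    for (jobId <- tracker.getActiveJobIds(); job <- tracker.getJobInfo(jobId)) {
      println(s"job $jobId: ${job.status}")
      for (stageId <- job.stageIds(); stage <- tracker.getStageInfo(stageId)) {
        println(s"  stage $stageId: ${stage.numCompletedTasks()}/${stage.numTasks()} tasks done")
      }
    }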
