Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Josh Rosen
My current best guess is that Spark does *not* fully support Hadoop 3.x because https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for Hadoop 3.x) has not been resolved. There are also likely to be transitive dependency conflicts which will need to be resolved. On Mon, Jan

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Josh Rosen
Spark SQL / Tungsten's explicitly-managed off-heap memory will be capped at spark.memory.offHeap.size bytes. This is purposely specified as an absolute size rather than as a percentage of the heap size in order to allow end users to tune Spark so that its overall memory consumption stays within
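As a rough sketch of that configuration (the 2 GiB figure below is arbitrary), the cap is set as an absolute number of bytes alongside enabling off-heap memory:

    # Sketch only: cap Tungsten's explicitly-managed off-heap memory at an
    # absolute size (2 GiB here) rather than a fraction of the JVM heap.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("offheap-example")
            .set("spark.memory.offHeap.enabled", "true")
            .set("spark.memory.offHeap.size", str(2 * 1024 * 1024 * 1024)))
    sc = SparkContext(conf=conf)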

Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-10 Thread Josh Rosen
planning to add more tests to that patch). On Fri, Sep 9, 2016 at 10:37 AM Josh Rosen <joshro...@databricks.com> wrote: > cache() / persist() is definitely *not* supposed to affect the result of > a program, so the behavior that you're seeing is unexpected. > > I'll try to rep

Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Josh Rosen
cache() / persist() is definitely *not* supposed to affect the result of a program, so the behavior that you're seeing is unexpected. I'll try to reproduce this myself by caching in PySpark under heavy memory pressure, but in the meantime the following questions will help me to debug: - Does
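A rough reproduction sketch along these lines (the dataset and sizes are made up; the point is that the two storage levels must agree):

    # Sketch: an action's result must be identical under MEMORY_ONLY and
    # MEMORY_AND_DISK; only where cached blocks live may differ.
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persist-consistency-check")
    pairs = sc.parallelize(range(1000000)).map(lambda x: (x % 100, 1))

    mem_only = pairs.persist(StorageLevel.MEMORY_ONLY)
    counts_a = mem_only.reduceByKey(lambda a, b: a + b).collectAsMap()
    mem_only.unpersist()

    mem_and_disk = pairs.persist(StorageLevel.MEMORY_AND_DISK)
    counts_b = mem_and_disk.reduceByKey(lambda a, b: a + b).collectAsMap()

    assert counts_a == counts_b  # caching must never change a program's result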

Re: A number of issues when running spark-ec2

2016-04-16 Thread Josh Rosen
Using a different machine / toolchain, I've downloaded and re-uploaded all of the 1.6.1 artifacts to that S3 bucket, so hopefully everything should be working now. Let me know if you still encounter any problems with unarchiving. On Sat, Apr 16, 2016 at 3:10 PM Ted Yu wrote:

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Josh Rosen
AFAIK this is not being pushed down because it involves an implicit cast and we currently don't push casts into data sources or scans; see https://github.com/databricks/spark-redshift/issues/155 for a possibly-related discussion. On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh

Re: Kryo serialization mismatch in spark sql windowing function

2016-04-06 Thread Josh Rosen
Spark is compiled against a custom fork of Hive 1.2.1 which added shading of Protobuf and removed shading of Kryo. I think what's happening here is that stock Hive 1.2.1 is taking precedence, so the Kryo instance that it's returning is an instance of the shaded/relocated Hive version rather

Re: Spark master keeps running out of RAM

2016-03-31 Thread Josh Rosen
One possible cause of a standalone master OOMing is https://issues.apache.org/jira/browse/SPARK-6270. In 2.x, this will be fixed by https://issues.apache.org/jira/browse/SPARK-12299. In 1.x, one mitigation is to disable event logging. Another workaround would be to produce a patch which disables
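For reference, a sketch of the event-logging mitigation (this trades away the History Server record of the application):

    # Sketch: disable event logging so the standalone master does not rebuild
    # UIs for completed applications from event logs (the 1.x-era mitigation).
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.eventLog.enabled", "false")
    sc = SparkContext(conf=conf)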

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
See the instructions in the Spark documentation: https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna wrote: > > > Hi, > > Scala version:2.11.7(had to upgrade the scala verison to enable case

Re: Apache Spark Exception in thread “main” java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class

2016-03-19 Thread Josh Rosen
Err, whoops, looks like this is a user app and not building Spark itself, so you'll have to change your deps to use the 2.11 versions of Spark. e.g. spark-streaming_2.10 -> spark-streaming_2.11. On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen <joshro...@databricks.com> wrote: > See the

Re: Python unit tests - Unable to run it with Python 2.6 or 2.7

2016-03-11 Thread Josh Rosen
AFAIK we haven't actually broken 2.6 compatibility yet for PySpark itself, since Jenkins is still testing that configuration. I think the problem that you're seeing is that dev/run-tests / dev/run-tests-jenkins only work against Python 2.7+ right now. However, ./python/run-tests should be able to

Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Josh Rosen
Does anyone implement Spark's serializer interface (org.apache.spark.serializer.Serializer) in your own third-party code? If so, please let me know because I'd like to change this interface from a DeveloperAPI to private[spark] in Spark 2.0 in order to do some cleanup and refactoring. I think that

Re: Unresolved dep when building project with spark 1.6

2016-02-29 Thread Josh Rosen
Have you tried removing the leveldbjni files from your local ivy cache? My hunch is that this is a problem with some local cache state rather than the dependency simply being unavailable / not existing (note that the error message was "origin location must be absolute:[...]", not that the files

Re: bug for large textfiles on windows

2016-01-25 Thread Josh Rosen
Hi Christopher, What would be super helpful here is a standalone reproduction. Ideally this would be a single Scala file or set of commands that I can run in `spark-shell` in order to reproduce this. Ideally, this code would generate a giant file, then try to read it in a way that demonstrates

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Josh Rosen
Is speculation enabled? This TaskCommitDenied by driver error is thrown by writers who lost the race to commit an output partition. I don't think this has anything to do with key skew, etc. Replacing the groupByKey with a count will mask this exception because the coordination does not get
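A quick way to check or rule this out is the speculation flag; a sketch of disabling it (at the cost of losing speculative re-execution of slow tasks):

    # Sketch: with speculation off, only one task attempt tries to commit each
    # output partition, so the benign TaskCommitDenied races go away.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.speculation", "false")
    sc = SparkContext(conf=conf)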

Re: libsnappyjava.so: failed to map segment from shared object

2016-01-11 Thread Josh Rosen
This is due to the snappy-java library; I think that you'll have to configure either java.io.tmpdir or org.xerial.snappy.tempdir; see https://github.com/xerial/snappy-java/blob/1198363176ad671d933fdaf0938b8b9e609c0d8a/src/main/java/org/xerial/snappy/SnappyLoader.java#L335 On Mon, Jan 11, 2016
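A sketch of one way to set those properties from a Spark application (the directory path is an assumption; it just needs to be writable and mounted without noexec):

    # Sketch: point snappy-java's native-library extraction at a directory
    # that allows executable mappings, on both the driver and the executors.
    from pyspark import SparkConf, SparkContext

    jvm_opts = ("-Dorg.xerial.snappy.tempdir=/var/tmp/snappy "
                "-Djava.io.tmpdir=/var/tmp/snappy")
    conf = (SparkConf()
            .set("spark.driver.extraJavaOptions", jvm_opts)
            .set("spark.executor.extraJavaOptions", jvm_opts))
    sc = SparkContext(conf=conf)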

Re: how garbage collection works on parallelize

2016-01-08 Thread Josh Rosen
It won't be GC'd as long as the RDD which results from `parallelize()` is kept around; that RDD keeps strong references to the parallelized collection's elements in order to enable fault-tolerance. On Fri, Jan 8, 2016 at 6:50 PM, jluan wrote: > Hi, > > I am curious about

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
If users are able to install Spark 2.0 on their RHEL clusters, then I imagine that they're also capable of installing a standalone Python alongside that Spark version (without changing Python systemwide). For instance, Anaconda/Miniconda make it really easy to install Python 2.7.x/3.x without

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
PATH. I understand that >> other administrators may not be so compliant. >> >> Saw a small bit about the java version in there; does Spark currently >> prefer Java 1.8.x? >> >> —Ken >> >> On Jan 5, 2016, at 6:08 PM, Josh Rosen <joshro...@databricks.com&

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
5, 2016 at 3:07 PM, Josh Rosen <joshro...@databricks.com> wrote: > Yep, the driver and executors need to have compatible Python versions. I > think that there are some bytecode-level incompatibilities between 2.6 and > 2.7 which would impact the deserialization of Python closures, s

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Josh Rosen
;> even if python 2.7 was needed only on this one machine that launches >>>> the app we can not ship it with our software because its gpl licensed, so >>>> the client would have to download it and install it themselves, and this >>>> would mean its an i

Re: Applicaiton Detail UI change

2015-12-21 Thread Josh Rosen
In the script / environment which launches your Spark driver, try setting the SPARK_PUBLIC_DNS environment variable to point to a publicly-accessible hostname. See https://spark.apache.org/docs/latest/configuration.html#environment-variables for more details. This environment variable also
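A sketch of doing that from the launching script itself (the hostname is a placeholder, and depending on your deployment it may be cleaner to export the variable in the shell that starts the driver):

    # Sketch: advertise a publicly reachable hostname in the UI links by
    # setting SPARK_PUBLIC_DNS before the SparkContext is created.
    import os
    os.environ["SPARK_PUBLIC_DNS"] = "driver.example.com"  # placeholder hostname

    from pyspark import SparkContext
    sc = SparkContext(appName="public-dns-example")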

Re: fishing for help!

2015-12-21 Thread Josh Rosen
@Eran, are Server 1 and Server 2 both part of the same cluster / do they have similar positions in the network topology w.r.t the Spark executors? If Server 1 had fast network access to the executors but Server 2 was across a WAN then I'd expect the job to run slower from Server 2 due to the

Re: Spark 1.3.1 - Does SparkConext in multi-threaded env requires SparkEnv.set(env) anymore

2015-12-10 Thread Josh Rosen
Nope, you shouldn't have to do that anymore. As of https://github.com/apache/spark/pull/2624, which is in Spark 1.2.0+, SparkEnv's thread-local stuff was removed and replaced by a simple global variable (since it was used in an *effectively* global way before (see my comments on that PR)). As a

Re: Spark UI - Streaming Tab

2015-12-04 Thread Josh Rosen
The Streaming tab is only supported in the live UI, not in the History Server. On Fri, Dec 4, 2015 at 9:31 AM, patcharee wrote: > I ran streaming jobs, but no streaming tab appeared for those jobs. > > Patcharee > > > > On 04. des. 2015 18:12, PhuDuc Nguyen wrote: > >

Re: Problem with RDD of (Long, Byte[Array])

2015-12-03 Thread Josh Rosen
Are the keys that you're joining on byte arrays themselves? If so, that's not likely to work because of how Java computes arrays' hashCodes; see https://issues.apache.org/jira/browse/SPARK-597. If this turns out to be the problem, we should look into strengthening the checks for array-type

Re: Low Latency SQL query

2015-12-01 Thread Josh Rosen
Use a long-lived SparkContext rather than creating a new one for each query. On Tue, Dec 1, 2015 at 11:52 AM Andrés Ivaldi wrote: > Hi, > > I'd like to use spark to perform some transformations over data stored > in SQL, but I need low Latency, I'm doing some test and I run
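A minimal sketch of that pattern with the 1.x-era API (the query helper is illustrative):

    # Sketch: create the context once at startup and reuse it for every query,
    # rather than paying SparkContext startup cost on each request.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="low-latency-sql")  # created once, kept alive
    sqlContext = SQLContext(sc)

    def run_query(sql_text):
        # every call reuses the warm, long-lived context
        return sqlContext.sql(sql_text).collect()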

Re: Question about yarn-cluster mode and spark.driver.allowMultipleContexts

2015-12-01 Thread Josh Rosen
Yep, you shouldn't enable *spark.driver.allowMultipleContexts* since it has the potential to cause extremely difficult-to-debug task failures; it was originally introduced as an escape-hatch to allow users whose workloads happened to work "by accident" to continue using multiple active contexts,

Re: spark.cleaner.ttl for 1.4.1

2015-11-30 Thread Josh Rosen
AFAIK the ContextCleaner should perform all of the cleaning *as long as garbage collection is performed frequently enough on the driver*. See https://issues.apache.org/jira/browse/SPARK-7689 and https://github.com/apache/spark/pull/6220#issuecomment-102950055 for discussion of this technicality.

Re: out of memory error with Parquet

2015-11-13 Thread Josh Rosen
Tip: jump straight to 1.5.2; it has some key bug fixes. Sent from my phone > On Nov 13, 2015, at 10:02 PM, AlexG wrote: > > Never mind; when I switched to Spark 1.5.0, my code works as written and is > pretty fast! Looking at some Parquet related Spark jiras, it seems that

Re: Anybody hit this issue in spark shell?

2015-11-09 Thread Josh Rosen
When we remove this, we should add a style-checker rule to ban the import so that it doesn't get added back by accident. On Mon, Nov 9, 2015 at 6:13 PM, Michael Armbrust wrote: > Yeah, we should probably remove that. > > On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Josh Rosen
Hi Sjoerd, Did your job actually *fail* or did it just generate many spurious exceptions? While the stacktrace that you posted does indicate a bug, I don't think that it should have stopped query execution because Spark should have fallen back to an interpreted code path (note the "Failed to

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Josh Rosen
Hi Jerry, Do you have speculation enabled? A write which produces one million files / output partitions might be using tons of driver memory via the OutputCommitCoordinator's bookkeeping data structures. On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam wrote: > Hi spark guys, >

Re: java.util.NoSuchElementException: key not found error

2015-10-21 Thread Josh Rosen
This is https://issues.apache.org/jira/browse/SPARK-10422, which has been fixed in Spark 1.5.1. On Wed, Oct 21, 2015 at 4:40 PM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > In 1.5.0 if I use randomSplit on a data frame I get this error. > > Here is the code snippet - > > val

Re: spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Josh Rosen
Can you report this as an issue at https://github.com/databricks/spark-avro/issues so that it's easier to track? Thanks! On Wed, Oct 14, 2015 at 1:38 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > I save my dataframe to avro with spark-avro 1.0.0 and it looks like this > (using

Re: Potential racing condition in DAGScheduler when Spark 1.5 caching

2015-09-24 Thread Josh Rosen
I believe that this is an instance of https://issues.apache.org/jira/browse/SPARK-10422, which should be fixed in the upcoming 1.5.1 release. On Thu, Sep 24, 2015 at 12:52 PM, Mark Hamstra wrote: > Where do you see a race in the DAGScheduler? On a quick look at your >

Does anyone use ShuffleDependency directly?

2015-09-18 Thread Josh Rosen
Does anyone use ShuffleDependency directly in their Spark code or libraries? If so, how do you use it? Similarly, does anyone use ShuffleHandle

Re: Re: Table is modified by DataFrameWriter

2015-09-16 Thread Josh Rosen
What are your JDBC properties configured to? Do you have overwrite mode enabled? On Wed, Sep 16, 2015 at 7:39 PM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Spark-1.4.1 > > > *From:* Ted Yu > *Date:* 2015-09-17 10:29 > *To:* guoqing0...@yahoo.com.hk >

Re: Exception in spark

2015-08-11 Thread Josh Rosen
Can you share a query or stack trace? More information would make this question easier to answer. On Tue, Aug 11, 2015 at 8:50 PM, Ravisankar Mani rrav...@gmail.com wrote: Hi all, We got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to

Re: master compile broken for scala 2.11

2015-07-14 Thread Josh Rosen
I've opened a PR to fix this; please take a look: https://github.com/apache/spark/pull/7405 On Tue, Jul 14, 2015 at 11:22 AM, Koert Kuipers ko...@tresata.com wrote: it works for scala 2.10, but for 2.11 i get: [ERROR]

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to read chunk

2015-06-25 Thread Josh Rosen
Which Spark version are you using? AFAIK the corruption bugs in sort-based shuffle should have been fixed in newer Spark releases. On Wed, Jun 24, 2015 at 12:25 PM, Piero Cinquegrana pcinquegr...@marketshare.com wrote: Switching spark.shuffle.manager from sort to hash fixed this issue as

Re: org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Josh Rosen
Mind filing a JIRA? On Tue, Jun 23, 2015 at 9:34 AM, Koert Kuipers ko...@tresata.com wrote: just a heads up, i was doing some basic coding using DataFrame, Row, StructType, etc. and i ended up with deadlocks in my sbt tests due to the usage of ScalaReflectionLock.synchronized in the spark

Re: Serializer not switching

2015-06-22 Thread Josh Rosen
My hunch is that you changed spark.serializer to Kryo but left spark.closureSerializer unmodified, so it's still using Java for closure serialization. Kryo doesn't really work as a closure serializer but there's an open pull request to fix this: https://github.com/apache/spark/pull/6361 On Mon,

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Josh Rosen
If your job is dying due to out of memory errors in the post-shuffle stage, I'd consider the following approach for implementing de-duplication / distinct(): - Use sortByKey() to perform a full sort of your dataset. - Use mapPartitions() to iterate through each partition of the sorted dataset,
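A sketch of that approach, assuming an RDD of (key, value) pairs named `records` (the sentinel trick is just one way to compare adjacent keys):

    # Sketch: sortByKey makes equal keys adjacent (and range-partitions them
    # into the same partition), so a streaming pass per partition can drop
    # duplicates without building a large in-memory hash map.
    def drop_adjacent_duplicate_keys(iterator):
        prev = object()  # sentinel that never equals real data
        for key, value in iterator:
            if key != prev:
                yield (key, value)
            prev = key

    deduped = (records
               .sortByKey()
               .mapPartitions(drop_adjacent_duplicate_keys,
                              preservesPartitioning=True))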

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-13 Thread Josh Rosen
-- *From:* Josh Rosen rosenvi...@gmail.com *To:* Sanjay Subramanian sanjaysubraman...@yahoo.com *Cc:* user@spark.apache.org user@spark.apache.org *Sent:* Friday, June 12, 2015 7:15 AM *Subject:* Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen
Sent from my phone On Jun 11, 2015, at 8:43 AM, Sanjay Subramanian sanjaysubraman...@yahoo.com.INVALID wrote: hey guys Using Hive and Impala daily intensively. Want to transition to spark-sql in CLI mode Currently in my sandbox I am using the Spark (standalone mode) in the CDH

Re: spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-12 Thread Josh Rosen
It sounds like this might be caused by a memory configuration problem. In addition to looking at the executor memory, I'd also bump up the driver memory, since it appears that your shell is running out of memory when collecting a large query result. Sent from my phone On Jun 11, 2015, at

Re: Fully in-memory shuffles

2015-06-10 Thread Josh Rosen
There's a discussion of this at https://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote: Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Josh Rosen
enough to split data into disk. We will work on it to understand and reproduce the problem(not first priority though...) On 1 June 2015 at 23:02, Josh Rosen rosenvi...@gmail.com wrote: How much work is to produce a small standalone reproduction? Can you create an Avro file with some mock

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-02 Thread Josh Rosen
, Deepak On Tue, Jun 2, 2015 at 4:16 AM, Josh Rosen rosenvi...@gmail.com wrote: If you can't run a patched Spark version, then you could also consider using LZF compression instead, since that codec isn't affected by this bug. On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or and...@databricks.com wrote

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Josh Rosen
...@gmail.com wrote: Hi We are using spark 1.3.1 Avro-chill (tomorrow will check if its important) we register avro classes from java Avro 1.7.6 On May 31, 2015 22:37, Josh Rosen rosenvi...@gmail.com wrote: Which Spark version are you using? I'd like to understand whether this change could

Re: java.io.IOException: FAILED_TO_UNCOMPRESS(5)

2015-06-01 Thread Josh Rosen
If you can't run a patched Spark version, then you could also consider using LZF compression instead, since that codec isn't affected by this bug. On Mon, Jun 1, 2015 at 3:32 PM, Andrew Or and...@databricks.com wrote: Hi Deepak, This is a notorious bug that is being tracked at

Re: Performance degradation between spark 0.9.3 and 1.3.1

2015-05-22 Thread Josh Rosen
I don't think that 0.9.3 has been released, so I'm assuming that you're running on branch-0.9. There have been over 4,000 commits between 0.9.3 and 1.3.1, so I'm afraid that this question doesn't have a concise answer: https://github.com/apache/spark/compare/branch-0.9...v1.3.1 To narrow down the

Re: Does long-lived SparkContext hold on to executor resources?

2015-05-12 Thread Josh Rosen
I would be cautious regarding use of spark.cleaner.ttl, as it can lead to confusing error messages if time-based cleaning deletes resources that are still needed. See my comment at

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com wrote: +dev On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com wrote:

Python 3 support for PySpark has been merged into master

2015-04-16 Thread Josh Rosen
Hi everyone, We just merged Python 3 support for PySpark into Spark's master branch (which will become Spark 1.4.0). This means that PySpark now supports Python 2.6+, PyPy 2.5+, and Python 3.4+. To run with Python 3, download and build Spark from the master branch then configure the

Re: A problem with Spark 1.3 artifacts

2015-04-06 Thread Josh Rosen
My hunch is that this behavior was introduced by a patch to start shading Jetty in Spark 1.3: https://issues.apache.org/jira/browse/SPARK-3996. Note that Spark's *MetricsSystem* class is marked as *private[spark]* and thus isn't intended to be interacted with directly by users. It's not super

Re: Which OutputCommitter to use for S3?

2015-02-20 Thread Josh Rosen
We (Databricks) use our own DirectOutputCommitter implementation, which is a couple tens of lines of Scala code. The class would almost entirely be a no-op except we took some care to properly handle the _SUCCESS file. On Fri, Feb 20, 2015 at 3:52 PM, Mingyu Kim m...@palantir.com wrote: I
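For context, a hedged sketch of how a committer class would be wired in from PySpark; the class name below is hypothetical, and the implementation referenced above is not public:

    # Sketch: route old-API saveAsHadoopFile-style writes through a direct
    # committer by setting the Hadoop property via Spark's spark.hadoop.*
    # configuration passthrough.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set(
        "spark.hadoop.mapred.output.committer.class",
        "com.example.hadoop.DirectOutputCommitter")  # hypothetical class name
    sc = SparkContext(conf=conf)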

Re: spark-shell has syntax error on windows.

2015-01-23 Thread Josh Rosen
Do you mind filing a JIRA issue for this which includes the actual error message string that you saw? https://issues.apache.org/jira/browse/SPARK On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I am not sure if you get the same exception as I do --

Re: Recent Git Builds Application WebUI Problem and Exception Stating Log directory /tmp/spark-events does not exist.

2015-01-18 Thread Josh Rosen
This looks like a bug in the master branch of Spark, related to some recent changes to EventLoggingListener. You can reproduce this bug on a fresh Spark checkout by running ./bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/nonexistent-dir where

Re: how to run python app in yarn?

2015-01-14 Thread Josh Rosen
There's an open PR for supporting yarn-cluster mode in PySpark: https://github.com/apache/spark/pull/3976 (currently blocked on reviewer attention / time) On Wed, Jan 14, 2015 at 3:16 PM, Marcelo Vanzin van...@cloudera.com wrote: As the error message says... On Wed, Jan 14, 2015 at 3:14 PM,

Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread Josh Rosen
Can you please file a JIRA issue for this? This will make it easier to triage this issue. https://issues.apache.org/jira/browse/SPARK Thanks, Josh On Thu, Jan 8, 2015 at 2:34 AM, frodo777 roberto.vaquer...@bitmonlab.com wrote: Hello everyone. With respect to the configuration problem that

Re: Shuffle Problems in 1.2.0

2015-01-04 Thread Josh Rosen
On Tue, Dec 30, 2014 at 12:15 PM, Josh Rosen rosenvi...@gmail.com wrote: Hi Sven, Do you have a small example program that you can share which will allow me to reproduce this issue? If you have a workload that runs into this, you should be able to keep iteratively simplifying the job and reducing

Re: spark.akka.frameSize limit error

2015-01-04 Thread Josh Rosen
fix. In the meantime, I recommend that you increase your Akka frame size. On Sat, Jan 3, 2015 at 8:51 PM, Saeed Shahrivari saeed.shahriv...@gmail.com wrote: I use the 1.2 version. On Sun, Jan 4, 2015 at 3:01 AM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using

Re: Repartition Memory Leak

2015-01-04 Thread Josh Rosen
@Brad, I'm guessing that the additional memory usage is coming from the shuffle performed by coalesce, so that at least explains the memory blowup. On Sun, Jan 4, 2015 at 10:16 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can try: - Using KryoSerializer - Enabling RDD Compression -

Re: spark.akka.frameSize limit error

2015-01-03 Thread Josh Rosen
Which version of Spark are you using? It seems like the issue here is that the map output statuses are too large to fit in the Akka frame size. This issue has been fixed in Spark 1.2 by using a different encoding for map outputs for jobs with many reducers (

Re: DAG info

2015-01-01 Thread Josh Rosen
This log message is normal; in this case, this message is saying that the final stage needed to compute your job does not have any dependencies / parent stages and that there are no parent stages that need to be computed. On Thu, Jan 1, 2015 at 11:02 PM, shahid sha...@trialx.com wrote: hi guys

Re: NullPointerException

2014-12-31 Thread Josh Rosen
Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I get this following Exception when I submit spark application that calculates the frequency of characters in a file. Especially, when I increase the size of data, I

Re: NullPointerException

2014-12-31 Thread Josh Rosen
:04 PM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I get this following Exception when I submit spark application that calculates the frequency of characters in a file

Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Josh Rosen
Hi Sven, Do you have a small example program that you can share which will allow me to reproduce this issue? If you have a workload that runs into this, you should be able to keep iteratively simplifying the job and reducing the data set size until you hit a fairly minimal reproduction (assuming

Re: SparkContext with error from PySpark

2014-12-30 Thread Josh Rosen
To configure the Python executable used by PySpark, see the "Using the Shell" section for Python in the Spark Programming Guide: https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell You can set the PYSPARK_PYTHON environment variable to choose the Python executable that will be
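A sketch (the interpreter path is a placeholder; what matters is that the variable is set before the SparkContext is created):

    # Sketch: choose the Python executable used for PySpark workers by
    # setting PYSPARK_PYTHON in the environment up front.
    import os
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"  # placeholder path

    from pyspark import SparkContext
    sc = SparkContext(appName="custom-python-exec")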

Re: action progress in ipython notebook?

2014-12-29 Thread Josh Rosen
Josh Is there documentation available for status API? I would like to use it. Thanks, Aniket On Sun Dec 28 2014 at 02:37:32 Josh Rosen rosenvi...@gmail.com wrote: The console progress bars are implemented on top of a new stable status API that was added in Spark 1.2. It's possible

Re: action progress in ipython notebook?

2014-12-27 Thread Josh Rosen
The console progress bars are implemented on top of a new stable status API that was added in Spark 1.2. It's possible to query job progress using this interface (in older versions of Spark, you could implement a custom SparkListener and maintain the counts of completed / running / failed tasks /

Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread Josh Rosen
We have a mirror of the user and developer mailing lists on Nabble, but unfortunately this has led to significant usability issues because users may attempt to post messages through Nabble which silently fail to get posted to the actual Apache list and thus are never read by most subscribers:

Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-17 Thread Josh Rosen
will be sent to both spark.incubator.apache.org and spark.apache.org (if that is the case, i'm not sure which alias nabble posts get sent to) would make things a lot more clear. On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen rosenvi...@gmail.com wrote: I've noticed that several users are attempting to post

Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-13 Thread Josh Rosen
I've noticed that several users are attempting to post messages to Spark's user / dev mailing lists using the Nabble web UI ( http://apache-spark-user-list.1001560.n3.nabble.com/). However, there are many posts in Nabble that are not posted to the Apache lists and are flagged with This post has

Re: java.io.InvalidClassException: org.apache.spark.api.java.JavaUtils$SerializableMapWrapper; no valid constructor

2014-12-01 Thread Josh Rosen
SerializableMapWrapper was added in https://issues.apache.org/jira/browse/SPARK-3926; do you mind opening a new JIRA and linking it to that one? On Mon, Dec 1, 2014 at 12:17 AM, lokeshkumar lok...@dataken.net wrote: The workaround was to wrap the map returned by spark libraries into HashMap

Re: small bug in pyspark

2014-10-12 Thread Josh Rosen
Hi Andy, You may be interested in https://github.com/apache/spark/pull/2651, a recent pull request of mine which cleans up / simplifies the configuration of PySpark's Python executables. For instance, it makes it much easier to control which Python options are passed when launching the PySpark

Re: What if I port Spark from TCP/IP to RDMA?

2014-10-12 Thread Josh Rosen
Hi Theo, Check out *spark-perf*, a suite of performance benchmarks for Spark: https://github.com/databricks/spark-perf. - Josh On Fri, Oct 10, 2014 at 7:27 PM, Theodore Si sjyz...@gmail.com wrote: Hi, Let's say that I managed to port Spark from TCP/IP to RDMA. What tool or benchmark can I

Re: pyspark on python 3

2014-10-03 Thread Josh Rosen
It would be great if we supported Python 3 and I'd be happy to review any pull requests to add it. I don't know that Python 3 is very widely-used, but I'm open to supporting it if it won't require too much work. By the way, we recently added support for PyPy:

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Josh Rosen
If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh. On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess

Re: Question on mappartitionwithsplit

2014-08-17 Thread Josh Rosen
Has anyone tried using functools.partial ( https://docs.python.org/2/library/functools.html#functools.partial) with PySpark? If it works, it might be a nice way to address this use-case. On Sun, Aug 17, 2014 at 7:35 PM, Davies Liu dav...@databricks.com wrote: On Sun, Aug 17, 2014 at 11:21 AM,
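A sketch of what that would look like (the function and threshold are made up; `sc` is an existing SparkContext):

    # Sketch: pre-bind the extra argument with functools.partial so the
    # callable handed to mapPartitionsWithIndex takes (index, iterator),
    # which is the signature Spark expects.
    from functools import partial

    def keep_above(threshold, index, iterator):
        # 'threshold' is bound ahead of time; Spark supplies index and iterator
        return (x for x in iterator if x > threshold)

    rdd = sc.parallelize(range(100), 4)
    result = rdd.mapPartitionsWithIndex(partial(keep_above, 90)).collect()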

Re: Broadcasting a set in PySpark

2014-07-18 Thread Josh Rosen
You have to use `myBroadcastVariable.value` to access the broadcasted value; see https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables On Fri, Jul 18, 2014 at 2:56 PM, Vedant Dhandhania ved...@retentionscience.com wrote: Hi All, I am trying to broadcast a set in a
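A minimal sketch of the fix (the data is illustrative; `sc` is an existing SparkContext):

    # Sketch: reference bc.value inside the closure; the Broadcast wrapper
    # itself is not the underlying set.
    allowed = {"a", "b", "c"}
    bc = sc.broadcast(allowed)

    rdd = sc.parallelize(["a", "x", "b", "y"])
    filtered = rdd.filter(lambda item: item in bc.value).collect()  # ['a', 'b']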

Re: flatten RDD[RDD[T]]

2014-03-02 Thread Josh Rosen
Nope, nested RDDs aren't supported: https://groups.google.com/d/msg/spark-users/_Efj40upvx4/DbHCixW7W7kJ https://groups.google.com/d/msg/spark-users/KC1UJEmUeg8/N_qkTJ3nnxMJ https://groups.google.com/d/msg/spark-users/rkVPXAiCiBk/CORV5jyeZpAJ On Sun, Mar 2, 2014 at 5:37 PM, Cosmin Radoi