[Release Question]: Estimate on 3.5.2 release?

2024-04-26 Thread Paul Gerver
Hello, I'm curious if there is an estimate for when 3.5.2 for Spark Core will be released. There are several bug and security vulnerability fixes in the dependencies we are excited to receive! If anyone has any insights, that would be greatly appreciated. Thanks! - Paul

CFP for the 2nd Performance Engineering track at Community over Code NA 2023

2023-07-03 Thread Brebner, Paul
ner/> - Paul Brebner and Roger Abelenda

Rename columns without manually setting them all

2023-06-21 Thread John Paul Jayme
Hi, This is currently my column definition : Employee ID NameClient Project Team01/01/2022 02/01/2022 03/01/2022 04/01/2022 05/01/2022 12345 Dummy x Dummy a abc team a OFF WO WH WH WH As you can see, the outer columns are just
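The question above (a schema whose trailing columns are dates) can be handled without chaining `withColumnRenamed` once per column: build the full list of new names and pass it to `toDF` positionally. A minimal sketch — the date-to-identifier rule below is hypothetical, not from the original thread:

```python
# Sketch: rename many columns at once rather than calling
# withColumnRenamed per column. The renaming rule is illustrative only.
def normalize_columns(names):
    """Turn date-style headers like '01/01/2022' into valid identifiers."""
    return [("day_" + n.replace("/", "_")) if "/" in n else n
            for n in names]

# pyspark usage (not executed here):
# df = df.toDF(*normalize_columns(df.columns))
```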

How to read excel file in PySpark

2023-06-20 Thread John Paul Jayme
ct has no attribute 'read_excel'. Can you advise? JOHN PAUL JAYME Data Engineer [https://app.tdcx.com/email-signature/assets/img/tdcx-logo.png] m. +639055716384 w. www.tdcx.com<http://www.tdcx.com/> Winner of over 350 Industry Awards [Linkedin]<https://www.linkedin.com/company/tdcxgr
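Spark itself has no `read_excel`, which is what the error above is saying. One common workaround is to load the sheet with pandas and convert; a sketch, assuming pandas plus an Excel engine such as openpyxl are installed and `spark` is an existing SparkSession:

```python
# Sketch: Spark has no native Excel reader; read via pandas and convert.
# Assumes pandas (and an engine like openpyxl) are available.
import pandas as pd

def excel_to_spark(spark, path, sheet_name=0):
    pdf = pd.read_excel(path, sheet_name=sheet_name)
    return spark.createDataFrame(pdf)
```

The third-party spark-excel package is another route for large files, since the pandas path materializes the whole sheet on the driver.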

Re: NoClassDefError and SparkSession should only be created and accessed on the driver.

2022-09-20 Thread Paul Rogalinski
Hi Rajat, I have been facing similar problem recently and could solve it by moving the UDF implementation into a dedicated class instead having it implemented in the driver class/object. Regards, Paul. On Tuesday 20 September 2022 10:11:31 (+02:00), rajat kumar wrote: Hi Alton, it's
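The Scala advice above (move the UDF out of the driver object) has a PySpark analogue: keep the UDF body a plain top-level function, so Spark pickles only the function rather than dragging in driver-side state such as the SparkSession. A sketch with illustrative names:

```python
# PySpark analogue: a top-level function serializes cleanly; a closure
# over driver objects (or a method on the driver class) may not.
def clean_token(s):
    return s.strip().lower() if s is not None else None

# pyspark wiring (not executed here):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# clean_udf = udf(clean_token, StringType())
# df = df.withColumn("token", clean_udf(df["raw"]))
```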

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-31 Thread Paul Wais
oducer and even couldn't reproduce even > they spent their time. Memory leak issue is not really easy to reproduce, > unless it leaks some objects without any conditions. > > - Jungtaek Lim (HeartSaVioR) > > On Sun, Oct 20, 2019 at 7:18 PM Paul Wais wrote: >> >> Dear

pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-20 Thread Paul Wais
ere very different jobs, but perhaps this issue is bespoke to local mode? Emphasis: I did try to del the pyspark objects and run python GC. That didn't help at all. pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image) 12-core i7 with 16GB of ram and 22GB swap

Avro support broken?

2019-07-04 Thread Paul Wais
-tabpanel#comment-16878896 Cheers, -Paul - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

dropping unused data from a stream

2019-01-22 Thread Paul Tremblay
I will be streaming data and am trying to understand how to get rid of old data from a stream so it does not become to large. I will stream in one large table of buying data and join that to another table of different data. I need the last 14 days from the second table. I will not need data that
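Keeping only the last 14 days of join state is what watermarks are for in Structured Streaming. A sketch, with placeholder column names; the helper shows the equivalent batch-side cutoff:

```python
# Sketch: Structured Streaming drops old state via watermarks.
# Column names ("ts", "key") are placeholders for the real schema.
from datetime import datetime, timedelta

def cutoff(now, days=14):
    """Oldest timestamp still needed when keeping `days` days of data."""
    return now - timedelta(days=days)

# streaming usage (not executed here):
# left = stream_df.withWatermark("ts", "14 days")
# right = other_df.withWatermark("ts", "14 days")
# joined = left.join(right, ["key"])  # state older than 14 days can expire
```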

Re: Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-05-02 Thread Paul Tremblay
I would like to see the full error. However, S3 can give misleading messages if you don't have the correct permissions. On Tue, Apr 24, 2018, 2:28 PM Marco Mistroni wrote: > HI all > i am using the following code for persisting data into S3 (aws keys are > already stored

[Spark scheduling] Spark schedules single task although rdd has 48 partitions?

2018-05-02 Thread Paul Borgmans
(please notice this question was previously posted to https://stackoverflow.com/questions/49943655/spark-schedules-single-task-although-rdd-has-48-partitions) We are running Spark 2.3 / Python 3.5.2. For a job we run following code (please notice that the input txt files are just a simplified
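When Spark schedules a single task despite 48 partitions, the usual fix is to make the partition count explicit at read time or to repartition before the expensive stage. The helper below only illustrates how evenly 48 partitions divide the data; it is not Spark's actual slicing code:

```python
# Illustration of even partitioning; not Spark's internal slicing logic.
def chunk_sizes(n_items, n_parts):
    base, extra = divmod(n_items, n_parts)
    return [base + (1 if i < extra else 0) for i in range(n_parts)]

# pyspark usage (not executed here):
# rdd = sc.textFile(path, minPartitions=48)   # or rdd.repartition(48)
```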

History server and non-HDFS filesystems

2017-11-17 Thread Paul Mackles
now ADL is not heavily used at this time so I wonder if anyone is seeing this with S3 as well? Maybe not since S3 permissions are always reported as world-readable (I think) which causes checkAccessPermission() to succeed. Any thoughts or feedback appreciated. -- Thanks, Paul

Spark REST API

2017-11-07 Thread Paul Corley
these are currently streaming apps running on EMR. Paul Corley | Principle Data Engineer IgnitionOne | Marketing Technology. Simplified. Office: 1545 Peachtree St NE | Suite 500 | Atlanta, GA | 30309 Direct: 702.336.0094 Email: paul.cor...@ignitionone.com<mailto:paul.cor...@ignitionone.com>
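For programmatic access to running apps, Spark's monitoring REST API is served by the UI (or the history server) under `/api/v1`. A sketch — host and port are assumptions, and on EMR the UI is typically behind a proxy:

```python
# Sketch: build endpoints for Spark's monitoring REST API (/api/v1).
# Host/port are assumptions; adjust for your cluster's UI address.
def app_endpoint(host, port, app_id=None):
    base = "http://{}:{}/api/v1/applications".format(host, port)
    return base if app_id is None else "{}/{}/jobs".format(base, app_id)

# live usage (not executed here):
# import json, urllib.request
# apps = json.load(urllib.request.urlopen(app_endpoint("localhost", 4040)))
```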

Re: Running spark examples in Intellij

2017-10-11 Thread Paul
You say you did the maven package but did you do a maven install and define your local maven repo in SBT? -Paul Sent from my iPhone > On Oct 11, 2017, at 5:48 PM, Stephen Boesch <java...@gmail.com> wrote: > > When attempting to run any example program w/ Intellij I am runnin

Re: is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-08 Thread Paul
You would set the Kafka topic as your data source and you would write a custom output to Cassandra everything would be or could be contained within your stream -Paul Sent from my iPhone > On Sep 8, 2017, at 2:52 PM, kant kodali <kanth...@gmail.com> wrote: > > How can I use o

Structured Streaming from Parquet

2017-05-25 Thread Paul Corley
ws a java OOM error. Additionally each cycle through this step takes successively longer. Hopefully someone can lend some insight as to what is actually taking place in this step and how to alleviate it Thanks, Paul Corley | Principle Data Engineer
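For the symptom above (each micro-batch slower than the last, eventual OOM), one common lever when streaming from a file source is bounding how many files each trigger picks up with `maxFilesPerTrigger`. A sketch with placeholder values and paths:

```python
# Sketch: cap per-trigger work for a Parquet file-source stream.
# The value 100 and the path below are placeholders.
def parquet_stream_options(max_files=100):
    return {"maxFilesPerTrigger": str(max_files)}

# pyspark usage (not executed here):
# stream = (spark.readStream
#           .options(**parquet_stream_options(100))
#           .schema(schema)
#           .parquet("s3://bucket/input/"))
```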

splitting a huge file

2017-04-21 Thread Paul Tremblay
to be split up, right? We ended up using a single machine with a single thread to do the splitting. I just want to make sure I am not missing something obvious. Thanks! -- Paul Henry Tremblay Attunix
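If the file really must be split outside Spark, a single streaming pass with bounded memory suffices; nothing needs the whole file in RAM. A minimal sketch, with an arbitrary chunk size:

```python
# Sketch: split a huge line-oriented file in one pass, bounded memory.
def split_lines(lines, lines_per_chunk):
    """Yield successive chunks of at most `lines_per_chunk` lines."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == lines_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# usage: for i, chunk in enumerate(split_lines(open("big.txt"), 1_000_000)):
#            write chunk to part file i
```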

small job runs out of memory using wholeTextFiles

2017-04-07 Thread Paul Tremblay
the number of partitions, but get the same error each time. In contrast, if I run a simple: rdd = sc.textFile("s3://paulhtremblay/noaa_tmp/") rdd.coutn() The job finishes in 15 minutes, even with just 3 nodes. Thanks -- Paul Henry Tremblay Robert Half Technology

Re: bug with PYTHONHASHSEED

2017-04-05 Thread Paul Tremblay
r 4, 2017 at 7:49 AM Eike von Seggern <eike.seggern@seven >> cal.com> wrote: >> >> 2017-04-01 21:54 GMT+02:00 Paul Tremblay <paulhtremb...@gmail.com>: >> >> When I try to to do a groupByKey() in my spark environment, I get the >> error described her

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Paul Tremblay
So that means I have to pass that bash variable to the EMR clusters when I spin them up, not afterwards. I'll give that a go. Thanks! Henry On Tue, Apr 4, 2017 at 7:49 AM, Eike von Seggern <eike.segg...@sevenval.com> wrote: > 2017-04-01 21:54 GMT+02:00 Paul Tremblay <paulhtremb.

Re: Alternatives for dataframe collectAsList()

2017-04-03 Thread Paul Tremblay
> View this message in context: http://apache-spark-user-list. > 1001560.n3.nabble.com/Alternatives-for-dataframe- > collectAsList-tp28547.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > -----

Re: Read file and represent rows as Vectors

2017-04-03 Thread Paul Tremblay
ailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Paul Henry Tremblay Robert Half Technology

Re: Looking at EMR Logs

2017-04-02 Thread Paul Tremblay
for spark logs) and run the history server like: > ``` > cd /usr/local/src/spark-1.6.1-bin-hadoop2.6 > sbin/start-history-server.sh > ``` > and then open http://localhost:18080 > > > > > On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay <paulhtremb...@gmail.com> > wrot

bug with PYTHONHASHSEED

2017-04-01 Thread Paul Tremblay
I get the same error: Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED Anyone know how to fix this problem in python 3.4? Thanks Henry -- Paul Henry Tremblay Robert Half Technology
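As the replies in this thread note, the fix is to pin `PYTHONHASHSEED` for every Python worker before the job starts — on EMR, at cluster spin-up rather than afterwards. A sketch of the conf involved; the seed value 0 is arbitrary but must match everywhere:

```python
# Sketch: pin PYTHONHASHSEED across executors and (on YARN) the app
# master so Python 3.3+ string hashing is deterministic cluster-wide.
def hashseed_conf(seed=0):
    return {
        "spark.executorEnv.PYTHONHASHSEED": str(seed),
        "spark.yarn.appMasterEnv.PYTHONHASHSEED": str(seed),
    }

# pyspark usage (not executed here):
# conf = SparkConf().setAll(hashseed_conf().items())
```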

pyspark bug with PYTHONHASHSEED

2017-04-01 Thread Paul Tremblay
I get the same error: Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED Anyone know how to fix this problem in python 3.4? Thanks Henry -- Paul Henry Tremblay Robert Half Technology

Looking at EMR Logs

2017-03-30 Thread Paul Tremblay
to evaluate such things as how many tasks were completed, how many executors were used, etc. I currently save my logs to S3. Thanks! Henry -- Paul Henry Tremblay Robert Half Technology

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Paul Tremblay
://michaelryanbell.com/processing-whole-files-spark-s3.html Jon On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <paulhtremb...@gmail.com <mailto:paulhtremb...@gmail.com>> wrote: I've actually been able to trace the problem to the files being read in. If I change to a differe

Re: Turning rows into columns

2017-02-11 Thread Paul Tremblay
On Feb 4, 2017 16:25, "Paul Tremblay" <paulhtremb...@gmail.com <mailto:paulhtremb...@gmail.com>> wrote: I am using pyspark 2.1 and am wondering how to convert a flat file, with one record per row, into a columnar format. Here is an example of the data: u'
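The reshape being asked about — flat key/value rows folded into one record per entity — is a pivot. A batch-side illustration with made-up field names; in pyspark the same result comes from `groupBy().pivot()`:

```python
# Pure-Python illustration of folding (field, value) rows into a record.
def rows_to_record(rows):
    """rows: iterable of (field, value) pairs belonging to one record."""
    return {field: value for field, value in rows}

# pyspark usage (not executed here):
# from pyspark.sql.functions import first
# wide = df.groupBy("record_id").pivot("field").agg(first("value"))
```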

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory? On 02/06/2017 02:35 PM, Paul Tremblay wrote: When I try to create an rdd using wholeTextFiles, I get

wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error. I am using pyspark with spark 2.1. in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/ rdd =

Turning rows into columns

2017-02-04 Thread Paul Tremblay
nks Henry -- Paul Henry Tremblay Robert Half Technology

RE: spark 2.02 error when writing to s3

2017-01-27 Thread VND Tremblay, Paul
Not sure what you mean by "a consistency layer on top." Any explanation would be greatly appreciated! Paul _ Paul Tremblay Analytics Specialist THE BOSTON CONSULTING GROUP Tel.

RE: spark 2.02 error when writing to s3

2017-01-26 Thread VND Tremblay, Paul
This seems to have done the trick, although I am not positive. If I have time, I'll test spinning up a cluster with and without consistent view to pin point the error. _ Paul Tremblay Analytics

RE: Ingesting Large csv File to relational database

2017-01-26 Thread VND Tremblay, Paul
. _ Paul Tremblay Analytics Specialist THE BOSTON CONSULTING GROUP Tel. + ▪ Mobile + _ From: Eric Dain [mailto:ericdai...@gmail.com] Sent: Wednesday, January 25, 2017 11:14 PM

RE: spark 2.02 error when writing to s3

2017-01-20 Thread VND Tremblay, Paul
I am using an EMR cluster, and the latest version offered is 2.02. The link below indicates that that user had the same problem, which seems unresolved. Thanks Paul _ Paul Tremblay Analytics

spark 2.02 error when writing to s3

2017-01-19 Thread VND Tremblay, Paul
iple times and causes the error. The suggestion is to turn off speculation, but I believe speculation is turned off by default in pyspark. Thanks! Paul _ Paul Tremblay Analytics Specialist THE BOSTON

Spark 2.0 Encoder().schema() is sorting StructFields

2016-10-12 Thread Paul Stewart
this be considered a bug/enhancement? Regards, Paul - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: AVRO vs Parquet

2016-03-04 Thread Paul Leclercq
>>>> tools like Hive, Impala, HAWQ. >>>> >>>> Suggestions? >>>> — >>>> airis.DATA >>>> Timothy Spann, Senior Solutions Architect >>>> C: 609-250-5894 >>>> http://airisdata.com/ >>>> http://meetup.com/nj-datascience >>>> >>>> >>>> >>> >>> >>> -- >>> Donald Drake >>> Drake Consulting >>> http://www.drakeconsulting.com/ >>> https://twitter.com/dondrake <http://www.MailLaunder.com/> >>> 800-733-2143 >>> >> >> > -- Paul Leclercq | Data engineer paul.lecle...@tabmo.io | http://www.tabmo.fr/

Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-23 Thread Paul Leclercq
}/{partitionId} {newOffset} Source : https://metabroadcast.com/blog/resetting-kafka-offsets 2016-02-22 11:55 GMT+01:00 Paul Leclercq <paul.lecle...@tabmo.io>: > Thanks for your quick answer. > > If I set "auto.offset.reset" to "smallest" as for KafkaParams like th

Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Paul Leclercq
guration "auto.offset.reset" through parameter > "kafkaParams" which is provided in some other overloaded APIs of > createStream. > > By default Kafka will pick data from latest offset unless you explicitly > set it, this is the behavior Kafka, not Spark. > >

Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Paul Leclercq
offset.reset > to "earliest" for the new consumer in 0.9 and "smallest" for the old > consumer. https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whydoesmyconsumernevergetanydata? Thanks -- Paul Leclercq
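As the thread explains, the receiver-based stream honors Kafka's own consumer config: a new topic is read from the beginning only when the consumer group has no committed offsets and `auto.offset.reset` is set ("smallest" for the pre-0.9 consumer). A sketch of the params dict:

```python
# Sketch: kafkaParams for the receiver approach; "smallest" is the old
# (pre-0.9) consumer's spelling of "earliest". Addresses are placeholders.
def kafka_params(zk_quorum, group_id):
    return {
        "zookeeper.connect": zk_quorum,
        "group.id": group_id,
        "auto.offset.reset": "smallest",
    }

# pyspark usage (not executed here):
# stream = KafkaUtils.createStream(ssc, zk_quorum, group_id, {"topic": 1},
#                                  kafkaParams=kafka_params(zk_quorum, group_id))
```

Note this only applies when the group has no stored offsets; an existing group keeps reading from its committed position regardless.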

Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-20 Thread Paul Leclercq
k.deploy.zookeeper.url="ZOOKEEPER_IP:2181" > -Dspark.deploy.zookeeper.dir="/spark"' A good thing to check if everything went OK is the folder /spark on the ZooKeeper server. I could not find it on my server. Thanks for reading, Paul 2016-01-19 22:12 GMT+01:00 Raghvendra Singh <

Re: Spark streaming job hangs

2015-12-01 Thread Paul Leclercq
- Added > jobs for time 144894989 ms > 2015-12-01 06:04:55,064 [JobGenerator] INFO (Logging.scala:59) - Added > jobs for time 1448949895000 ms > 2015-12-01 06:05:00,125 [JobGenerator] INFO (Logging.scala:59) - Added > jobs for time 144894990 ms > > > Thanks > LCassa > -- Paul Leclercq | Data engineer paul.lecle...@tabmo.io | http://www.tabmo.fr/

unpersist RDD from another thread

2015-09-16 Thread Paul Weiss
has been called? thanks, -paul

Re: unpersist RDD from another thread

2015-09-16 Thread Paul Weiss
ould fail), but > the performance will be unpredictable (some partition may use cache, some > may not be able to use the cache). > > On Wed, Sep 16, 2015 at 1:06 PM, Paul Weiss <paulweiss@gmail.com> > wrote: > >> Hi, >> >> What is the behavior when calli

RE: Too many open files

2015-07-29 Thread Paul Röwer
Maybe you forgot to close a reader or writer object. On 29 July 2015 18:04:59 MESZ, saif.a.ell...@wellsfargo.com wrote: Thank you both, I will take a look, but 1. For high-shuffle tasks, is this right for the system to have the size and thresholds high? I hope there is no bad

Jobs with unknown origin.

2015-07-08 Thread Jan-Paul Bultmann
Hey, I have quite a few jobs appearing in the web-ui with the description run at ThreadPoolExecutor.java:1142. Are these generated by SparkSQL internally? There are so many that they cause a RejectedExecutionException when the thread-pool runs out of space for them. RejectedExecutionException

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
I would guess the opposite is true for highly iterative benchmarks (common in graph processing and data-science). Spark has a pretty large overhead per iteration, more optimisations and planning only makes this worse. Sure people implemented things like dijkstra's algorithm in spark (a problem

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
Sorry, that should be shortest path, and diameter of the graph. I shouldn't write emails before I get my morning coffee... On 06 Jul 2015, at 09:09, Jan-Paul Bultmann janpaulbultm...@me.com wrote: I would guess the opposite is true for highly iterative benchmarks (common in graph processing

generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Jan-Paul Bultmann
org.apache.spark.sql.DataFrame.persist(StorageLevel) DataFrame.scala:1320 ^ | Application logic. | Could someone confirm my suspicion? And does somebody know why it’s called while caching, and why it walks the entire tree including cached results? Cheers, Jan-Paul

Re: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Jan-Paul Bultmann
iterations due to the problem :). As a workaround, you can break the iterations into smaller ones and trigger them manually in sequence. You mean` write` ing them to disk after each iteration? Thanks :), Jan -Original Message- From: Jan-Paul Bultmann [mailto:janpaulbultm...@me.com

Re: build jar with all dependencies

2015-06-02 Thread Paul Röwer
(SparkContext.scala:203) at org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:53) at mgm.tp.bigdata.ma_spark.SparkMain.main(SparkMain.java:38) what i do wrong? best regards, paul

Soft distinct on data frames.

2015-05-28 Thread Jan-Paul Bultmann
Hey, Is there a way to do a distinct operation on each partition only? My program generates quite a few duplicate tuples and it would be nice to remove some of these as an optimisation without having to reshuffle the data. I’ve also noticed that plans generated with an unique transformation have
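A per-partition distinct avoids the reshuffle that a full `distinct()` costs: dedupe each partition locally and accept that duplicates may survive across partitions. A sketch:

```python
# Sketch: local, shuffle-free dedup within each partition.
def dedupe_partition(iterator):
    seen = set()
    for item in iterator:
        if item not in seen:
            seen.add(item)
            yield item

# pyspark usage (not executed here):
# rdd = rdd.mapPartitions(dedupe_partition, preservesPartitioning=True)
```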

spark sql, creating literal columns in java.

2015-05-05 Thread Jan-Paul Bultmann
Hey, What is the recommended way to create literal columns in java? Scala has the `lit` function from `org.apache.spark.sql.functions`. Should it be called from java as well? Cheers jan - To unsubscribe, e-mail:

Re: Jackson-core-asl conflict with Spark

2015-03-12 Thread Paul Brown
So... one solution would be to use a non-Jurassic version of Jackson. 2.6 will drop before too long, and 3.0 is in longer-term planning. The 1.x series is long deprecated. If you're genuinely stuck with something ancient, then you need to include the JAR that contains the class, and 1.9.13 does

Perf impact of BlockManager byte[] copies

2015-02-27 Thread Paul Wais
is possible and leverage it. Cheers, -Paul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Support for SQL on unions of tables (merge tables?)

2015-01-21 Thread Paul Wais
/15 9:51 PM, Paul Wais wrote: Dear List, What are common approaches for addressing over a union of tables / RDDs? E.g. suppose I have a collection of log files in HDFS, one log file per day, and I want to compute the sum of some field over a date range in SQL. Using log schema, I can read

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Paul Wais
To force one instance per executor, you could explicitly subclass FlatMapFunction and have it lazy-create your parser in the subclass constructor. You might also want to try RDD#mapPartitions() (instead of RDD#flatMap() if you want one instance per partition. This approach worked well for me
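The pattern described in the reply — one expensive object per partition via `mapPartitions` — looks like this in Python; `HeavyParser` is a stand-in for whatever is costly to construct:

```python
# Sketch: build the expensive parser once per partition, not per record.
class HeavyParser:
    def __init__(self):
        self.ready = True          # imagine slow initialization here

    def parse(self, line):
        return line.split(",")

def parse_partition(lines):
    parser = HeavyParser()         # one instance per partition
    for line in lines:
        yield parser.parse(line)

# pyspark usage (not executed here):
# parsed = rdd.mapPartitions(parse_partition)
```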

Support for SQL on unions of tables (merge tables?)

2015-01-11 Thread Paul Wais
question: are there plans to use Parquet Index Pages to make Spark SQL faster? E.g. log indices over date ranges would be relevant here. All the best, -Paul

Re: Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-20 Thread Paul Brown
I would suggest checking out disk IO on the nodes in your cluster and then reading up on the limiting behaviors that accompany different kinds of EC2 storage. Depending on how things are configured for your nodes, you may have a local storage configuration that provides bursty IOPS where you get

Using S3 block file system

2014-12-09 Thread Paul Colomiets
how to do it. I use spark 1.2.0rc1 with hadoop 2.4 and Riak CS (instead of S3) if that matters. The s3n:// protocol with same settings work. Thanks. -- Paul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: Parsing a large XML file using Spark

2014-11-21 Thread Paul Brown
Unfortunately, unless you impose restrictions on the XML file (e.g., where namespaces are declared, whether entity replacement is used, etc.), you really can't parse only a piece of it even if you have start/end elements grouped together. If you want to deal effectively (and scalably) with large

Re: Native / C/C++ code integration

2014-11-11 Thread Paul Wais
More thoughts. I took a deeper look at BlockManager, RDD, and friends. Suppose one wanted to get native code access to un-deserialized blocks. This task looks very hard. An RDD behaves much like a Scala iterator of deserialized values, and interop with BlockManager is all on deserialized data.

Native / C/C++ code integration

2014-11-07 Thread Paul Wais
. Is there a way to expose raw, in-memory partition/block data to native code? Has anybody else attacked this problem a different way? All the best, -Paul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Native-C-C-code-integration-tp18347.html Sent from

[SQL] PERCENTILE is not working

2014-11-05 Thread Kevin Paul
Paul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Paul Wais
taking memory. On Oct 30, 2014 6:43 PM, Paul Wais pw...@yelp.com wrote: Dear Spark List, I have a Spark app that runs native code inside map functions. I've noticed that the native code sometimes sets errno to ENOMEM indicating a lack of available memory. However, I've

SchemaRDD.where clause error

2014-10-21 Thread Kevin Paul
Hi all, I tried to use the function SchemaRDD.where() but got some error: val people = sqlCtx.sql(select * from people) people.where('age === 10) console:27: error: value === is not a member of Symbol where did I go wrong? Thanks, Kevin Paul

Re: SparkSQL on Hive error

2014-10-13 Thread Kevin Paul
Thanks Michael, your patch works for me :) Regards, Kelvin Paul On Fri, Oct 3, 2014 at 3:52 PM, Michael Armbrust mich...@databricks.com wrote: Are you running master? There was briefly a regression here that is hopefully fixed by spark#2635 https://github.com/apache/spark/pull/2635. On Fri

Setting SparkSQL configuration

2014-10-13 Thread Kevin Paul
the config using HiveContext's setConf function? Regards, Kelvin Paul

Re: Any issues with repartition?

2014-10-08 Thread Paul Wais
Looks like an OOM issue? Have you tried persisting your RDDs to allow disk writes? I've seen a lot of similar crashes in a Spark app that reads from HDFS and does joins. I.e. I've seen java.io.IOException: Filesystem closed, Executor lost, FetchFailed, etc etc with non-deterministic crashes.

SparkSQL on Hive error

2014-10-03 Thread Kevin Paul
) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) Thanks, Kelvin Paul

Worker Random Port

2014-09-23 Thread Paul Magid
in the spark-env.sh but it does not seem to stop the dynamic port behavior. I have included the startup output when running spark-shell from the edge server in a different dmz and then from a node in the cluster. Any help greatly appreciated. Paul Magid Toyota Motor Sales IS Enterprise

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-19 Thread Paul Wais
Well it looks like this is indeed a protobuf issue. Poked a little more with Kryo. Since protobuf messages are serializable, I tried just making Kryo use the JavaSerializer for my messages. The resulting stack trace made it look like protobuf GeneratedMessageLite is actually using the

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-19 Thread Paul Wais
Derp, one caveat to my solution: I guess Spark doesn't use Kryo for Function serde :( On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais pw...@yelp.com wrote: Well it looks like this is indeed a protobuf issue. Poked a little more with Kryo. Since protobuf messages are serializable, I tried just

Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
:7077 ) ? I've tried poking through the shell scripts and SparkSubmit.scala and unfortunately I haven't been able to grok exactly what Spark is doing with the remote/local JVMs. Cheers, -Paul - To unsubscribe, e-mail: user-unsubscr

Spark SQL Exception

2014-09-18 Thread Paul Magid
, is there a document that lists current Spark SQL limitations/issues? Paul Magid Toyota Motor Sales IS Enterprise Architecture (EA) Architect I RD Ph: 310-468-9091 (X69091) PCN 1C2970, Mail Drop PN12 Successful Result In Impala

RE: Spark SQL Exception

2014-09-18 Thread Paul Magid
identical keys in the input tuples.) SPARK-2926 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle The Exception is included below. Paul Magid Toyota Motor Sales IS Enterprise Architecture (EA) Architect I RD Ph: 310-468-9091 (X69091) PCN 1C2970, Mail Drop PN12 Exception

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
/hadoop-project/pom.xml On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais pw...@yelp.com wrote: Dear List, I'm writing an application where I have RDDs of protobuf messages. When I run the app via bin/spark-submit with --master local --driver-class-path path/to/my/uber.jar, Spark is able to ser

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
* https://github.com/apache/spark/pull/181 * http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3c7f6aa9e820f55d4a96946a87e086ef4a4bcdf...@eagh-erfpmbx41.erf.thomson.com%3E * https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I On Thu, Sep 18, 2014 at 4:51 PM, Paul Wais pw

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
hmm would using kyro help me here? On Thursday, September 18, 2014, Paul Wais pw...@yelp.com wrote: Ah, can one NOT create an RDD of any arbitrary Serializable type? It looks like I might be getting bitten by the same java.io.ObjectInputStream uses root class loader only bugs mentioned

Re: Unable to find proto buffer class error with RDDprotobuf

2014-09-18 Thread Paul Wais
/2f9b2bd7844ee8393dc9c319f4fefedf95f5e460/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L74 If uber.jar is on the classpath, then the root classloader would have the code, hence why --driver-class-path fixes the bug. On Thu, Sep 18, 2014 at 5:42 PM, Paul Wais pw...@yelp.com wrote

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package and hadoop 2.3 / cdh5 from http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz On Mon, Sep 15, 2014 at 6:49 PM, Christian Chua cc8...@icloud.com wrote: Hi Paul. I would recommend building your own 1.1.0

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
.cloudfront.net/spark-1.1.0-bin-hadoop2.3.tgz pom.xml snippets: https://gist.github.com/ypwais/ff188611d4806aa05ed9 [1] http://stackoverflow.com/questions/24747037/how-to-define-a-dependency-scope-in-maven-to-include-a-library-in-compile-run Thanks everybody!! -Paul On Tue, Sep 16, 2014 at 3:55 AM

Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-15 Thread Paul Wais
? Are there distros of Spark 1.1 and hadoop that should work together out-of-the-box? (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..) Thanks for any help anybody can give me here! -Paul - To unsubscribe, e-mail: user

Re: increase parallelism of reading from hdfs

2014-08-11 Thread Paul Hamilton
a NewHadoopRDD. I am sure there is some way to use it with convenience methods like SparkContext.textFile, you could probably set the system property mapreduce.input.fileinputformat.split.maxsize. Regards, Paul Hamilton From: Chen Song chen.song...@gmail.com Date: Friday, August 8, 2014 at 9:13

Re: How to read a multipart s3 file?

2014-08-07 Thread paul
any file larger than 256,000,000 bytes is split. If you don't explicitly set it the limit is infinite which leads to the behavior you are seeing where it is 1 split per file. Regards, Paul Hamilton -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read

Re: Release date for new pyspark

2014-07-17 Thread Paul Wais
in the second half of next month (or shortly thereafter). On Wed, Jul 16, 2014 at 4:03 PM, Paul Wais pw...@yelp.com wrote: Dear List, The version of pyspark on master has a lot of nice new features, e.g. SequenceFile reading, pickle i/o, etc: https://github.com/apache/spark/blob/master

Release date for new pyspark

2014-07-16 Thread Paul Wais
, -Paul Wais

Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Paul Brown
We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR) backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.) — p...@mult.ifario.us |

Re: jackson-core-asl jar (1.8.8 vs 1.9.x) conflict with the spark-sql (version 1.x)

2014-06-28 Thread Paul Brown
Hi, Mans -- Both of those versions of Jackson are pretty ancient. Do you know which of the Spark dependencies is pulling them in? It would be good for us (the Jackson, Woodstox, etc., folks) to see if we can get people to upgrade to more recent versions of Jackson. -- Paul — p

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-25 Thread Paul Brown
Hi, Robert -- I wonder if this is an instance of SPARK-2075: https://issues.apache.org/jira/browse/SPARK-2075 -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Wed, Jun 25, 2014 at 6:28 AM, Robert James srobertja...@gmail.com wrote: On 6/24/14, Robert James

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Paul Brown
/browse/SPARK-2075. Cheers. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Jun 6, 2014 at 2:45 AM, HenriV henri.vanh...@vdab.be wrote: I'm experiencing the same error while upgrading from 0.9.1 to 1.0.0. Im using google compute engine and cloud storage

Unexpected results when caching data

2014-05-12 Thread paul
: 2014050917: 7 2014050918: 12 Any idea what could account for the differences? BTW I am using Spark 0.9.1. Thanks, Paul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unexpected-results-when-caching-data-tp5619.html Sent from the Apache Spark User List mailing

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Paul Brown
Hi, Adrian -- If my memory serves, you need 1.7.7 of the various slf4j modules to avoid that issue. Best. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Mon, May 12, 2014 at 7:51 AM, Adrian Mocanu amoc...@verticalscope.comwrote: Hey guys, I've asked before

CDH 5.0 and Spark 0.9.0

2014-04-30 Thread Paul Schooss
Hello, So I was unable to run the following commands from the spark shell with CDH 5.0 and spark 0.9.0, see below. Once I removed the property property nameio.compression.codec.lzo.class/name valuecom.hadoop.compression.lzo.LzoCodec/value finaltrue/final /property from the core-site.xml on the

Can't run a simple spark application with 0.9.1

2014-04-15 Thread Paul Schooss
Hello, Currently I deployed 0.9.1 spark using a new way of starting up spark exec start-stop-daemon --start --pidfile /var/run/spark.pid --make-pidfile --chuid ${SPARK_USER}:${SPARK_GROUP} --chdir ${SPARK_HOME} --exec /usr/bin/java -- -cp ${CLASSPATH}

Re: Can't run a simple spark application with 0.9.1

2014-04-15 Thread Paul Schooss
I am a dork please disregard this issue. I did not have the slaves correctly configured. This error is very misleading On Tue, Apr 15, 2014 at 11:21 AM, Paul Schooss paulmscho...@gmail.comwrote: Hello, Currently I deployed 0.9.1 spark using a new way of starting up spark exec start

JMX with Spark

2014-04-15 Thread Paul Schooss
Has anyone got this working? I have enabled the properties for it in the metrics.conf file and ensure that it is placed under spark's home directory. Any ideas why I don't see spark beans ?

Shutdown with streaming driver running in cluster broke master web UI permanently

2014-04-11 Thread Paul Mogren
I had a cluster running with a streaming driver deployed into it. I shut down the cluster using sbin/stop-all.sh. Upon restarting (and restarting, and restarting), the master web UI cannot respond to requests. The cluster seems to be otherwise functional. Below is the master's log, showing

CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
Hello, Spark community! My name is Paul. I am a Spark newbie, evaluating version 0.9.0 without any Hadoop at all, and need some help. I run into the following error with the StatefulNetworkWordCount example (and similarly in my prototype app, when I use the updateStateByKey operation). I get
