Spark History CORS header ‘Access-Control-Allow-Origin’ missing

2024-08-06 Thread Thomas Mauran
Hello, I'm trying to access the Spark History UI through an Apache Knox proxy but I get the following error: Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at [ https://ithdpdev-ekleszcz01.cern.ch:18080/api/v1/applications?limit=2147483647&status

Help wanted on securing spark with Apache Knox / JWT

2024-07-11 Thread Thomas Mauran
Hello, I am sending this email to the mailing list to get your help on a problem that I can't seem to resolve myself. I am trying to secure the Spark History UI running with YARN as master using Apache Knox. From the Knox configuration point of view I managed to secure the Spark service, i

Use Spark Aggregator in PySpark

2023-04-23 Thread Thomas Wang
Can I use a Spark Aggregator defined in a jar file in PySpark? Thanks. Thomas

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
Thanks Raghavendra, Could you be more specific about how I can use ExpressionEncoder()? More specifically, how can I conform to the return type of Encoder>? Thomas On Sun, Apr 23, 2023 at 9:42 AM Raghavendra Ganesh wrote: > For simple array types setting encoder to ExpressionEncoder()

Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
utEncoder() { return null; } } The part I'm not quite sure about is how to override bufferEncoder and outputEncoder. The default Encoders list does not provide encoding for List. Can someone point me in the right direction? Thanks! Thomas
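
A minimal sketch of what such an Aggregator might look like, assuming Spark 3.x, ARRAY<BIGINT> columns, and an illustrative element-wise sum; ExpressionEncoder supplies the Seq encoders the default Encoders object lacks, and registering the untyped UDAF also addresses the "Use Spark Aggregator in PySpark" question above (all names are illustrative):

    import org.apache.spark.sql.{Encoder, SparkSession}
    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.functions.udaf

    // Illustrative element-wise sum over array-of-long columns.
    object ArraySum extends Aggregator[Seq[Long], Seq[Long], Seq[Long]] {
      def zero: Seq[Long] = Seq.empty
      def reduce(buf: Seq[Long], in: Seq[Long]): Seq[Long] = merge(buf, in)
      def merge(a: Seq[Long], b: Seq[Long]): Seq[Long] =
        if (a.isEmpty) b
        else if (b.isEmpty) a
        else a.zip(b).map { case (x, y) => x + y }
      def finish(buf: Seq[Long]): Seq[Long] = buf
      // ExpressionEncoder covers Seq[Long] for both the buffer and the output.
      def bufferEncoder: Encoder[Seq[Long]] = ExpressionEncoder[Seq[Long]]()
      def outputEncoder: Encoder[Seq[Long]] = ExpressionEncoder[Seq[Long]]()
    }

    // Registering the untyped form makes it callable from SQL, and hence from PySpark.
    val spark = SparkSession.builder().getOrCreate()
    spark.udf.register("array_sum", udaf(ArraySum))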

eqNullSafe breaks Sorted Merge Bucket Join?

2023-03-09 Thread Thomas Wang
.., PartitionFilters: [], PushedFilters: [], ReadSchema: struct If I read this correctly, eqNullSafe is just syntactic sugar that automatically applies a COALESCE to 0? Does Spark consider potential key collisions in this case (e.g. I have a user_id = 0 in my original dataset)? I know that if we apply a UDF on the join condition, it breaks the bucketing, thus forcing rebucketing and resorting. However, I'm wondering whether, in this special case, we can make it work as well? Thanks. Thomas
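
For reference, a minimal way to compare the two join forms and inspect whether the sort-merge bucket join survives; the table and column names are illustrative, and an existing SparkSession is assumed:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val a = spark.table("bucketed_a")   // illustrative bucketed tables
    val b = spark.table("bucketed_b")
    val nullSafe = a.join(b, a("user_id") <=> b("user_id"))   // what eqNullSafe produces
    val plain    = a.join(b, a("user_id") === b("user_id"))
    nullSafe.explain()   // compare the physical plans for extra Exchange/Sort nodes
    plain.explain()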

Spark with External Shuffle Service - using saved shuffle files in the event of executor failure

2021-05-12 Thread Chris Thomas
Hi, I am pretty confident I have observed Spark configured with the Shuffle Service continuing to fetch shuffle files on a node in the event of executor failure, rather than recompute the shuffle files as happens without the Shuffle Service. Can anyone confirm this? (I have a SO question 
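
For context, a hedged sketch of the configuration usually in play here (values illustrative): with the external service enabled, shuffle files already written on a node can still be served after the executor that wrote them has died.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.service.enabled", "true")    // shuffle blocks served by the node-local service
      .set("spark.dynamicAllocation.enabled", "true")  // the setting the service is most often paired with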

Re: 退订 (Unsubscribe)

2021-03-05 Thread Thomas
Please send an empty email to user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. Thanks. On Fri, Mar 5, 2021, at 9:21 PM, 韩天罡 wrote: > Hahaha, brother, did you manage to unsubscribe? > > On 2021-03-05 15:08:35, "吃完药感觉自己萌萌哒" <1356469...@qq.com> wrote: >> Unsubscribe

Fwd: Spark API and immutability

2020-05-25 Thread Chris Thomas
The cache() method on the DataFrame API caught me out. Having learnt that DataFrames are built on RDDs and that RDDs are immutable, when I saw the statement df.cache() in our codebase I thought ‘This must be a bug, the result is not assigned, the statement will have no effect.’ However, I’ve sinc
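
A short sketch of why the un-assigned statement is not a no-op: cache() is a side-effecting call that marks the DataFrame's underlying plan for persistence and returns the same DataFrame (df is an illustrative DataFrame):

    df.cache()               // marks df's plan for caching; no reassignment needed
    val same = df.cache()    // equivalent: cache() returns the same DataFrame
    df.count()               // the first action actually materializes the cache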

[PySpark] TypeError: expected string or bytes-like object

2019-03-05 Thread Thomas Ryck
I am using PySpark through JupyterLab using the Spark distribution provided with *conda install pyspark*, so I run Spark locally. I started using pyspark 2.4.0 but I had a socket issue which I solved by downgrading the package to 2.3.2, so I am using pyspark 2.3.2 at the moment. I am trying to

Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Not authorized to access group: spark-kafka-source-060f3ceb-09f4-4e28-8210-3ef8a845fc92--2038748645-driver-2

2019-02-12 Thread Allu👌🏽 Thomas
onException: Not authorized to access group: spark-kafka-source-060f3ceb-09f4-4e28-8210-3ef8a845fc92--2038748645-driver-2 Thanks, Thomas Thomas

Spark In Memory Shuffle

2018-10-17 Thread thomas lavocat
t state of in memory shuffling? Is it implemented in production? Does the current shuffle still use disks to work? Is it possible to somehow do it in RAM only? Regards, Thomas
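
One hedged workaround sometimes used: shuffle files are always written through spark.local.dir, but pointing that at a RAM-backed filesystem keeps them off spinning disks (the mount point is illustrative and must exist on every worker):

    import org.apache.spark.SparkConf

    // Assumes a tmpfs/ramfs mount prepared on each worker, e.g. /mnt/ramdisk
    val conf = new SparkConf().set("spark.local.dir", "/mnt/ramdisk")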

Re: [Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-04 Thread Thomas Lavocat
ture, tasks/stages are > defined to perform which may result in shuffle. If I understand correctly: * Only shuffle data goes through the driver * The receivers' data stays node-local until a shuffle occurs Is that right? > On Wed, Jul 4, 2018 at 1:56 PM, thomas lavocat < > thomas.lav

[Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-04 Thread thomas lavocat
exchange data themselves, in a shuffle, data also transits to the driver. Is that correct? Thomas

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-11 Thread thomas lavocat
ish of previous batch, if you set "spark.streaming.concurrentJobs" larger than 1, then the current batch could start without waiting for the previous batch (if it is delayed), which will lead to unexpected results. thomas lavocat <mailto:thomas.lavo...@univ-grenoble-alpes.fr>

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
e not independent. What do you mean exactly by not independent? Are several sources joined together dependent? Thanks, Thomas thomas lavocat <mailto:thomas.lavo...@univ-grenoble-alpes.fr>> wrote on Tue, Jun 5, 2018 at 7:17 PM: Hello, Thanks for your answer. On 05/06/2018 11:24

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
h this property. But I'm experiencing scalability issues. With more than 16 receivers spread over 8 executors, my executors no longer receive work from the driver and fall idle. Is there an explanation? Thanks, Thomas

[Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
Hi everyone, I'm wondering whether the property spark.streaming.concurrentJobs should reflect the total number of possible concurrent tasks on the cluster, or a local number of concurrent tasks on one compute node. Thanks for your help. T
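
For reference, the property is set once on the driver and applies to the whole streaming application rather than per node; it controls how many streaming jobs (batches) may run concurrently, as the replies above discuss (value illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.streaming.concurrentJobs", "2")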

CMR: An open-source Data acquisition API for Spark is available

2018-05-23 Thread Thomas Fuller
Hi Folks, Today I've released my open-source CMR API, which is used to acquire data from several data providers directly in Spark. Currently the CMR API offers integration with the following: - Federal Reserve Bank of St. Louis - World Bank - TreasuryDirect.gov - OpenFIGI.com *Of note*: - The

Understand task timing

2018-02-19 Thread Thomas Decaux
es to read 8 M from HDFS » ?  Thomas Decaux

[Spark Core] Limit the task duration (and kill it!)

2018-02-02 Thread Thomas Decaux
Hello, I would like to limit task duration to prevent big tasks such as « SELECT * FROM toto », or limit the CPU time, then kill the task/job. Is that possible? (A kind of watchdog.) Many thanks, Thomas Decaux

Re: How do I save the dataframe data as a pdf file?

2017-12-12 Thread Anthony Thomas
I want to create > pdf files. > I am using scala so hoping to find a easier solution in scala, if not I > will try out your suggestion . > > > On Tue, Dec 12, 2017 at 11:29 AM, Anthony Thomas > wrote: > >> Are you trying to produce a formatted table in a pdf file where

Re: How do I save the dataframe data as a pdf file?

2017-12-12 Thread Anthony Thomas
Are you trying to produce a formatted table in a pdf file where the numbers in the table come from a dataframe? I.e. to present summary statistics or other aggregates? If so I would guess your best bet would be to collect the dataframe as a Pandas dataframe and use the to_latex method. You can then

Re: Spark streaming for CEP

2017-10-24 Thread Thomas Bailet
Hi, we (@hurence) have released an open source middleware based on Spark Streaming over Kafka to do CEP and log mining, called *logisland* (https://github.com/Hurence/logisland/). It has been deployed in production for 2 years now and does a great job. You should have a look. Bye, Thomas

Crash in Unit Tests

2017-09-29 Thread Anthony Thomas
Hi Spark Users, I recently compiled spark 2.2.0 from source on an EC2 m4.2xlarge instance (8 cores, 32G RAM) running Ubuntu 14.04. I'm using Oracle Java 1.8. I compiled using the command: export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m" ./build/mvn -DskipTests -Pnetlib-lgpl clean package

Re: [MLLib]: Executor OutOfMemory in BlockMatrix Multiplication

2017-06-14 Thread Anthony Thomas
esting 3n + logn > matrix blocks in memory at once, which is why it sucks so much. > > Sent from my iPhone > > On Jun 14, 2017, at 7:07 PM, Anthony Thomas wrote: > > I've been experimenting with MlLib's BlockMatrix for distributed matrix > multiplication bu

[MLLib]: Executor OutOfMemory in BlockMatrix Multiplication

2017-06-14 Thread Anthony Thomas
I've been experimenting with MLlib's BlockMatrix for distributed matrix multiplication but consistently run into problems with executors being killed due to memory constraints. The linked gist (here ) has a short example of multiplyi

unsubscribe

2016-06-27 Thread Thomas Ginter
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Pulling data from a secured SQL database

2015-10-30 Thread Thomas Ginter
but those machines do not have access to the databases directly. How can I pull the data from the SQL database on the smaller development machine and then have it distributed to the Spark cluster for processing? Can the driver pull data and then distribute execution? Thanks, Thomas Ginter 801
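
A rough sketch of the pattern asked about, assuming the development machine runs the driver and the pulled data fits in driver memory; connection details, query, and table names are illustrative, and sc is the existing SparkContext:

    import java.sql.DriverManager
    import scala.collection.mutable.ArrayBuffer

    val rows = ArrayBuffer[(Int, String)]()
    val conn = DriverManager.getConnection(
      "jdbc:sqlserver://dbhost:1433;databaseName=mydb", "user", "secret")
    try {
      val rs = conn.createStatement().executeQuery("SELECT id, name FROM some_table")
      while (rs.next()) rows += ((rs.getInt(1), rs.getString(2)))
    } finally conn.close()

    // Ship the driver-side collection out to the cluster for distributed processing.
    val rdd = sc.parallelize(rows)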

Re: IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
ng(hostPort).hasPort, message) } On Wed, Oct 14, 2015 at 2:40 PM, Thomas Dudziak wrote: > It looks like Spark 1.5.1 does not work with IPv6. When > adding -Djava.net.preferIPv6Addresses=true on my dual stack server, the > driver fails with: > > 15/10/14 14:36:01 ERROR SparkConte

IPv6 regression in Spark 1.5.1

2015-10-14 Thread Thomas Dudziak
It looks like Spark 1.5.1 does not work with IPv6. When adding -Djava.net.preferIPv6Addresses=true on my dual stack server, the driver fails with: 15/10/14 14:36:01 ERROR SparkContext: Error initializing SparkContext. java.lang.AssertionError: assertion failed: Expected hostname at scala.Predef$.a

Yahoo's Caffe-on-Spark project

2015-09-29 Thread Thomas Dudziak
http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop I would be curious to learn what the Spark developer's plans are in this area (NNs, GPUs) and what they think of integration with existing NN frameworks like Caffe or Torch. cheers, Tom

Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Thomas Bernhardt
Let me say first, I'm the Esper project lead. Esper is alive and well and not at all obsolete. Esper provides event series analysis via an SQL92-standards event processing language (EPL). It allows you to express situation detection logic very concisely, usually much more concisely than any

Setting Executor memory

2015-09-14 Thread Thomas Gerber
Hello, I was looking for guidelines on what value to set executor memory to (via spark.executor.memory for example). This seems to be important to avoid OOM during tasks, especially in no swap environments (like AWS EMR clusters). This setting is really about the executor JVM heap. Hence, in ord
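
A hedged sketch of the knobs involved (values illustrative): the JVM heap set by spark.executor.memory, plus the off-heap overhead that YARN (e.g. on EMR) must also fit on the node:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "20g")                 // executor JVM heap
      .set("spark.yarn.executor.memoryOverhead", "2048")   // off-heap overhead in MB (YARN mode)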

Accumulator with non-java-serializable value ?

2015-09-09 Thread Thomas Dudziak
I want to use t-digest with foreachPartition and accumulators (essentially, create a t-digest per partition and add that to the accumulator leveraging the fact that t-digests can be added to each other). I can make t-digests kryo-serializable easily but java-serializable is not very easy. Now, when

Cores per executors

2015-09-09 Thread Thomas Gerber
"map"- (if any) and "reduce"-side shuffle sorting is unbound (ExternalAppendOnlyMap and ExternalSorter I guess)? Thanks, Thomas

Re: How to avoid shuffle errors for a large join ?

2015-09-01 Thread Thomas Dudziak
aking it slower. SMJ performance is probably 5x - 1000x better in > 1.5 for your case. > > > On Thu, Aug 27, 2015 at 6:03 PM, Thomas Dudziak wrote: > >> I'm getting errors like "Removing executor with no recent heartbeats" & >> "Missing an output lo

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Thomas Dudziak
imilar problems to this (reduce side failures for large joins (25bn > rows with 9bn)), and found the answer was to further up the > spark.sql.shuffle.partitions=1000. In my case, 16k partitions worked for > me, but your tables look a little denser, so you may want to go even higher. > > On Thu,

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Thomas Dudziak
he answer was to further up the > spark.sql.shuffle.partitions=1000. In my case, 16k partitions worked for > me, but your tables look a little denser, so you may want to go even higher. > > On Thu, Aug 27, 2015 at 6:04 PM Thomas Dudziak wrote: > >> I'm getting err

How to avoid shuffle errors for a large join ?

2015-08-27 Thread Thomas Dudziak
I'm getting errors like "Removing executor with no recent heartbeats" & "Missing an output location for shuffle" errors for a large SparkSql join (1bn rows/2.5TB joined with 1bn rows/30GB) and I'm not sure how to configure the job to avoid them. The initial stage completes fine with some 30k tasks
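
The tuning the replies above converge on, sketched here with an illustrative value: raise spark.sql.shuffle.partitions so each reduce task handles a much smaller slice of the 2.5TB side (sqlContext is the existing SQLContext):

    sqlContext.setConf("spark.sql.shuffle.partitions", "16000")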

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
: > > Have you tried tablesample? You find the exact syntax in the > documentation, but it exactly does what you want > > On Wed, Aug 26, 2015 at 18:12, Thomas Dudziak wrote: > >> Sorry, I meant without reading from all splits. This is a single >> partition in the tab
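
A hedged sketch of the TABLESAMPLE suggestion through HiveContext (table and column names illustrative); note that bucket sampling only prunes input splits when the table is bucketed on the sampled column, otherwise Hive still scans the data, so check the plan:

    // hiveContext: an existing HiveContext
    val sample = hiveContext.sql(
      "SELECT * FROM my_table TABLESAMPLE(BUCKET 1 OUT OF 25 ON user_id) s")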

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
Sorry, I meant without reading from all splits. This is a single partition in the table. On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak wrote: > I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from > and I don't particularly care which rows. Doing a LIMIT un

Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from and I don't particularly care which rows. Doing a LIMIT unfortunately results in two stages where the first stage reads the whole table, and the second then performs the limit with a single worker, which is not very efficien

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-29 Thread Thomas Gerber
of this RDD Which means that when a job uses that RDD, the DAG stops at that RDD and does not look at its parents, as it doesn't have them anymore. It is very similar to saving your RDD and re-loading it as a "fresh" RDD. On Fri, Jun 26, 2015 at 9:14 AM, Thomas Gerber wrote: &

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
used by seeing skipped stages in the job UI. They are > periodically cleaned up based on available space of the configured > spark.local.dirs paths. > > From: Thomas Gerber > Date: Monday, June 29, 2015 at 10:12 PM > To: user > Subject: Shuffle files lifecycle > > Hello

Re: Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
Ah, for #3, maybe this is what *rdd.checkpoint* does! https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD Thomas On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber wrote: > Hello, > > It is my understanding that shuffle are written on disk and that they
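
A minimal sketch of that idea (paths illustrative): checkpointing writes the RDD to reliable storage and truncates its lineage, so earlier shuffle files are no longer needed to recompute it:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    val rdd = sc.textFile("hdfs:///data/input").map(_.split(",")).cache()
    rdd.checkpoint()   // only marks the RDD; the write happens on the next action
    rdd.count()        // materializes both the cache and the checkpoint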

Shuffle files lifecycle

2015-06-29 Thread Thomas Gerber
? Thanks Thomas

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-26 Thread Thomas Gerber
Note that this problem is probably NOT caused directly by GraphX, but GraphX reveals it because as you go further down the iterations, you get further and further away of a shuffle you can rely on. On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber wrote: > Hello, > > We r

Exception when using CLUSTER BY or ORDER BY

2015-05-19 Thread Thomas Dudziak
Under certain circumstances that I haven't yet been able to isolate, I get the following error when doing a HQL query using HiveContext (Spark 1.3.1 on Mesos, fine-grained mode). Is this a known problem or should I file a JIRA for it ? org.apache.spark.SparkException: Can only zip RDDs with same

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
grained scheduler, there is a spark.cores.max config setting that > will limit the total # of cores it grabs. This was there in earlier > versions too. > > Matei > > > On May 19, 2015, at 12:39 PM, Thomas Dudziak wrote: > > > > I read the other day that there will b

Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
I read the other day that there will be a fair number of improvements in 1.4 for Mesos. Could I ask for one more (if it isn't already in there): a configurable limit for the number of tasks for jobs run on Mesos? This would be a very simple yet effective way to prevent a job from dominating the cluster
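
The setting named in the reply above, sketched with an illustrative value, caps the total number of cores a single application may grab on the cluster:

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.cores.max", "64")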

Re: Error communicating with MapOutputTracker

2015-05-15 Thread Thomas Gerber
)? 2. how can we increase the heap for it? Especially when using spark-submit? Thanks, Thomas PS: akka parameter that one might want to increase: # akka timeouts/heartbeats settings multiplied by 10 to avoid problems spark.akka.timeout 1000 spark.akka.heartbeat.pauses 6 spark.akka.failure

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
I've just been through this exact case with shaded guava in our Mesos setup and that is how it behaves there (with Spark 1.3.1). cheers, Tom On Fri, May 15, 2015 at 12:04 PM, Marcelo Vanzin wrote: > On Fri, May 15, 2015 at 11:56 AM, Thomas Dudziak wrote: > >> Actually t

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
Actually the extraClassPath settings put the extra jars at the end of the classpath so they won't help. Only the deprecated SPARK_CLASSPATH puts them at the front. cheers, Tom On Fri, May 15, 2015 at 11:54 AM, Marcelo Vanzin wrote: > Ah, I see. yeah, it sucks that Spark has to expose Optional (

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-15 Thread Thomas Dudziak
This is still a problem in 1.3. Optional is both used in several shaded classes within Guava (e.g. the Immutable* classes) and itself uses shaded classes (e.g. AbstractIterator). This causes problems in application code. The only reliable way we've found around this is to shade Guava ourselves for

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
So, 1. I reduced my -XX:ThreadStackSize to 5m (instead of 10m - default is 1m), which is still OK for my need. 2. I reduced the executor memory to 44GB for a 60GB machine (instead of 49GB). This seems to have helped. Thanks to Matthew and Sean. Thomas On Tue, Mar 24, 2015 at 3:49 PM, Matt

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
Additional notes: I did not find anything wrong with the number of threads (ps -u USER -L | wc -l): around 780 on the master and 400 on executors. I am running on 100 r3.2xlarge. On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber wrote: > Hello, > > I am seeing various crashes in spark

java.lang.OutOfMemoryError: unable to create new native thread

2015-03-24 Thread Thomas Gerber
doesn't help. Does anyone know how to avoid those kinds of errors? Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor extra java options, which might have amplified the problem. Thanks for you help, Thomas

Re: Driver disassociated

2015-03-05 Thread Thomas Gerber
getInt("spark.akka.heartbeat.interval", 1000) > > Cheers > > On Wed, Mar 4, 2015 at 4:09 PM, Thomas Gerber > wrote: > >> Also, >> >> I was experiencing another problem which might be related: >> "Error communicating with MapOutputTracker"

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
Also, I was experiencing another problem which might be related: "Error communicating with MapOutputTracker" (see email in the ML today). I just thought I would mention it in case it is relevant. On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber wrote: > 1.2.1 > > Also, I was

Re: Driver disassociated

2015-03-04 Thread Thomas Gerber
. Thanks, Thomas On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu wrote: > What release are you using ? > > SPARK-3923 went into 1.2.0 release. > > Cheers > > On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber > wrote: > >> Hello, >> >> sometimes, in the *middle*

Driver disassociated

2015-03-04 Thread Thomas Gerber
Hello, sometimes, in the *middle* of a job, the job stops (status is then seen as FINISHED in the master). There isn't anything wrong in the shell/submit output. When looking at the executor logs, I see logs like this: 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker acto

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
I meant spark.default.parallelism of course. On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber wrote: > Follow up: > We re-retried, this time after *decreasing* spark.parallelism. It was set > to 16000 before, (5 times the number of cores in our cluster). It is now > down to 6400

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
the number of tasks it can track? On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber wrote: > Hello, > > We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge workers). > We use spark-submit to start an application. > > We got the following error which leads to a fai

Spark logs in standalone clusters

2015-03-04 Thread Thomas Gerber
. Any other log file? Thanks, Thomas

Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
executors stderr, and all show similar logs, on both runs (see below). As far as we can tell, executors and master have disk space left. *Any suggestion on where to look to understand why the communication with the MapOutputTracker fails?* Thanks Thomas In case it matters, our akka settings

Re: Which OutputCommitter to use for S3?

2015-02-26 Thread Thomas Demoor
FYI. We're currently addressing this at the Hadoop level in https://issues.apache.org/jira/browse/HADOOP-9565 Thomas Demoor On Mon, Feb 23, 2015 at 10:16 PM, Darin McBeath wrote: > Just to close the loop in case anyone runs into the same problem I had. > > By setting --hadoop-m

Re: Executors dropping all memory stored RDDs?

2015-02-24 Thread Thomas Gerber
of disk. So, in case someone else notices a behavior like this, make sure you check your cluster monitor (like ganglia). On Wed, Jan 28, 2015 at 5:40 PM, Thomas Gerber wrote: > Hello, > > I am storing RDDs with the MEMORY_ONLY_SER Storage Level, during the run > of a big job. >

Shuffle Spill

2015-02-20 Thread Thomas Gerber
Hello, in a stage with lots of tasks, I have a few tasks with a large amount of shuffle spill. I scoured the web to understand shuffle spill, and I did not find any simple explanation of the spill mechanism. What I put together is: 1. the shuffle spill can happen when the shuffle is written

StackOverflowError on RDD.union

2015-02-03 Thread Thomas Kwan
I am trying to combine multiple RDDs into 1 RDD, and I am using the union function. I wonder if anyone has seen StackOverflowError as follows: Exception in thread "main" java.lang.StackOverflowError at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.Union
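
A hedged workaround commonly suggested for this: chaining rdd1.union(rdd2).union(rdd3)... builds a lineage as deep as the number of RDDs, while SparkContext.union flattens them into a single UnionRDD (paths illustrative, sc is the existing SparkContext):

    val parts = (1 to 500).map(i => sc.textFile(s"hdfs:///data/part-$i"))
    val combined = sc.union(parts)   // one UnionRDD instead of a 500-deep union chain
    // For lineages built up iteratively, periodic checkpointing also truncates the chain:
    // sc.setCheckpointDir("hdfs:///tmp/checkpoints"); combined.checkpoint()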

Re: performance of saveAsTextFile moving files from _temporary

2015-01-28 Thread Thomas Demoor
final output by using a custom OutputCommitter which does not use a temporary location. Thomas Demoor skype: demoor.thomas mobile: +32 497883833 On Wed, Jan 28, 2015 at 3:54 AM, Josh Walton wrote: > I'm not sure how to confirm how the moving is happening, however, one of > the job

Re: SaveAsTextFile to S3 bucket

2015-01-27 Thread Thomas Demoor
object. It has no effect on the file "/dev/output" which is, as far as S3 cares, another object that happens to share part of the objectname with /dev. Thomas Demoor skype: demoor.thomas mobile: +32 497883833 On Tue, Jan 27, 2015 at 6:33 AM, Chen, Kevin wrote: > When spark saves rdd

Re: Spark and S3 server side encryption

2015-01-27 Thread Thomas Demoor
Spark uses the Hadoop filesystems. I assume you are trying to use s3n:// which, under the hood, uses the 3rd party jets3t library. It is configured through the jets3t.properties file (google "hadoop s3n jets3t") which you should put on Spark's classpath. The setting you are looking for is s3servic

Querying over mutliple (avro) files using Spark SQL

2015-01-13 Thread thomas j
Hi, I have a program that loads a single avro file using spark SQL, queries it, transforms it and then outputs the data. The file is loaded with: val records = sqlContext.avroFile(filePath) val data = records.registerTempTable("data") ... Now I want to run it over tens of thousands of Avro file

Add PredictionIO to Powered by Spark

2015-01-05 Thread Thomas Stone
open source technology, such as Scala, Apache Spark, HBase and Elasticsearch. We are already featured on https://databricks.com/certified-on-spark Kind regards and Happy New Year! Thomas — This page tracks the users of Spark. To add yourself to the list, please email user@spark.apache.org

Re: Problem with StreamingContext - getting SPARK-2243

2014-12-27 Thread Thomas Frisk
>> But I dont think that I have another SparkContext running. Is there any >> way >> I can check this or force kill ? I've tried restarting the server as I'm >> desperate but still I get the same issue. I was not getting this earlier >> today. >> >>

weights not changed with different reg param

2014-12-23 Thread Thomas Kwan
thomas

retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Thomas Kwan
to debug this. thanks in advance thomas

Re: How can I read this avro file using spark & scala?

2014-11-21 Thread thomas j
like this: person.map(r => (r.getInt(2), r)).take(4).collect() Is there any way to specify the column name ("user_id") instead of needing to know/calculate the offset somehow? Thanks again On Fri, Nov 21, 2014 at 11:48 AM, thomas j wrote: > Thanks for the pointer Mich
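
One hedged way to address the column by name rather than by position, assuming a later Spark version (1.3+) where Row carries its schema; person is the Row-based RDD from the thread and "user_id" is the column in question:

    // Look the field up by name on each Row:
    person.map(r => (r.getAs[Int]("user_id"), r)).take(4)
    // Or resolve the offset once from the schema and reuse it:
    val idx = person.first().fieldIndex("user_id")
    person.map(r => (r.getInt(idx), r)).take(4)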

Re: How can I read this avro file using spark & scala?

2014-11-21 Thread thomas j
ption: >> org.apache.avro.mapred.AvroWrapper >> >> How can I read the following sample file in spark using scala? >> >> http://www.4shared.com/file/SxnYcdgJce/sample.html >> >> Thomas >> > >

Re: Integrating Spark with other applications

2014-11-07 Thread Thomas Risberg
ays we now support Hive and Pig jobs in the spring-hadoop project. In fact, I added a spring-hadoop-spark sub-project earlier, but there is no real code there yet. Hoping to get this added soon, so some helpful pointers would be great. -Thomas [1] https://github.com/spring-projects/spring-hadoop/t

unsubscribe

2014-10-28 Thread Ricky Thomas

How to set JAVA_HOME with --deploy-mode cluster

2014-10-27 Thread Thomas Risberg
ssBuilder.start(ProcessBuilder.java:1028) ... 4 more Thanks, Thomas

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-08 Thread Thomas Nieborowski
--- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Thomas Nieborowski 510-207-7049 mobile 510-339-1716 home

Re: Save an RDD to a SQL Database

2014-08-07 Thread Thomas Nieborowski
ive at Nabble.com. >> >> ----- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > -- Thomas Nieborowski 510-207-7049 mobile 510-339-1716 home

Re: RDD registerAsTable gives error on regular scala class records

2014-07-10 Thread Thomas Robert
// 30 ) extends Product { ... } I managed to register tables larger than 22 columns with this method. Bye. -- *Thomas ROBERT* www.creativedata.fr 2014-07-10 14:39 GMT+02:00 Kefah Issa : > Hi, > > SQL on spark 1.0 is an interesting feature. It works fine when the > &q

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-04 Thread Thomas Robert
I'm using those right... Thanks, -- *Thomas ROBERT* www.creativedata.fr 2014-07-03 16:16 GMT+02:00 Eustache DIEMERT : > Printing the model show the intercept is always 0 :( > > Should I open a bug for that ? > > > 2014-07-02 16:11 GMT+02:00 Eustache DIEMERT :

Re: Using Spark

2014-06-22 Thread Ricky Thomas
Awesome, thanks On Sunday, June 22, 2014, Matei Zaharia wrote: > Alright, added you. > > On Jun 20, 2014, at 2:52 PM, Ricky Thomas > wrote: > > Hi, > > Would like to add ourselves to the user list if possible please? > > Company: truedash > url: truedash.io

Fwd: Using Spark

2014-06-20 Thread Ricky Thomas
Hi, We would like to add ourselves to the user list if possible please? Company: truedash url: truedash.io Automatic pulling of all your data into Spark for enterprise visualisation, predictive analytics and data exploration at a low cost. Currently in development with a few clients. Thanks

Task splitting among workers

2014-04-19 Thread David Thomas
During a Spark stage, how are tasks split among the workers? Specifically for a HadoopRDD, who determines which worker has to get which task?

Checkpoint Vs Cache

2014-04-13 Thread David Thomas
What is the difference between checkpointing and caching an RDD?

Re: Resilient nature of RDD

2014-04-03 Thread David Thomas
but the > re-computation will occur on an executor. So if several partitions are > lost, e.g. due to a few machines failing, the re-computation can be striped > across the cluster making it fast. > > > On Wed, Apr 2, 2014 at 11:27 AM, David Thomas wrote: > >> Can someone e

Resilient nature of RDD

2014-04-02 Thread David Thomas
Can someone explain how RDD is resilient? If one of the partition is lost, who is responsible to recreate that partition - is it the driver program?

Spark webUI - application details page

2014-03-30 Thread David Thomas
Is there a way to see 'Application Detail UI' page (at master:4040) for completed applications? Currently, I can see that page only for running applications, I would like to see various numbers for the application after it has completed.

Re: Replicating RDD elements

2014-03-28 Thread David Thomas
ttp://in.linkedin.com/in/sonalgoyal> > > > > > On Fri, Mar 28, 2014 at 9:24 AM, David Thomas wrote: > >> How can we replicate RDD elements? Say I have 1 element and 100 nodes in >> the cluster. I need to replicate this one item on all the nodes i.e. >> effectively create an RDD of 100 elements. >> > >

Replicating RDD elements

2014-03-27 Thread David Thomas
How can we replicate RDD elements? Say I have 1 element and 100 nodes in the cluster. I need to replicate this one item on all the nodes i.e. effectively create an RDD of 100 elements.
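
One hedged way to fan a single driver-side value out into an RDD of N elements; note that Spark decides where partitions land, so this does not strictly pin one copy per node, and if the real goal is simply to have the value available everywhere, a broadcast variable is usually the better fit:

    val element = "payload"                                             // illustrative value
    val replicated = sc.parallelize(1 to 100, 100).map(_ => element)    // 100 elements, 100 partitions
    // Alternative when the goal is availability on every executor:
    val shared = sc.broadcast(element)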

Round Robin Partitioner

2014-03-13 Thread David Thomas
Is it possible to partition the RDD elements in a round robin fashion? Say I have 5 nodes in the cluster and 5 elements in the RDD. I need to ensure each element gets mapped to each node in the cluster.
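
A hedged sketch of one approach: key each element by its index and use a custom Partitioner that assigns index i to partition i mod numPartitions. Nothing here forces Spark to place exactly one partition per node, though; it only spreads elements evenly across partitions:

    import org.apache.spark.Partitioner

    class RoundRobinPartitioner(val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = (key.asInstanceOf[Long] % numPartitions).toInt
    }

    val data = sc.parallelize(Seq("a", "b", "c", "d", "e"))
    val spread = data.zipWithIndex()                  // (element, index)
      .map { case (v, i) => (i, v) }
      .partitionBy(new RoundRobinPartitioner(5))
      .values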

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
Spark runtime/scheduler traverses the DAG starting from > that RDD and triggers evaluation of anything parent RDDs it needs that > aren't computed and cached yet. > > Any future operations build on the same DAG as long as you use the same > RDD objects and, if you used cache

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
ld be lazy, but > apparently uses an RDD.count call in its implementation: > https://spark-project.atlassian.net/browse/SPARK-1021). > > David Thomas > March 11, 2014 at 9:49 PM > For example, is distinct() transformation lazy? > > when I see the Spark source code, distin

Are all transformations lazy?

2014-03-11 Thread David Thomas
For example, is the distinct() transformation lazy? When I see the Spark source code, distinct applies a map -> reduceByKey -> map function to the RDD elements. Why is this lazy? Won't the function be applied immediately to the elements of the RDD when I call someRDD.distinct? /** * Return a new RDD
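
The reason, sketched below: map and reduceByKey are themselves transformations, so the body of distinct() only builds up lineage; nothing touches the data until an action runs (sc is the existing SparkContext):

    val rdd = sc.parallelize(Seq(1, 2, 2, 3))
    // Roughly what distinct() does internally -- this still only builds the DAG:
    val distinctRdd = rdd.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)
    distinctRdd.collect()   // the map -> reduceByKey -> map pipeline executes only now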
