Writing protobuf RDD to parquet

2023-01-20 Thread David Diebold
control, in case I rely on partitionBy to arrange the files in different folders. But I'm not sure there is a built-in way to convert an RDD of protobuf to a DataFrame in Spark? I would need to rely on this: https://github.com/saurfang/sparksql-protobuf. What do you think? Kind regards, David
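A minimal sketch of the partitionBy approach discussed above, assuming the protobuf messages have first been mapped into a flat case class (Event and its fields are hypothetical) so that Spark can derive a schema:

    import spark.implicits._

    // Hypothetical flat case class standing in for the protobuf message's fields;
    // in the real job each protobuf record would be mapped into it first
    case class Event(userId: String, country: String, amount: Double)

    val events = Seq(Event("u1", "FR", 10.0), Event("u2", "DE", 3.5)).toDS()

    events.write
      .mode("overwrite")
      .partitionBy("country")           // one sub-folder per distinct country value
      .parquet("/tmp/events_parquet")   // placeholder output path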

Question about bucketing and custom partitioners

2022-04-11 Thread David Diebold
any pointers on this ? Thanks ! David

Re: Question about spark.sql min_by

2022-02-21 Thread David Diebold
rk 3.3, up for release soon. It exists in SQL. You can still use it in >> SQL with `spark.sql(...)` in Python though, not hard. >> >> On Mon, Feb 21, 2022 at 4:01 AM David Diebold >> wrote: >> >>> Hello all, >>> >>> I'm trying to use the spar
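A sketch of the workaround Sean describes, with a made-up products DataFrame: until a typed min_by helper is exposed, the SQL function can be reached through expr() (or an equivalent spark.sql query):

    import spark.implicits._
    import org.apache.spark.sql.functions.expr

    // Hypothetical input: one row per (productId, shop, price)
    val prices = Seq((1, "shopA", 9.99), (1, "shopB", 7.49), (2, "shopA", 3.20))
      .toDF("productId", "shop", "price")

    // min_by exists as a SQL function, so call it through expr()
    val cheapest = prices.groupBy("productId")
      .agg(expr("min_by(shop, price) AS cheapestShop"))

    cheapest.show()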

Question about spark.sql min_by

2022-02-21 Thread David Diebold
up by productId") Is there a way I can rely on min_by directly in groupby ? Is there some code missing in pyspark wrapper to make min_by visible somehow ? Thank you in advance for your help. Cheers David

Re: groupMapReduce

2022-01-14 Thread David Diebold
Hello, In RDD api, you must be looking for reduceByKey. Cheers Le ven. 14 janv. 2022 à 11:56, frakass a écrit : > Is there a RDD API which is similar to Scala's groupMapReduce? > https://blog.genuine.com/2019/11/scalas-groupmap-and-groupmapreduce/ > > Thank you. > >
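For illustration, a small sketch (with made-up data) of how Scala's groupMapReduce maps onto the RDD API:

    // Scala collections: words.groupMapReduce(identity)(_ => 1)(_ + _)
    // RDD equivalent: key the records, map each to a value, then reduceByKey
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect()   // Array((a,3), (b,2), (c,1)) -- ordering may vary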

Re: Pyspark debugging best practices

2022-01-03 Thread David Diebold
of join operations makes execution plan too complicated at the end of the day ; checkpointing could help there ? Cheers, David Le jeu. 30 déc. 2021 à 16:56, Andrew Davidson a écrit : > Hi Gourav > > I will give databricks a try. > > Each data gets loaded into a data frame. > I

question about data skew and memory issues

2021-12-14 Thread David Diebold
them. But why would it need to put them in memory when doing an aggregation? It looks to me that aggregation can be performed in a streaming fashion, so I would not expect any OOM at all. Thank you in advance for your insights :) David

Re: Trying to hash cross features with mllib

2021-10-04 Thread David Diebold
Hello Sean, Thank you for the heads-up ! Interaction transform won't help for my use case as it returns a vector that I won't be able to hash. I will definitely dig further into custom transformations though. Thanks ! David Le ven. 1 oct. 2021 à 15:49, Sean Owen a écrit : > Are you look

Trying to hash cross features with mllib

2021-10-01 Thread David Diebold
from QuantileDiscretizer and other cool functions. Am I missing something in the transformation API? Or is my approach to hashing wrong? Or should we consider extending the API somehow? Thank you, kind regards, David

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-11 Thread David Diebold
to see if I can give more work to the executors. Cheers, David Le mar. 10 août 2021 à 12:20, Khalid Mammadov a écrit : > Hi Mich > > I think you need to check your code. > If code does not use PySpark API effectively you may get this. I.e. if you > use pure Python/pandas

Missing stack function from SQL functions API

2021-06-14 Thread david . szakallas
I noticed that the stack SQL function is missing from the functions API. Could we add it?
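In the meantime the SQL stack function is still reachable from the DataFrame API through selectExpr; a small sketch with made-up column names:

    import spark.implicits._

    // Hypothetical wide table: one row per id with three quarterly measurements
    val wide = Seq((1, 10, 20, 30), (2, 40, 50, 60)).toDF("id", "q1", "q2", "q3")

    // stack(n, label1, value1, label2, value2, ...) unpivots the columns into rows
    val long = wide.selectExpr("id", "stack(3, 'q1', q1, 'q2', q2, 'q3', q3) AS (quarter, value)")
    long.show()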

RE: FW: Email to Spark Org please

2021-04-01 Thread Williams, David (Risk Value Stream)
Classification: Public Many thanks for the info. So you wouldn't use sklearn with Spark for large datasets, but use it with smaller datasets and use hyperopt to build models in parallel for hyperparameter tuning on Spark? From: Sean Owen Sent: 26 March 2021 13:53 To: Williams, David (Risk Value

RE: FW: Email to Spark Org please

2021-03-26 Thread Williams, David (Risk Value Stream)
get that working in distributed, will we get benefits similar to spark ML? Best Regards, Dave Williams From: Sean Owen Sent: 26 March 2021 13:20 To: Williams, David (Risk Value Stream) Cc: user@spark.apache.org Subject: Re: FW: Email to Spark Org please -- This email has reached the Bank via

RE: FW: Email to Spark Org please

2021-03-26 Thread Williams, David (Risk Value Stream)
Classification: Limited Many thanks for your response Sean. Question - why is Spark overkill for this and why is sklearn faster please? It's the same algorithm, right? Thanks again, Dave Williams From: Sean Owen <sro...@gmail.com> Sent: 25 March 2021 16:40 To: Williams, David

FW: Email to Spark Org please

2021-03-25 Thread Williams, David (Risk Value Stream)
Classification: Public Hi Team, We are trying to utilize the ML Gradient Boosting Tree classification algorithm and found the performance of the algorithm is very poor during training. We would like to see if we can improve the performance timings since it is taking 2 days for training for a

Re: S3a Committer

2021-02-02 Thread David Morin
Yes, that's true but this is not (yet) the case of the Openstack Swift S3 API Le mar. 2 févr. 2021 à 21:41, Henoc a écrit : > S3 is strongly consistent now > https://aws.amazon.com/s3/consistency/ > > Regards, > Henoc > > On Tue, Feb 2, 2021, 10:27 PM David Morin > wrot

S3a Committer

2021-02-02 Thread David Morin
Hi, I have some issues at the moment with S3 API of Openstack Swift (S3a). This one is eventually consistent and it causes lots of issues with my distributed jobs in Spark. Is the S3A committer able to fix that ? Or an "S3guard like" implementation is the only way ? David

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
8f7%40%3Cuser.spark.apache.org%3E > Probably we may want to add it in the SS guide doc. We didn't need it as > it just didn't work with eventually consistent model, and now it works > anyway but is very inefficient. > > > On Thu, Dec 24, 2020 at 6:16 AM David Morin > wrote: > >> D

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Does it work with the standard AWS S3 solution and its new consistency model <https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/> ? Le mer. 23 déc. 2020 à 18:48, David Morin a écrit : > Thanks. > My Spark applications run on nodes based on d

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
> nodes should be able to read it immediately > > The solutions/workarounds depend on where you are hosting your Spark > application. > > > > *From: *David Morin > *Date: *Wednesday, December 23, 2020 at 11:08 AM > *To: *"user@spark.apache.org" > *Subject:

Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
David

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread David Edwards
After adding the sequential ids you might need a repartition? I've found using monotonically increasing id before that the df goes to a single partition. Usually becomes clear in the Spark UI though On Tue, 6 Oct 2020, 20:38 Sachit Murarka, wrote: > Yes, Even I tried the same first. Then I moved
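A sketch of the sequence being described, with a stand-in DataFrame; the repartition afterwards simply spreads the work back across executors if an upstream step has collapsed the data into one partition:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    val df = spark.range(0, 1000000).toDF("value")             // stand-in input
    val withId = df.withColumn("row_id", monotonically_increasing_id())
    val rebalanced = withId.repartition(200)                   // restore parallelism
    println(rebalanced.rdd.getNumPartitions)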

unsubscribe

2020-04-26 Thread David Aspegren
unsubscribe

Re: wot no toggle ?

2020-04-16 Thread David Hesson
You may want to read about the JVM and have some degree of understanding what you're talking about, and then you'd know that those options have different meanings. You can view both at the same time, for example. On Thu, Apr 16, 2020, 2:13 AM jane thorpe wrote: >

Re: Going it alone.

2020-04-14 Thread David Hesson
> > I want to know if Spark is headed in my direction. > You are implying Spark could be. What direction are you headed in, exactly? I don't feel as if anything were implied when you were asked for use cases or what problem you are solving. You were asked to identify some use cases, of which

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hoff, wrote: > Thanks David, > > I did experiment with the .cache() keyword and have to admit I didn't see > any marked improvement on the sample that I was running, so yes I am a bit > apprehensive including it (not even sure why I actually left it in). > > When

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hi Ashley, I'm not an expert but think this is because spark does lazy execution and doesn't actually perform any actions until you do some kind of write, count or other operation on the dataframe. If you remove the count steps it will work out a more efficient execution plan reducing the number
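A small illustration of that point, with in-memory stand-in data: each count() is an action that forces everything up to it to run, so interleaving counts with transformations repeats work unless the intermediate result is cached first.

    import spark.implicits._

    val df = Seq(("ok", 1), ("bad", 2), ("ok", 3)).toDF("status", "value")  // stand-in input
    val cleaned = df.filter($"status" === "ok")    // transformation only: nothing has executed yet

    // If the intermediate count is really needed, cache first so the plan runs once
    cleaned.cache()
    val okRows = cleaned.count()                   // action: materialises the cache
    cleaned.write.mode("overwrite").parquet("/tmp/cleaned_parquet")   // reuses cached data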

unsubscribe

2020-01-03 Thread David Aspegren
unsubscribe

Re: How to use spark-on-k8s pod template?

2019-11-08 Thread David Mitchell
Are you using Spark 2.3 or above? See the documentation: https://spark.apache.org/docs/latest/running-on-kubernetes.html It looks like you do not need: --conf spark.kubernetes.driver.podTemplateFile='/spark-pod-template.yaml' \ --conf

Re: how to refresh the loaded non-streaming dataframe for each steaming batch ?

2019-09-06 Thread David Zhou
Not yet. Learning Spark On Fri, Sep 6, 2019 at 2:17 PM Shyam P wrote: > cool, but did you find a way or any help or clue? > > On Fri, Sep 6, 2019 at 11:40 PM David Zhou wrote: > >> I have the same question as yours >> >> On Thu, Sep 5, 2019 at 9:18 PM Shyam P

Question on streaming job wait and re-run

2019-09-06 Thread David Zhou
. Thanks. David

Re: how to refresh the loaded non-streaming dataframe for each steaming batch ?

2019-09-06 Thread David Zhou
I have the same question as yours On Thu, Sep 5, 2019 at 9:18 PM Shyam P wrote: > Hi, > > I am using spark-sql-2.4.1v for streaming in my PoC. > > How to refresh the loaded dataframe from an hdfs/cassandra table every time a > new batch of the stream is processed? What is the practice followed in general

Re: Start point to read source codes

2019-09-05 Thread David Zhou
Hi Hichame, Thanks a lot. I forked it. There is a lot of code. I need documentation to guide me on which part I should start from. On Thu, Sep 5, 2019 at 1:30 PM Hichame El Khalfi wrote: > Hey David, > > You can find the source code on GitHub: > https://github.com/apache/spark >

Spark 2.4.3 on Kubernetes Client mode fails

2019-05-26 Thread David Aspegren
) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) What does this mean? Thanks in advance David

RE: Run SQL on files directly

2018-12-08 Thread David Markovitz
Thanks Subhash I am familiar with the other APIs but I am curious about this specific one and I could not figure it out from the git repository. Best regards, David (דודו) Markovitz Technology Solutions Professional, Data Platform Microsoft Israel Mobile: +972-525-834-304 Office: +972-747-119

Run SQL on files directly

2018-12-08 Thread David Markovitz
ng etc.) with this syntax? Thanks Best regards, David (דודו) Markovitz Technology Solutions Professional, Data Platform Microsoft Israel Mobile: +972-525-834-304 Office: +972-747-119-274

Spark event logging with s3a

2018-11-08 Thread David Hesson
We are trying to use spark event logging with s3a as a destination for event data. We added these settings to the spark submits: spark.eventLog.dir s3a://ourbucket/sparkHistoryServer/eventLogs spark.eventLog.enabled true Everything works fine with smaller jobs, and we can see the history data

Seeing a framework registration loop with Spark 2.3.1 on DCOS 1.10.0

2018-09-04 Thread David Hesson
I’m attempting to use Spark 2.3.1 (spark-2.3.1-bin-hadoop2.7.tgz) in cluster mode and running into some issues. This is a cluster where we've had success using Spark 2.2.0 (spark-2.2.0-bin-hadoop2.7.tgz), and I'm simply upgrading our nodes with the new Spark 2.3.1 package and testing it out.

Re: How to add a new source to exsting struct streaming application, like a kafka source

2018-08-01 Thread David Rosenstrauch
On 08/01/2018 12:36 PM, Robb Greathouse wrote: How to unsubscribe? List-Unsubscribe: - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Fwd: Array[Double] two time slower then DenseVector

2018-05-09 Thread David Ignjić
Hello all, I am currently looking into one Spark application to squeeze out a little performance, and with this code (attached in the email) I looked into the difference: in org.apache.spark.sql.catalyst.CatalystTypeConverters.ArrayConverter, if it is primitive we still use the boxing and unboxing version because in the code

spark.python.worker.reuse not working as expected

2018-04-26 Thread David Figueroa
given this code block def return_pid(_): yield os.getpid() spark = SparkSession.builder.getOrCreate() pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect()) print(pids) pids = set(spark.sparkContext.range(32).mapPartitions(return_pid).collect()) print(pids) I was

What's the best way to have Spark a service?

2018-03-15 Thread David Espinosa
incompatibility problems with my scala app), Spark Jobsserver, Finch and I read that I could use also Spark Streams. Thanks in advance, David

Re: how to add columns to row when column has a different encoder?

2018-02-28 Thread David Capwell
interface to add such a UDF? Thanks for your help! On Mon, Feb 26, 2018, 3:50 PM David Capwell <dcapw...@gmail.com> wrote: > I have a row that looks like the following pojo > > case class Wrapper(var id: String, var bytes: Array[Byte]) > > Those bytes are a serialized

Re: SizeEstimator

2018-02-27 Thread David Capwell
n a different stage). Make sure number of executors is small (for example only one) else you are reducing the size of M for each executor. On Mon, Feb 26, 2018, 10:04 PM 叶先进 <advance...@gmail.com> wrote: > What type is for the buffer you mentioned? > > > On 27 Feb 2018, at

Re: SizeEstimator

2018-02-26 Thread David Capwell
rote: > Thanks David. Another solution is to convert the protobuf object to byte > array, It does speed up SizeEstimator > > On Mon, Feb 26, 2018 at 5:34 PM, David Capwell <dcapw...@gmail.com> wrote: > >> This is used to predict the current cost of memory so spark kno

Re: SizeEstimator

2018-02-26 Thread David Capwell
This is used to predict the current cost of memory so Spark knows whether to flush or not. This is very costly for us, so we use a flag marked in the code as private to lower the cost: spark.shuffle.spill.numElementsForceSpillThreshold (on phone, hope no typo) - how many records before a flush. This lowers
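For reference, a sketch of passing that setting in; note spark.shuffle.spill.numElementsForceSpillThreshold is marked internal/private in the Spark source, so treat it as an unsupported knob whose name and default can change between versions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("force-spill-threshold")
      // internal setting: force a spill after this many records, regardless of estimated size
      .config("spark.shuffle.spill.numElementsForceSpillThreshold", "1000000")
      .getOrCreate()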

how to add columns to row when column has a different encoder?

2018-02-26 Thread David Capwell
I have a row that looks like the following pojo case class Wrapper(var id: String, var bytes: Array[Byte]) Those bytes are a serialized pojo that looks like this case class Inner(var stuff: String, var moreStuff: String) I right now have encoders for both the types, but I don't see how to

Re: Encoder with empty bytes deserializes with non-empty bytes

2018-02-21 Thread David Capwell
ent :: Nil) This causes the Java code to see a byte[] which uses a different code path than linked. Since I did ArrayType(ByteType) I had to wrap the data in an ArrayData class On Wed, Feb 21, 2018 at 9:55 PM, David Capwell <dcapw...@gmail.com> wrote: > I am trying to create an Encoder for

Encoder with empty bytes deserializes with non-empty bytes

2018-02-21 Thread David Capwell
I am trying to create an Encoder for protobuf data and noticed something rather weird. When we have an empty ByteString (not null, just empty), when we deserialize we get back an empty array of length 8. I took the generated code and see something weird going on. UnsafeRowWriter 1. public

Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-23 Thread David Rosenstrauch
That sounds like it might fit the bill. I'll take a look - thanks! DR On Mon, Jan 22, 2018 at 11:26 PM, vermanurag wrote: > Looking at description of problem window functions may solve your issue. It > allows operation over a window that can include records

Re: How to hold some data in memory while processing rows in a DataFrame?

2018-01-23 Thread David Rosenstrauch
:41 PM, naresh Goud <nareshgoud.du...@gmail.com> wrote: > If I understand your requirement correct. > Use broadcast variables to replicate across all nodes the small amount of > data you wanted to reuse. > > > > On Mon, Jan 22, 2018 at 9:24 PM David Rosenstrauch
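A minimal sketch of the broadcast-variable suggestion, with made-up lookup data; the map is shipped to each executor once and read from there, rather than being captured and re-serialized with every task:

    // Small reference data, replicated to every executor once
    val countryNames = Map("FR" -> "France", "DE" -> "Germany")
    val bc = spark.sparkContext.broadcast(countryNames)

    val codes = spark.sparkContext.parallelize(Seq("FR", "DE", "FR"))
    val named = codes.map(c => bc.value.getOrElse(c, "unknown"))   // read-only access on executors
    named.collect()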

How to hold some data in memory while processing rows in a DataFrame?

2018-01-22 Thread David Rosenstrauch
This seems like an easy thing to do, but I've been banging my head against the wall for hours trying to get it to work. I'm processing a spark dataframe (in python). What I want to do is, as I'm processing it I want to hold some data from one record in some local variables in memory, and then

spark datatypes

2017-12-03 Thread David Hodefi
t; in byte code. // Defined with a private constructor so the companion object is the only possible instantiation. *private[sql] type InternalType = Int* " I know that Timestamp requires 10 digits that's why we use Long and not Int. Can someone explain why InternalType is Int? Thanks David

Re: Spark SQL - Truncate Day / Hour

2017-11-13 Thread David Hodefi
gt; > import org.apache.spark.sql.functions._ > val df = df.select(hour($"myDateColumn"), dayOfMonth($"myDateColumn"), > dayOfYear($"myDateColumn")) > > 2017-11-09 12:05 GMT+01:00 David Hodefi <davidhodeffi.w...@gmail.com>: > >> I would like to

Spark SQL - Truncate Day / Hour

2017-11-09 Thread David Hodefi
/contributing it as a pull request? Last question is, looking at the DateTimeUtils class code, it seems like the implementation is not using any open library for handling dates, i.e. apache-commons. Why implement it instead of reusing open source? Thanks David

Testing spark e-mail list

2017-11-09 Thread David Hodeffi
Confidentiality: This communication and any attachments are intended for the above-named persons only and may be confidential and/or legally privileged. Any opinions expressed in this communication are not necessarily those of NICE Actimize. If this communication has come to you in error you

RE: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread David Hodeffi
Testing Spark group e-mail

spark sql truncate function

2017-10-30 Thread David Hodeffi
I saw that it is possible to truncate a date with MM or YY, but it is not possible to truncate by WEEK, HOUR, MINUTE. Am I right? Is there any objection to supporting it, or is it just not implemented yet? Thanks David
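For what it's worth, later releases (Spark 2.3.0 onward) added a date_trunc function whose format argument does accept WEEK, HOUR and MINUTE; a sketch with a made-up timestamp column:

    import spark.implicits._
    import org.apache.spark.sql.functions.{col, date_trunc}

    val events = Seq("2017-10-30 13:47:21").toDF("ts")
      .select(col("ts").cast("timestamp").as("ts"))

    events.select(
      date_trunc("WEEK", col("ts")).as("week_start"),
      date_trunc("HOUR", col("ts")).as("hour_start"),
      date_trunc("MINUTE", col("ts")).as("minute_start")
    ).show(false)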

Dynamic Accumulators in 2.x?

2017-10-11 Thread David Capwell
I wrote a spark instrumentation tool that instruments RDDs to give more fine-grain details on what is going on within a Task. This is working right now, but uses volatiles and CAS to pass around this state (which slows down the task). We want to lower the overhead of this and make the main call

add jars to spark's runtime

2017-10-11 Thread David Capwell
We want to emit the metrics out of spark into our own custom store. To do this we built our own sink and tried to add it to spark by doing --jars path/to/jar and defining the class in metrics.properties which is supplied with the job. We noticed that spark kept crashing with the below exception

Re: error in running StructuredStreaming-Kafka integration code (Spark 2.x & Kafka 10)

2017-07-10 Thread David Newberger
Karen, It looks like the Kafka version is incorrect. You mention Kafka 0.10 however the classpath references Kafka 0.9 Thanks, David On July 10, 2017 at 1:44:06 PM, karan alang (karan.al...@gmail.com) wrote: Hi All, I'm running Spark Streaming - Kafka integration using Spark 2.x & Kafk

Re: How to perform clean-up after stateful streaming processes an RDD?

2017-06-06 Thread David Rosenstrauch
n the driver, *but* the > code within the rdd.operation{...} closure is a spark operation and executes > distributed on the executors. > One must be careful of not incorrectly mixing the scopes, in particular when > holding on to local state. > > On Wed, Jun 7, 2017 at 1:08 AM,

Re: How to perform clean-up after stateful streaming processes an RDD?

2017-06-06 Thread David Rosenstrauch
he code within the rdd.operation{...} closure is a spark operation and > executes distributed on the executors. > One must be careful of not incorrectly mixing the scopes, in particular > when holding on to local state. > > > > On Wed, Jun 7, 2017 at 1:08 AM, David Rosenstrauch

How to perform clean-up after stateful streaming processes an RDD?

2017-06-06 Thread David Rosenstrauch
We have some code we've written using stateful streaming (mapWithState) which works well for the most part. The stateful streaming runs, processes the RDD of input data, calls the state spec function for each input record, and does all proper adding and removing from the state cache. However, I

statefulStreaming checkpointing too often

2017-06-01 Thread David Rosenstrauch
I'm running into a weird issue with a stateful streaming job I'm running. (Spark 2.1.0 reading from kafka 0-10 input stream.) From what I understand from the docs, by default the checkpoint interval for stateful streaming is 10 * batchInterval. Since I'm running a batch interval of 10 seconds,
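If the default of roughly 10 x the batch interval doesn't suit the job, the interval can be set explicitly on the state stream; a self-contained sketch (socket source, running count per key, and the 100-second value are all just placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("state-checkpoint-interval")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/state-checkpoints")            // required for mapWithState

    // stand-in state function: running count per key
    def trackCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newCount)
      (key, newCount)
    }

    val input = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))
    val stateStream = input.mapWithState(StateSpec.function(trackCount _))

    // override the default checkpoint interval for the state DStream
    stateStream.checkpoint(Seconds(100))

    stateStream.print()
    ssc.start()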

Re: Kafka 0.8.x / 0.9.x support in structured streaming

2017-05-15 Thread David Kaczynski
I haven't done Structured Streaming in Spark 2.1 with Kafka 0.9.x, but I did do stream processing with Spark 2.0.1 and Kafka 0.10. Here's the official documentation that I used for Spark Streaming with Kafka 0.10: https://spark.apache.org/docs/2.1.0/streaming-kafka-integration.html. It looks like
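For comparison, the Structured Streaming read path against 0.10+ brokers looks roughly like this (broker address and topic are placeholders); the spark-sql-kafka-0-10 source speaks the 0.10+ protocol, which is why older 0.8/0.9 brokers are the sticking point:

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder
      .option("subscribe", "events")                        // placeholder topic
      .load()

    val values = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    val query = values.writeStream
      .format("console")
      .outputMode("append")
      .start()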

Re: running spark program on intellij connecting to remote master for cluster

2017-05-10 Thread David Kaczynski
Do you have Spark installed locally on your laptop with IntelliJ? Are you using the SparkLauncher class or your local spark-submit script? A while back, I was trying to submit a spark job from my local workstation to a remote cluster using the SparkLauncher class, but I didn't actually have

Re: Exactly-once semantics with kakfa CanCommitOffsets.commitAsync?

2017-04-28 Thread David Rosenstrauch
ger <c...@koeninger.org> wrote: > From that doc: > > " However, Kafka is not transactional, so your outputs must still be > idempotent. " > > > > On Fri, Apr 28, 2017 at 10:29 AM, David Rosenstrauch <daro...@gmail.com> > wrote: > >

Spark user list seems to be rejecting/ignoring my emails from other subscribed address

2017-04-28 Thread David Rosenstrauch
I've been subscribed to the user@spark.apache.org list at another email address since 2014. That address receives all emails sent to the list without a problem. But for some reason any emails that *I* send to the list from that address get ignored or rejected. (Similarly, any emails I send to

Exactly-once semantics with kakfa CanCommitOffsets.commitAsync?

2017-04-28 Thread David Rosenstrauch
I'm doing a POC to test recovery with spark streaming from Kafka. I'm using the technique for storing the offsets in Kafka, as described at: https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself I.e., grabbing the list of offsets before I start processing a
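The pattern from that page, sketched end to end (broker, topic and group id are placeholders, and ssc is the usual StreamingContext); offsets are taken from each batch's RDD and committed back to Kafka only after the batch's output succeeds, which still assumes the output itself is idempotent:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",            // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "poc-group",                        // placeholder
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... produce this batch's output here (idempotently) ...
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }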

[MLlib] Multiple estimators for cross validation

2017-03-14 Thread David Leifker
I am hoping to open a discussion around the cross validation in mllib. I found that I often wanted to evaluate multiple estimators/pipelines (with different algorithms) or the same estimator with different parameter grids. The CrossValidator and TrainValidationSplit only allow a single estimator

Order of rows not preserved after cache + count + coalesce

2017-02-13 Thread David Haglund (external)
and pyspark. Regards, /David

Re: Do jobs fail because of other users of a cluster?

2017-01-24 Thread David Frese
and other things), or can the actual value change dynamically, depending on the environment? Anything else that could significantly change the memory requirements from run to run (with same program, data and settings)? -- David

Do jobs fail because of other users of a cluster?

2017-01-18 Thread David Frese
Hello everybody, being quite new to Spark, I am struggling a lot with OutOfMemory exceptions and "GC overhead limit reached" failures of my jobs, submitted from a spark-shell and "master yarn". Playing with --num-executors, --executor-memory and --executor-cores I occasionally get something

Spark Storage Tab is empty

2016-12-26 Thread David Hodeffi
I have tried the following code but didn't see anything on the Storage tab. val myrdd = sc.parallelize(1 to 100) myrdd.setName("my_rdd") myrdd.cache() myrdd.collect() Storage tab is empty, though I can see the stage of collect(). I am using 1.6.2, HDP 2.5, Spark on YARN Th

Re: SPARK -SQL Understanding BroadcastNestedLoopJoin and number of partitions

2016-12-21 Thread David Hodeffi
On Dec 21, 2016, at 14:43, David Hodeffi <david.hode...@niceactimize.com> wrote: I have two dataframes which I am joining, small and big size dataframes. The optimizer suggests to use BroadcastNestedLoopJoin. The number of partitions for the big Data

SPARK -SQL Understanding BroadcastNestedLoopJoin and number of partitions

2016-12-21 Thread David Hodeffi
I have two dataframes which I am joining, small and big size dataframes. The optimizer suggests to use BroadcastNestedLoopJoin. The number of partitions for the big DataFrame is 200 while the small DataFrame has 5 partitions. The joined dataframe results in 205 partitions

RE: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread David Hodeffi
I am not familiar with any problem with that. Anyway, if you run a Spark application you would have multiple jobs, so it makes sense that it is not a problem. Thanks David. From: Naveen [mailto:hadoopst...@gmail.com] Sent: Wednesday, December 21, 2016 9:18 AM To: d...@spark.apache.org; user

Re: What benefits do we really get out of colocation?

2016-12-03 Thread David Mitchell
To get a node local read from Spark to Cassandra, one has to use a read consistency level of LOCAL_ONE. For some use cases, this is not an option. For example, if you need to use a read consistency level of LOCAL_QUORUM, as many use cases demand, then one is not going to get a node local read.

Tracking opened files by Spark application

2016-11-25 Thread David Lauzon
Hi there! Does Apache Spark offers callback or some kind of plugin mechanism that would allow this? For a bit more context, I am looking to create a Record-Replay environment using Apache Spark as the core processing engine. The goals are: 1. Trace the origins of every file generated by Spark:

newAPIHadoopFile throws a JsonMappingException: Infinite recursion (StackOverflowError) error

2016-11-17 Thread David Robison
abind.ser.BeanSerializer.serialize(BeanSerializer.java:156) Any thoughts as to what may be going wrong? David David R Robison Senior Systems Engineer O. +1 512 247 3700 M. +1 757 286 0022 david.robi...@psgglobal.net www.psgglobal.net

RE: submitting a spark job using yarn-client and getting NoClassDefFoundError: org/apache/spark/Logging

2016-11-16 Thread David Robison
ncher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) I've checked and the class does live in the Spark assembly. Any thoughts as to what might be wrong? Best Regards, David R Robison Senior Systems Engineer From: Dav

RE: Problem submitting a spark job using yarn-client as master

2016-11-16 Thread David Robison
r of the JavaSparkContext. I have a logging call right after creating the SparkContext and it is never executed. Any idea what I'm doing wrong? David Best Regards, David R Robison Senior Systems Engineer From: Rohit Verma [mailto:rohit.ve...@rokittech.com]

Problem submitting a spark job using yarn-client as master

2016-11-15 Thread David Robison
I can get this to work? Thanks, David David R Robison Senior Systems Engineer O. +1 512 247 3700 M. +1 757 286 0022 david.robi...@psgglobal.net www.psgglobal.net Prometheus Security Group Global, Inc. 3019 Alvin Devane Boulevard

creating a javaRDD using newAPIHadoopFile and FixedLengthInputFormat

2016-11-15 Thread David Robison
point I get the following exception: Error executing mapreduce job: com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) Any idea what I am doing wrong? I am new to Spark. David David R Robison Senior Systems Engineer O. +1 512 247 3700 M. +1 757

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-02 Thread Michael David Pedersen
Awesome, thank you Michael for the detailed example! I'll look into whether I can use this approach for my use case. If so, I could avoid the overhead of repeatedly registering a temp table for one-off queries, instead registering the table once and relying on the injected strategy. Don't know

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Thanks for the link, I hadn't come across this. According to https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html > > and I quote > > "registerTempTable() > > registerTempTable() creates an in-memory table that is scoped to the > cluster in which

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Hi again Mich, "But the thing is that I don't explicitly cache the tempTables ..". > > I believe tempTable is created in-memory and is already cached > That surprises me since there is a sqlContext.cacheTable method to explicitly cache a table in memory. Or am I missing something? This could

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hi Mich, Thank you again for your reply. As I see you are caching the table already sorted > > val keyValRDDSorted = keyValRDD.sortByKey().cache > > and the next stage is you are creating multiple tempTables (different > ranges) that cache a subset of rows already cached in RDD. The data stored

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hi Mich, Thank you for your quick reply! What type of table is the underlying table? Is it Hbase, Hive ORC or what? > It is a custom datasource, but ultimately backed by HBase. > By Key you mean a UNIQUE ID or something similar and then you do multiple > scans on the tempTable which stores

Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hello, I've got a Spark SQL dataframe containing a "key" column. The queries I want to run start by filtering on the key range. My question in outline: is it possible to sort the dataset by key so as to do efficient key range filters, before subsequently running a more complex SQL query? I'm

Transforming Spark SQL AST with extraOptimizations

2016-10-25 Thread Michael David Pedersen
Hi, I'm wanting to take a SQL string as a user input, then transform it before execution. In particular, I want to modify the top-level projection (select clause), injecting additional columns to be retrieved by the query. I was hoping to achieve this by hooking into Catalyst using
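A bare-bones sketch of that hook: spark.experimental.extraOptimizations accepts custom Rule[LogicalPlan] instances, though this is experimental/internal API, and a real projection rewrite would pattern-match on Project nodes (the rule below is a no-op placeholder):

    import spark.implicits._
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Placeholder rule: a real implementation would rewrite the top-level Project here
    object AddColumnsRule extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    spark.experimental.extraOptimizations ++= Seq(AddColumnsRule)

    Seq((1, "x")).toDF("a", "b").createOrReplaceTempView("some_table")
    spark.sql("SELECT a FROM some_table").show()   // the rule now runs during optimization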

Need Advice: Spark-Streaming Setup

2016-08-01 Thread David Kaufman
n't hesitate to ask. Thanks, David

RE: HBase-Spark Module

2016-07-29 Thread David Newberger
Hi Ben, This seems more like a question for community.cloudera.com. However, it would be in hbase not spark I believe. https://repository.cloudera.com/artifactory/webapp/#/artifacts/browse/tree/General/cloudera-release-repo/org/apache/hbase/hbase-spark David Newberger -Original Message

RE: difference between dataframe and dataframwrite

2016-06-16 Thread David Newberger
DataFrame is a collection of data which is organized into named columns. DataFrame.write is an interface for saving the contents of a DataFrame to external storage. Hope this helps David Newberger From: pseudo oduesp [mailto:pseudo20...@gmail.com] Sent: Thursday, June 16, 2016 9:43 AM
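A two-line illustration of the distinction, with a placeholder output path:

    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")   // DataFrame: the data itself
    df.write                                               // DataFrameWriter: how and where to persist it
      .mode("overwrite")
      .parquet("/tmp/labels_parquet")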

RE: streaming example has error

2016-06-16 Thread David Newberger
Try adding wordCounts.print() before ssc.start() David Newberger From: Lee Ho Yeung [mailto:jobmatt...@gmail.com] Sent: Wednesday, June 15, 2016 9:16 PM To: David Newberger Cc: user@spark.apache.org Subject: Re: streaming example has error got another error StreamingContext: Error starting
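The fix in context, as a minimal sketch of the classic network word count: output operations (print, saveAs..., foreachRDD) have to be declared before ssc.start(), otherwise the context starts with nothing to execute:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)    // fed by: nc -lk 9999
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    wordCounts.print()     // declare the output *before* starting the context
    ssc.start()
    ssc.awaitTermination()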

RE: Limit pyspark.daemon threads

2016-06-15 Thread David Newberger
;", the maximum amount of CPU cores to request for the application from across the cluster (not from each machine). If not set, the default will be spark.deploy.defaultCores on Spark's standalone cluster manager, or infinite (all available cores) on Mesos.” David Newberger From: agateaaa [mailto:ag

RE: Handle empty kafka in Spark Streaming

2016-06-15 Thread David Newberger
which is always running and doing something like windowed batching or microbatching or whatever I'm trying to accomplish. IF an RDD I get from Kafka is empty then I don't run the rest of the job. IF the RDD I get from Kafka has some number of events then I'll process the RDD further. David

RE: Handle empty kafka in Spark Streaming

2016-06-15 Thread David Newberger
If you're asking how to handle no messages in a batch window then I would add an isEmpty check like: dStream.foreachRDD(rdd => { if (!rdd.isEmpty()) ... } Or something like that. David Newberger -Original Message- From: Yogesh Vyas [mailto:informy...@gmail.com] Sent: Wednes
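Slightly expanded, a sketch of that guard (dStream stands in for the Kafka input stream from the original job; what happens inside the branch is up to the job):

    dStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // only pay for the downstream work when the batch actually contains records
        val count = rdd.count()
        println(s"processing batch of $count records")
        // ... rest of the per-batch logic ...
      }
      // empty batch: skip the rest of the job entirely
    }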

RE: streaming example has error

2016-06-15 Thread David Newberger
Have you tried to “set spark.driver.allowMultipleContexts = true”? David Newberger From: Lee Ho Yeung [mailto:jobmatt...@gmail.com] Sent: Tuesday, June 14, 2016 8:34 PM To: user@spark.apache.org Subject: streaming example has error when simulate streaming with nc -lk got error below
