Re: Query around Spark Checkpoints

2020-09-29 Thread Bryan Jeffrey
Jungtaek, How would you contrast stateful streaming with checkpoint vs. the idea of writing updates to a Delta Lake table, and then using the Delta Lake table as a streaming source for our state stream? Thank you, Bryan On Mon, Sep 28, 2020 at 9:50 AM Debabrata Ghosh wrote: > Thank

Re: Metrics Problem

2020-07-10 Thread Bryan Jeffrey
On Thu, Jul 2, 2020 at 2:33 PM Bryan Jeffrey wrote: > Srinivas, > > I finally broke a little bit of time free to look at this issue. I > reduced the scope of my ambitions and simply cloned the ConsoleSink and > ConsoleReporter classes. After doing so I can see the original

Re: Metrics Problem

2020-06-28 Thread Bryan Jeffrey
Srinivas, Interestingly, I did have the metrics jar packaged as part of my main jar. It worked well both on driver and locally, but not on executors. Regards, Bryan Jeffrey Get Outlook for Android<https://aka.ms/ghei36> From: Srinivas V Sent: Saturday

Re: Metrics Problem

2020-06-26 Thread Bryan Jeffrey
providers which appear to work. Regards, Bryan From: Srinivas V Sent: Friday, June 26, 2020 9:47:52 PM To: Bryan Jeffrey Cc: user Subject: Re: Metrics Problem It should work when you are giving hdfs path a

Re: Metrics Problem

2020-06-26 Thread Bryan Jeffrey
It may be helpful to note that I'm running in Yarn cluster mode. My goal is to avoid having to manually distribute the JAR to all of the various nodes as this makes versioning deployments difficult. On Thu, Jun 25, 2020 at 5:32 PM Bryan Jeffrey wrote: > Hello. > > I am running Spark

Metrics Problem

2020-06-25 Thread Bryan Jeffrey
l.scala:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) ... 4 more Is it possible to pass a metrics JAR via --jars? If so what am I missing? Thank you, Bryan

Custom Metrics

2020-06-18 Thread Bryan Jeffrey
. Is there a suggested mechanism? Thank you, Bryan Jeffrey

Data Source - State (SPARK-28190)

2020-03-30 Thread Bryan Jeffrey
to be potentially addressed by your ticket SPARK-28190 - "Data Source - State". I see little activity on this ticket. Can you help me to understand where this feature currently stands? Thank you, Bryan Jeffrey

Re: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-29 Thread Bryan Jeffrey
like to come up to speed on the right way to validate changes and perhaps contribute myself. Regards, Bryan Jeffrey On Sat, Feb 29, 2020 at 9:52 AM Jungtaek Lim wrote: > Forgot to mention - it only occurs the SQL type of UDT is having fixed > length. If the UDT is used to represent c

Fwd: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Bryan Jeffrey
github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/types/SQLUserDefinedType.java>` > on the class definition. You dont seem to have done it, maybe thats the > reason? > > I would debug by printing the values in the serialize/deserialize methods, &g

Re: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Bryan Jeffrey
Perfect. I'll give this a shot and report back. From: Tathagata Das Sent: Friday, February 28, 2020 6:23:07 PM To: Bryan Jeffrey Cc: user Subject: Re: Structured Streaming: mapGroupsWithState UDT seriali

Re: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Bryan Jeffrey
serialization bug also causes issues outside of stateful streaming, as when executing a simple group by. Regards, Bryan Jeffrey From: Tathagata Das Sent: Friday, February 28, 2020 4:56:07 PM To: Bryan Jeffr

Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Bryan Jeffrey
FooWithDate = FooWithDate(b.date, a.s + b.s, a.i + b.i) } The test output shows the invalid date: org.scalatest.exceptions.TestFailedException: Array(FooWithDate(2021-02-02T19:26:23.374Z,Foo,1), FooWithDate(2021-02-02T19:26:23.374Z,FooFoo,6)) did not contain the same elements as Array(FooWithDate(2020-01-02T03:04:05.006Z,Foo,1), FooWithDate(2020-01-02T03:04:05.006Z,FooFoo,6)) Is this something folks have encountered before? Thank you, Bryan Jeffrey

Accumulator v2

2020-01-21 Thread Bryan Jeffrey
, Bryan Jeffrey

Structured Streaming & Enrichment Broadcasts

2019-11-18 Thread Bryan Jeffrey
read an external data source and do a fast lookup for a streaming input. One option appears to be to do a broadcast left outer join? In the past this mechanism has been less easy to performance tune than doing an explicit broadcast and lookup. Regards, Bryan Jeffrey

Re: PySpark Pandas UDF

2019-11-17 Thread Bryan Cutler
und > this so it’s easier to debug. I’m CCing Bryan who might have some thoughts. > > On Tue, Nov 12, 2019 at 7:42 AM gal.benshlomo > wrote: > >> SOLVED! >> thanks for the help - I found the issue. it was the version of pyarrow >> (0.15.1) which apparently isn't cur

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-12 Thread Bryan Cutler
, 2019 at 6:08 PM Hyukjin Kwon wrote: >> > >> > +1 >> > >> > On Wed, Nov 6, 2019 at 11:38 PM, Wenchen Fan wrote: >> >> >> >> Sounds reasonable to me. We should make the behavior consistent within >> Spark. >> >> >> >>

[DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-04 Thread Bryan Cutler
Currently, when a PySpark Row is created with keyword arguments, the fields are sorted alphabetically. This has created a lot of confusion with users because it is not obvious (although it is stated in the pydocs) that they will be sorted alphabetically. Then later when applying a schema and the
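A minimal plain-Python sketch of the legacy behavior described above (no Spark required; `make_row_legacy` is an illustrative name, not a PySpark API):

```python
# Pre-3.0 PySpark sorted keyword fields alphabetically when building a Row,
# regardless of the order the caller wrote them. This mimics that behavior.
def make_row_legacy(**kwargs):
    """Returns (field_names, values) with fields sorted, not in call order."""
    fields = sorted(kwargs)  # alphabetical sort, the source of the confusion
    return tuple(fields), tuple(kwargs[f] for f in fields)

fields, values = make_row_legacy(name="Alice", age=11)
# The caller wrote name first, but "age" sorts before "name":
assert fields == ("age", "name")
assert values == (11, "Alice")
```

This is why a schema applied later by position could silently mismatch the fields the caller thought they declared.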

Re: question about pyarrow.Table to pyspark.DataFrame conversion

2019-09-10 Thread Bryan Cutler
Hi Artem, I don't believe this is currently possible, but it could be a great addition to PySpark since this would offer a convenient and efficient way to parallelize nested column data. I created the JIRA https://issues.apache.org/jira/browse/SPARK-29040 for this. On Tue, Aug 27, 2019 at 7:55

Driver - Stops Scheduling Streaming Jobs

2019-08-27 Thread Bryan Jeffrey
anything pop out. Has anyone else seen this behavior? Any thoughts on debugging? Regards, Bryan Jeffrey

Re: Usage of PyArrow in Spark

2019-07-18 Thread Bryan Cutler
already have pandas_udfs, I'm not sure if it would be worth the effort but it might be a good experiment to see how much improvement it would bring. Bryan On Thu, Jul 18, 2019 at 12:02 AM Abdeali Kothari wrote: > I was thinking of implementing that. But quickly realized that doing a > conv

Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Bryan Cutler
+1 and the draft sounds good On Thu, May 30, 2019, 11:32 AM Xiangrui Meng wrote: > Here is the draft announcement: > > === > Plan for dropping Python 2 support > > As many of you already knew, Python core development team and many > utilized Python packages like Pandas and NumPy will drop

Re: pySpark - pandas UDF and binaryType

2019-05-02 Thread Bryan Cutler
Hi, BinaryType support was not added until Spark 2.4.0, see https://issues.apache.org/jira/browse/SPARK-23555. Also, pyarrow 0.10.0 or greater is require as you saw in the docs. Bryan On Thu, May 2, 2019 at 4:26 AM Nicolas Paris wrote: > Hi all > > I am using pySpark 2.3.0 and pyArr

Driver OOM does not shut down Spark Context

2019-01-31 Thread Bryan Jeffrey
the Streaming application never terminated. No new batches were started. As a result my job did not process data for some period of time (until our ancillary monitoring noticed the issue). *Ask: What can we do to ensure that the driver is shut down when this type of exception occurs?* Regards, Bryan

Re: spark2.4 arrow enabled true,error log not returned

2019-01-10 Thread Bryan Cutler
Hi, could you please clarify if you are running a YARN cluster when you see this problem? I tried on Spark standalone and could not reproduce. If it's on a YARN cluster, please file a JIRA and I can try to investigate further. Thanks, Bryan On Sat, Dec 15, 2018 at 3:42 AM 李斌松 wrote

Re: ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-31 Thread Bryan Jeffrey
Cody, Yes - I was able to verify that I am not seeing duplicate calls to createDirectStream. If the spark-streaming-kafka-0-10 will work on a 2.3 cluster I can go ahead and give that a shot. Regards, Bryan Jeffrey On Fri, Aug 31, 2018 at 11:56 AM Cody Koeninger wrote: > Just to be 100% s

Re: ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-31 Thread Bryan Jeffrey
. Regards, Bryan Jeffrey On Thu, Aug 30, 2018 at 2:56 PM Cody Koeninger wrote: > I doubt that fix will get backported to 2.3.x > > Are you able to test against master? 2.4 with the fix you linked to > is likely to hit code freeze soon. > > From a quick look at your code

ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-30 Thread Bryan Jeffrey
ructType(Array(StructField("value", BinaryType df.selectExpr("CAST('' AS STRING)", "value") .write .format("kafka") .option("kafka.bootstrap.servers", getBrokersToUse(brokers)) .option("compression.type", "gzip") .option("retries", "3") .option("topic", topic) .save() } Regards, Bryan Jeffrey

Re: Use Arrow instead of Pickle without pandas_udf

2018-07-30 Thread Bryan Cutler
Here is a link to the JIRA for adding StructType support for scalar pandas_udf https://issues.apache.org/jira/browse/SPARK-24579 On Wed, Jul 25, 2018 at 3:36 PM, Hichame El Khalfi wrote: > Hey Holden, > Thanks for your reply, > > We currently using a python function that produces a

Re: Arrow type issue with Pandas UDF

2018-07-19 Thread Bryan Cutler
by adding more keys or sampling a fraction of the data? If the problem persists could you make a jira? At the very least a better exception would be nice. Bryan On Thu, Jul 19, 2018, 7:07 AM Patrick McCarthy wrote: > PySpark 2.3.1 on YARN, Python 3.6, PyArrow 0.8. > > I'm trying to run a p

Heap Memory in Spark 2.3.0

2018-07-16 Thread Bryan Jeffrey
ocation in Spark 2.3.0 that would cause these issues? Thank you for any help you can provide. Regards, Bryan Jeffrey

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
Cody, Thank you. Let me see if I can reproduce this. We're not seeing offsets load correctly on startup - but perhaps there is an error on my side. Bryan From: Cody Koeninger Sent: Thursday, June 14, 2018 5

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
Cody, Where is that called in the driver? The only call I see from Subscribe is to load the offset from checkpoint. From: Cody Koeninger Sent: Thursday, June 14, 2018 4:24:58 PM To: Bryan Jeffrey Cc: user S

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
Cody, Can you point me to the code that loads offsets? As far as I can see with Spark 2.1, the only offset load is from checkpoint. Thank you! Bryan From: Cody Koeninger Sent: Thursday, June 14, 2018 4:00

Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
reaming Kafka library. How is this expected to work? Is there an example of saving the offsets to Kafka and then loading them from Kafka? Regards, Bryan Jeffrey

Re: Pandas UDF for PySpark error. Big Dataset

2018-05-29 Thread Bryan Cutler
Can you share some of the code used, or at least the pandas_udf plus the stacktrace? Also does decreasing your dataset size fix the oom? On Mon, May 28, 2018, 4:22 PM Traku traku wrote: > Hi. > > I'm trying to use the new feature but I can't use it with a big dataset > (about 5 million rows).

Re: OneHotEncoderEstimator - java.lang.NoSuchMethodError: org.apache.spark.sql.Dataset.withColumns

2018-05-18 Thread Bryan Cutler
The example works for me, please check your environment and ensure you are using Spark 2.3.0 where OneHotEncoderEstimator was introduced. On Fri, May 18, 2018 at 12:57 AM, Matteo Cossu wrote: > Hi, > > are you sure Dataset has a method withColumns? > > On 15 May 2018 at

Re: How to use StringIndexer for multiple input /output columns in Spark Java

2018-05-16 Thread Bryan Cutler
Yes, the workaround is to create multiple StringIndexers as you described. OneHotEncoderEstimator is only in Spark 2.3.0, you will have to use just OneHotEncoder. On Tue, May 15, 2018, 8:40 AM Mina Aslani wrote: > Hi, > > So, what is the workaround? Should I create

Re: [Arrow][Dremio]

2018-05-15 Thread Bryan Cutler
this. Bryan On Mon, May 14, 2018 at 9:55 AM, Pierce Lamb <richard.pierce.l...@gmail.com> wrote: > Hi Xavier, > > Along the lines of connecting to multiple sources of data and replacing > ETL tools you may want to check out Confluent's blog on building a > real-time streaming ETL pipel

Re: Spark dataset to byte array over grpc

2018-04-23 Thread Bryan Cutler
lic. Your client could consume Arrow data directly or perhaps use spark.sql ColumnarBatch to read back as Rows. Bryan On Mon, Apr 23, 2018 at 11:49 AM, Ashwin Sai Shankar < ashan...@netflix.com.invalid> wrote: > Hi! > I'm building a spark app which runs a spark-sql query and send results

Re: PySpark ML: Get best set of parameters from TrainValidationSplit

2018-04-16 Thread Bryan Cutler
Hi Aakash, First you will want to get the random forest model stage from the best pipeline model result, for example if RF is the first stage: rfModel = model.bestModel.stages[0] Then you can check the values of the params you tuned like this: rfModel.getNumTrees On Mon, Apr 16, 2018 at

Re: [Spark 2.x Core] Adding to ArrayList inside rdd.foreach()

2018-04-07 Thread Bryan Jeffrey
You can just call rdd.flatMap(_._2).collect From: klrmowse Sent: Saturday, April 7, 2018 1:29:34 PM To: user@spark.apache.org Subject: Re: [Spark 2.x Core] Adding to ArrayList inside

Re: is there a way of register python UDF using java API?

2018-04-02 Thread Bryan Cutler
Hi Kant, The udfDeterministic would be set to false if the results from your UDF are non-deterministic, such as produced by random numbers, so the catalyst optimizer will not cache and reuse results. On Mon, Apr 2, 2018 at 12:11 PM, kant kodali wrote: > Looks like there is

Re: Return statements aren't allowed in Spark closures

2018-02-21 Thread Bryan Jeffrey
Lian, You're writing Scala. Just remove the 'return'. No need for it in Scala. From: Lian Jiang Sent: Wednesday, February 21, 2018 4:16:08 PM To: user Subject: Return statements aren't

Re: PySpark Tweedie GLM

2018-02-09 Thread Bryan Cutler
Can you provide some code/data to reproduce the problem? On Fri, Feb 9, 2018 at 9:42 AM, nhamwey wrote: > I am using Spark 2.2.0 through Python. > > I am repeatedly getting a zero weight of sums error when trying to run a > model. This happens even when I do not

Re: ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-02-08 Thread Bryan Cutler
apache.org/jira/browse/SPARK-10931 and the example now shows all model params. The fix will be in the Spark 2.3 release. Bryan On Wed, Jan 31, 2018 at 10:20 PM, Nicolas Paris <nipari...@gmail.com> wrote: > Hey > > I am also interested in how to get those parameters. > For exampl

Re: ML:One vs Rest with crossValidator for multinomial in logistic regression

2018-01-30 Thread Bryan Cutler
Hi Michelle, Your original usage of ParamGridBuilder was not quite right, `addGrid` expects (some parameter, array of values for that parameter). If you want to do a grid search with different regularization values, you would do the following: val paramMaps = new

Re: Timestamp changing while writing

2018-01-15 Thread Bryan Cutler
Spark internally stores timestamps as UTC values, so createDataFrame will convert from the local time zone to UTC. I think there was a JIRA to correct parquet output. Are the values you are seeing offset from your local time zone? On Jan 11, 2018 4:49 PM, "sk skk" wrote: >
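The conversion described above can be illustrated in plain Python (no Spark): a timestamp carrying a local offset converts to the same instant expressed in UTC, which is the analogous form Spark stores internally. The example offset is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# A wall-clock time in a UTC-8 zone...
local = datetime(2018, 1, 11, 16, 49, tzinfo=timezone(timedelta(hours=-8)))

# ...is the same instant as this UTC value (8 hours later on the clock).
utc = local.astimezone(timezone.utc)

assert utc == datetime(2018, 1, 12, 0, 49, tzinfo=timezone.utc)
```

Values read back in a different session time zone would show the offset the poster observed, even though the stored instant is unchanged.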

Stopping a Spark Streaming Context gracefully

2017-11-07 Thread Bryan Jeffrey
. I looked in Jira and did not see an open issue. Is this a known problem? If not I'll open a bug. Regards, Bryan Jeffrey

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Bryan Cutler
entioned, you might want to follow SPARK-19071 , specifically https://issues.apache.org/jira/browse/SPARK-19357 which parallelizes model evaluation. Bryan On Tue, Sep 5, 2017 at 8:02 AM, Yanbo Liang <yblia...@gmail.com> wrote: > You are right, native Spark MLlib CrossValidation can't run *d

Re: about broadcast join of base table in spark sql

2017-06-30 Thread Bryan Jeffrey
Hello. If you want to allow broadcast join with larger broadcasts you can set spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the plan to allow join despite 'A' being larger than the default threshold. From: paleyl Sent:

Re: Question about Parallel Stages in Spark

2017-06-27 Thread Bryan Jeffrey
A and B. Jeffrey's code did not cause two submits. --- Original --- From: "Pralabh Kumar" <pralabhku...@gmail.com> Date: 2017/6/27 12:09:27 To: "萝卜丝炒饭" <1427357...@qq.com>; Cc: "user" <user@spark.apache.org>; "satishl" <satish.la...@gmail.com>;

Re: Question about Parallel Stages in Spark

2017-06-26 Thread Bryan Jeffrey
Hello. The driver is running the individual operations in series, but each operation is parallelized internally. If you want them run in parallel you need to provide the driver a mechanism to thread the job scheduling out: val rdd1 = sc.parallelize(1 to 10) val rdd2 = sc.parallelize(1 to
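The threading pattern described above can be sketched in plain Python (the Spark snippet in the excerpt is truncated; `count_a`/`count_b` below are illustrative stand-ins for independent Spark actions such as `rdd1.count()` and `rdd2.count()`):

```python
from concurrent.futures import ThreadPoolExecutor

# Each function stands in for one independent Spark action. Submitting them
# from separate threads lets the driver schedule both jobs concurrently
# instead of running the actions one after another.
def count_a():
    return sum(1 for _ in range(10))

def count_b():
    return sum(1 for _ in range(20))

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(count_a)  # on a real driver: pool.submit(rdd1.count)
    f2 = pool.submit(count_b)
    results = (f1.result(), f2.result())

assert results == (10, 20)
```

Whether the jobs actually overlap then depends on the cluster having free executor slots and the scheduler configuration.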

Meetup in Taiwan

2017-06-25 Thread Yang Bryan
Hi, I'm Bryan, the co-founder of the Taiwan Spark User Group. We discuss and share information on https://www.facebook.com/groups/spark.tw/. We hold a physical meetup twice a month. Please help us get added to the official website. We will also hold a code competition about Spark; could we print the logo

Re: Broadcasts & Storage Memory

2017-06-21 Thread Bryan Jeffrey
of that storage memory fraction. Bryan Jeffrey On Wed, Jun 21, 2017 at 6:48 PM -0400, "satish lalam" <satish.la...@gmail.com> wrote: My understanding is - it from storageFraction. Here cached blocks are immune to eviction - so bot

Broadcasts & Storage Memory

2017-06-21 Thread Bryan Jeffrey
the storage memory for cached RDDs. You end up with executor memory that looks like the following: All memory: 0-100 Spark memory: 0-75 RDD Storage: 0-37 Other Spark: 38-75 Other Reserved: 76-100 Where do broadcast variables fall into the mix? Regards, Bryan Jeffrey

Re: how many topics spark streaming can handle

2017-06-19 Thread Bryan Jeffrey
Hello Ashok, We're consuming from more than 10 topics in some Spark streaming applications. Topic management is a concern (what is read from where, etc), but I have seen no issues from Spark itself. Regards, Bryan Jeffrey On Mon, Jun 19, 2017

Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Bryan Jeffrey
provides a lot of similar features, but the amount of typing required to set down a small function is excessive at best! Regards, Bryan Jeffrey On Wed, Jun 7, 2017 at 12:51 PM, Jörn Franke <jornfra...@gmail.com> wrote: > I think this is a religious question ;-) > Java is often un

Re: Is there a way to do conditional group by in spark 2.1.1?

2017-06-03 Thread Bryan Jeffrey
You should be able to project a new column that is your group column. Then you can group on the projected column. On Sat, Jun 3, 2017 at 6:26 PM -0400, "upendra 1991" wrote: Use a function Sent from Yahoo Mail on
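The "project a group column, then group on it" suggestion can be shown in plain-Python terms (the grouping rule and names below are illustrative, not from the thread):

```python
from collections import defaultdict

# Step 1: a derived group key per record -- the "projected column".
def group_key(row):
    # example conditional rule: bucket rows by the sign of their value
    return "nonneg" if row["value"] >= 0 else "neg"

rows = [{"value": 3}, {"value": -1}, {"value": 0}]

# Step 2: group on the projected key, not on the raw column.
groups = defaultdict(list)
for row in rows:
    groups[group_key(row)].append(row["value"])

assert dict(groups) == {"nonneg": [3, 0], "neg": [-1]}
```

In Spark SQL the equivalent would be adding the key with a conditional expression (e.g. `when`/`otherwise`) via `withColumn` and grouping on that column.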

Re: Convert camelCase to snake_case when saving Dataframe/Dataset to parquet?

2017-05-22 Thread Bryan Jeffrey
Mike, I have code to do that. I'll share it tomorrow. On Mon, May 22, 2017 at 4:53 PM -0400, "Mike Wheeler" wrote: Hi Spark User, For Scala case class, we usually use camelCase (carType) for member fields. However,

Re: Crossvalidator after fit

2017-05-05 Thread Bryan Cutler
Looks like there might be a problem with the way you specified your parameter values, probably you have an integer value where it should be a floating-point. Double check that and if there is still a problem please share the rest of your code so we can see how you defined "gridS". On Fri, May 5,

Re: pandas DF Dstream to Spark DF

2017-04-10 Thread Bryan Cutler
Hi Yogesh, It would be easier to help if you included your code and the exact error messages that occur. If you are creating a Spark DataFrame from a Pandas DataFrame, Spark does not read the Pandas schema; it infers one from the data. This might be the cause of your issue if the schema

Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-04 Thread bryan . jeffrey
Rdd operation: rdd.map(x => (word, count)).reduceByKey(_+_) On Sat, Mar 4, 2017 at 8:59 AM -0500, "Old-School" wrote: Hi, I want to perform some simple transformations and check the execution time, under
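The RDD expression in the reply (map each word to a pair, then reduceByKey over the counts) has this plain-Python equivalent, shown here only to make the two steps explicit:

```python
from collections import Counter

words = ["spark", "rdd", "spark"]

# rdd.map(x => (word, 1)): one (key, count) pair per word
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _): sum the counts per key
counts = Counter()
for word, count in pairs:
    counts[word] += count

assert dict(counts) == {"spark": 2, "rdd": 1}
```

The Spark version distributes both steps; the per-key merge function must be associative and commutative for reduceByKey to combine partial results correctly.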

Re: streaming-kafka-0-8-integration (direct approach) and monitoring

2017-02-14 Thread bryan . jeffrey
Mohammad, We store our offsets in Cassandra, and use that for tracking. This solved a few issues for us, as it provides a good persistence mechanism even when you're reading from multiple clusters. Bryan Jeffrey On Tue, Feb 14, 2017 at 7:03 PM

Re: How to specify "verbose GC" in Spark submit?

2017-02-06 Thread Bryan Jeffrey
ify --conf "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -verbose:gc", etc. Bryan Jeffrey On Mon, Feb 6, 2017 at 8:02 AM, Md. Rezaul Karim < rezaul.ka...@insight-centre.org> wrote: > Dear All, > > Is there any way to specify verbose GC -i.e. “-verbose:gc > -XX:

Re: Belief propagation algorithm is open sourced

2016-12-14 Thread Bryan Cutler
I'll check it out, thanks for sharing Alexander! On Dec 13, 2016 4:58 PM, "Ulanov, Alexander" wrote: > Dear Spark developers and users, > > > HPE has open sourced the implementation of the belief propagation (BP) > algorithm for Apache Spark, a popular message passing

Re: Streaming Batch Oddities

2016-12-13 Thread Bryan Jeffrey
All, Any thoughts? I can run another couple of experiments to try to narrow the problem. The total data volume in the repartition is around 60GB / batch. Regards, Bryan Jeffrey On Tue, Dec 13, 2016 at 12:11 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Hello. > > I

Streaming Batch Oddities

2016-12-13 Thread Bryan Jeffrey
which calls coalesce (shuffle=true), which creates a new ShuffledRDD with a HashPartitioner. These calls appear functionally equivalent - I am having trouble coming up with a justification for the significant performance differences between calls. Help? Regards, Bryan Jeffrey

Re: New to spark.

2016-09-28 Thread Bryan Cutler
Hi Anirudh, All types of contributions are welcome, from code to documentation. Please check out the page at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for some info, specifically keep a watch out for starter JIRAs here

Re: Master OOM in "master-rebuild-ui-thread" while running stream app

2016-09-13 Thread Bryan Cutler
It looks like you have logging enabled and your application event log is too large for the master to build a web UI from it. In Spark 1.6.2 and earlier, when an application completes, the master rebuilds a web UI to view events after the fact. This functionality was removed in Spark 2.0 and the

Re: Random Forest Classification

2016-08-31 Thread Bryan Cutler
gt; wrote: > Hi Bryan, > Thanks for the reply. > I am indexing 5 columns ,then using these indexed columns to generate the > "feature" column thru vector assembler. > Which essentially means that I cannot use *fit()* directly on > "completeDataset" dataframe sin

Re: Random Forest Classification

2016-08-30 Thread Bryan Cutler
.setMaxCategories(180) > > val decisionTree = new DecisionTreeClassifier(). > setMaxBins(300).setMaxDepth(1).setImpurity("entropy"). > setLabelCol("indexed_user_action").setFeaturesCol("indexedfeature"). > setPredictionCol("prediction

Re: Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Bryan Cutler
You will need to cast bestModel to include the MLWritable trait. The class Model does not mix it in by default. For instance: cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path") Alternatively, you could save the CV model directly, which takes care of this cvModel.save("/my/path") On

Re: Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Bryan Cutler
The algorithm update is just broken into 2 steps: trainOn - to learn/update the cluster centers, and predictOn - predicts cluster assignment on data The StreamingKMeansExample you reference breaks up data into training and test because you might want to score the predictions. If you don't care

Re: Spark 2.0 - JavaAFTSurvivalRegressionExample doesn't work

2016-07-28 Thread Bryan Cutler
That's the correct fix. I have this done along with a few other Java examples that still use the old MLlib Vectors in this PR thats waiting for review https://github.com/apache/spark/pull/14308 On Jul 28, 2016 5:14 AM, "Robert Goodman" wrote: > I changed import in the sample

Event Log Compression

2016-07-26 Thread Bryan Jeffrey
? Thank you, Bryan Jeffrey

Spark 2.0

2016-07-25 Thread Bryan Jeffrey
be willing to go fix it myself). Should I just create a ticket? Thank you, Bryan Jeffrey

Re: Programmatic use of UDFs from Java

2016-07-22 Thread Bryan Cutler
Everett, I had the same question today and came across this old thread. Not sure if there has been any more recent work to support this. http://apache-spark-developers-list.1001551.n3.nabble.com/Using-UDFs-in-Java-without-registration-td12497.html On Thu, Jul 21, 2016 at 10:10 AM, Everett

Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
ML has a DataFrame based API, while MLlib is RDDs and will be deprecated as of Spark 2.0. On Thu, Jul 21, 2016 at 10:41 PM, VG <vlin...@gmail.com> wrote: > Why do we have these 2 packages ... ml and mlib? > What is the difference in these > > > > On Fri, Jul 22, 2016 a

Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
Hi JG, If you didn't know this, Spark MLlib has 2 APIs, one of which uses DataFrames. Take a look at this example https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java This example uses a Dataset, which is

Re: spark-submit local and Akka startup timeouts

2016-07-19 Thread Bryan Cutler
you might have luck trying a more recent version of Spark, such as 1.6.2 or even 2.0.0 (soon to be released) which no longer uses Akka and the ActorSystem. Hope that helps! On Tue, Jul 19, 2016 at 2:29 AM, Rory Waite <rwa...@sdl.com> wrote: > Sorry Bryan, I should have mentioned tha

Re: spark-submit local and Akka startup timeouts

2016-07-18 Thread Bryan Cutler
Hi Rory, for starters what version of Spark are you using? I believe that in a 1.5.? release (I don't know which one off the top of my head) there was an addition that would also display the config property when a timeout happened. That might help some if you are able to upgrade. On Jul 18,

Re: Random Forest Classification

2016-07-08 Thread Bryan Cutler
// Set maxCategories so features with > 4 distinct values are treated as continuous. val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(digits) Hope that helps! On Fri, Jul 1, 2016 at 9:24 AM, Rich Tar

Re: ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread Bryan Cutler
Can you try running the example like this ./bin/run-example sql.RDDRelation I know there are some jars in the example folders, and running them this way adds them to the classpath On Jul 7, 2016 3:47 AM, "kevin" wrote: > hi,all: > I build spark use: > >

Re: Set the node the spark driver will be started

2016-06-29 Thread Bryan Cutler
Hi Felix, I think the problem you are describing has been fixed in later versions, check out this JIRA https://issues.apache.org/jira/browse/SPARK-13803 On Wed, Jun 29, 2016 at 9:27 AM, Mich Talebzadeh wrote: > Fine. in standalone mode spark uses its own scheduling

Re: Random Forest Classification

2016-06-28 Thread Bryan Cutler
Are you fitting the VectorIndexer to the entire data set and not just training or test data? If you are able to post your code and some data to reproduce, that would help in troubleshooting. On Tue, Jun 28, 2016 at 4:40 PM, Rich Tarro wrote: > Thanks for the response, but

Re: Random Forest Classification

2016-06-28 Thread Bryan Cutler
The problem might be that you are evaluating with "predictionLabel" instead of "prediction", where predictionLabel is the prediction index mapped to the original label strings - at least according to the RandomForestClassifierExample, not sure if your code is exactly the same. On Tue, Jun 28,

Re: LogisticRegression.scala ERROR, require(Predef.scala)

2016-06-23 Thread Bryan Cutler
The stack trace you provided seems to hint that you are calling "predict" on an RDD with Vectors that are not the same size as the number of features in your trained model, they should be equal. If that's not the issue, it would be easier to troubleshoot if you could share your code and possibly

Re: Kafka Exceptions

2016-06-13 Thread Bryan Jeffrey
Cody, We already set the maxRetries. We're still seeing the issue - when the leader is shifted, for example, it does not appear that the direct stream reader handles this correctly. We're running 1.6.1. Bryan Jeffrey On Mon, Jun 13, 2016 at 10:37 AM, Cody Koeninger <c...@koeninger.org> wrote:

Kafka Exceptions

2016-06-13 Thread Bryan Jeffrey
you, Bryan Jeffrey

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
All, Thank you for the replies. It seems as though the Dataset API is still far behind the RDD API. This is unfortunate as the Dataset API potentially provides a number of performance benefits. I will move to using it in a more limited set of cases for the moment. Thank you! Bryan Jeffrey

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
It would also be nice if there was a better example of joining two Datasets. I am looking at the documentation here: http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems a little bit sparse - is there a better documentation source? Regards, Bryan Jeffrey On Tue, Jun 7, 2016

Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
Hello. I am looking at the option of moving RDD based operations to Dataset based operations. We are calling 'reduceByKey' on some pair RDDs we have. What would the equivalent be in the Dataset interface - I do not see a simple reduceByKey replacement. Regards, Bryan Jeffrey

Re: Specify node where driver should run

2016-06-06 Thread Bryan Cutler
plication master should run in the yarn > conf? I haven't found any useful information regarding that. > > Thanks. > > On Mon, Jun 6, 2016 at 4:52 PM, Bryan Cutler <cutl...@gmail.com> wrote: > >> In that mode, it will run on the application master, whichever node th

Re: Specify node where driver should run

2016-06-06 Thread Bryan Cutler
In that mode, it will run on the application master, whichever node that is as specified in your yarn conf. On Jun 5, 2016 4:54 PM, "Saiph Kappa" wrote: > Hi, > > In yarn-cluster mode, is there any way to specify on which node I want the > driver to run? > > Thanks. >

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-29 Thread Bryan Cutler
This is currently being worked on, planned for 2.1 I believe https://issues.apache.org/jira/browse/SPARK-7159 On May 28, 2016 9:31 PM, "Stephen Boesch" wrote: > Thanks Phuong But the point of my post is how to achieve without using > the deprecated the mllib pacakge. The

Streaming application slows over time

2016-05-09 Thread Bryan Jeffrey
for or known bugs in similar instances? Regards, Bryan Jeffrey

Issues with Long Running Streaming Application

2016-04-25 Thread Bryan Jeffrey
, Bryan Jeffrey

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Bryan Jeffrey
ns = kafkaWritePartitions) detectionWriter.write(dataToWriteToKafka) Hope that helps! Bryan Jeffrey On Thu, Apr 21, 2016 at 2:08 PM, Alexander Gallego <agall...@concord.io> wrote: > Thanks Ted. > > KafkaWordCount (producer) does not operate on a DStream[T

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Bryan Jeffrey
Cody et al., I am seeing a similar error. I've increased the number of retries. Once I've got a job up and running I'm seeing it retry correctly. However, I am having trouble getting the job started - the number of retries does not seem to help with startup behavior. Thoughts? Regards, Bryan