Joining large data sets

2015-10-26 Thread Bryan
(broadcasting the smaller set). For joining two large datasets, it would seem to be better to repartition both sets in the same way, then join each partition. Is there a suggested practice for this problem? Thank you, Bryan Jeffrey
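A minimal Scala sketch of the co-partitioning approach described above; the paths, key extraction, and partition count are illustrative assumptions:

    import org.apache.spark.{HashPartitioner, SparkContext}

    def copartitionedJoin(sc: SparkContext): Unit = {
      // Use one partitioner for both sides so the join is partition-local.
      val partitioner = new HashPartitioner(200)

      val left = sc.textFile("hdfs:///data/left")
        .map(line => (line.split(",")(0), line))
        .partitionBy(partitioner)

      val right = sc.textFile("hdfs:///data/right")
        .map(line => (line.split(",")(0), line))
        .partitionBy(partitioner)

      // Both RDDs now share a partitioner, so join() reuses it rather than
      // shuffling both sides again.
      left.join(right).saveAsTextFile("hdfs:///data/joined")
    }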

RE: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan
Jerry, Thank you for the note. It sounds like you were able to get further than I have been - any insight? Is it just a Spark 1.4.1 vs. Spark 1.5 difference? Regards, Bryan Jeffrey -Original Message- From: "Jerry Lam" <chiling...@gmail.com> Sent: 10/28/2015 6:29 PM To: "Bryan

RE: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan
storage location for the data. That seems very hacky though, and likely to result in maintenance issues. Regards, Bryan Jeffrey -Original Message- From: "Yana Kadiyska" <yana.kadiy...@gmail.com> Sent: 10/28/2015 8:32 PM To: "Bryan Jeffrey" <bryan.jeff..

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-10 Thread Bryan
Anyone have thoughts or a similar use-case for SparkSQL / Cassandra? Regards, Bryan Jeffrey -Original Message- From: "Bryan Jeffrey" <bryan.jeff...@gmail.com> Sent: 11/4/2015 11:16 AM To: "user" <user@spark.apache.org> Subject: Cassandra via SparkSQL

RE: Problems with Local Checkpoints

2015-09-14 Thread Bryan
Akhil, This looks like the issue. I'll update my path to include the (soon to be added) winutils & assoc. DLLs. Thank you, Bryan -Original Message- From: "Akhil Das" <ak...@sigmoidanalytics.com> Sent: 9/14/2015 6:46 AM To: "Bryan Jeffrey" <bryan.je

RE: Yarn Shutting Down Spark Processing

2015-09-23 Thread Bryan
Tathagata, Simple batch jobs do work. The cluster has a good set of resources and a limited input volume on the given Kafka topic. The job works on the small 3-node standalone-configured cluster I have setup for test. Regards, Bryan Jeffrey -Original Message- From: "Tathagat

RE: Yarn Shutting Down Spark Processing

2015-09-23 Thread Bryan
Also - I double checked - we're setting the master to "yarn-cluster" -Original Message- From: "Tathagata Das" <t...@databricks.com> Sent: 9/23/2015 2:38 PM To: "Bryan" <bryan.jeff...@gmail.com> Cc: "user" <user@spark.apache.org>

RE: Yarn Shutting Down Spark Processing

2015-09-23 Thread Bryan
Marcelo, The error below is from the application logs. The spark streaming context is initialized and actively processing data when yarn claims that the context is not initialized. There are a number of errors, but they're all associated with the ssc shutting down. Regards, Bryan Jeffrey

Kafka Latency

2015-12-21 Thread Bryan
hops) the throughput decreases significantly, causing job delays. Is this typical? Have others encountered similar issues? Is there a Kafka configuration that might mitigate this issue? Regards, Bryan Jeffrey Sent from Outlook Mail for Windows 10 phone

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan
Vivek, https://spark.apache.org/docs/1.5.2/streaming-kafka-integration.html The map is the per-topic number of partitions to consume. Is numThreads below equal to the number of partitions in your topic? Regards, Bryan Jeffrey Sent from Outlook Mail for Windows 10 phone From: vivek.meghanat

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan
Both jobs use the same group name – is that a problem? val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap - Number of threads used here is 1 val searches = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(line => parse(line._2).extract[Search]) Regards,
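A hedged sketch of the receiver-based reader under discussion (Spark 1.5-era API); the hosts, group, topic name, and thread count are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaRead")
    val ssc = new StreamingContext(conf, Seconds(10))

    // topic -> consumer thread count; to drain every partition of the topic,
    // this generally needs to match the topic's partition count.
    val topicMap = Map("searches" -> 3)

    // Distinct jobs should use distinct group names, or the two consumers
    // will split the topic's messages between them.
    val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "search-group", topicMap).map(_._2)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()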

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-26 Thread Bryan
the message is always reaching Kafka (checked through the console consumer). Regards Vivek Sent using CloudMagic Email On Sat, Dec 26, 2015 at 2:42 am, Bryan <bryan.jeff...@gmail.com> wrote: Agreed. I did not see that they were using the same group name.   Sent from Outlook Mail for Windows 10

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-24 Thread Bryan
while missing data from other partitions. Regards, Bryan Jeffrey Sent from Outlook Mail for Windows 10 phone From: vivek.meghanat...@wipro.com Sent: Thursday, December 24, 2015 5:22 AM To: user@spark.apache.org Subject: Spark Streaming + Kafka + scala job message read issue Hi All, We

RE: Initial State

2015-11-22 Thread Bryan
. Is there an alternative method to initialize state? InputQueueStream joined to window would seem to work, but InputQueueStream does not allow checkpointing Sent from Outlook Mail From: Tathagata Das Sent: Sunday, November 22, 2015 8:01 PM To: Bryan Cc: user Subject: Re: Initial State

RE: DateTime Support - Hive Parquet

2015-11-24 Thread Bryan
? Regards, Bryan Jeffrey Sent from Outlook Mail From: Cheng Lian Sent: Tuesday, November 24, 2015 6:49 AM To: Bryan;user Subject: Re: DateTime Support - Hive Parquet I see, then this is actually irrelevant to Parquet. I guess we can support Joda DateTime in Spark SQL reflective schema inference

RE: DateTime Support - Hive Parquet

2015-11-24 Thread Bryan
Cheng, That’s exactly what I was hoping for – native support for writing DateTime objects. As it stands Spark 1.5.2 seems to leave no option but to do manual conversion (to nanos, Timestamp, etc) prior to writing records to hive. Regards, Bryan Jeffrey Sent from Outlook Mail From: Cheng
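A minimal sketch of that manual conversion: map the Joda DateTime to java.sql.Timestamp (which Spark SQL can write) before building the DataFrame. The case classes are illustrative:

    import java.sql.Timestamp
    import org.joda.time.DateTime

    case class Event(id: Long, created: DateTime)       // not directly writable
    case class EventRow(id: Long, created: Timestamp)   // Spark SQL-compatible

    // Convert before createDataFrame / writing to Hive.
    def toRow(e: Event): EventRow = EventRow(e.id, new Timestamp(e.created.getMillis))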

Re: Timeout Error

2015-04-26 Thread Bryan Cutler
I'm not sure what the expected performance should be for this amount of data, but you could try to increase the timeout with the property spark.akka.timeout to see if that helps. Bryan On Sun, Apr 26, 2015 at 6:57 AM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello All, I'm trying
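For reference, a brief sketch of setting that property when building the context; the value is an illustrative number of seconds:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("MyJob").set("spark.akka.timeout", "300")
    val sc = new SparkContext(conf)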

Re: Difference between RandomForestModel and RandomForestClassificationModel

2015-07-30 Thread Bryan Cutler
Hi Praveen, In MLLib, the major difference is that RandomForestClassificationModel makes use of a newer API which utilizes ML pipelines. I can't say for certain if they will produce the same exact result for a given dataset, but I believe they should. Bryan On Wed, Jul 29, 2015 at 12:14 PM

Hive Version

2015-10-28 Thread Bryan Jeffrey
of the Spark documentation, but do not see version specified anywhere - it would be a good addition. Thank you, Bryan Jeffrey

Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
accessible (and not partitioned). Is there a straightforward way to write to partitioned tables using Spark SQL? I understand that the read performance for partitioned data is far better - are there other performance improvements that might be better to use instead of partitioning? Regards, Bryan Jeffrey

Spark SQL Persistent Table - joda DateTime Compatability

2015-10-27 Thread Bryan Jeffrey
How is writing a Joda DateTime to a persistent Hive table accomplished? Has anyone else run into the same issue? Regards, Bryan Jeffrey

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
issues (3) When partitioning without maps I see frequent out of memory issues I'll update this email when I've got a more concrete example of problems. Regards, Bryan Jeffrey On Wed, Oct 28, 2015 at 1:33 PM, Susan Zhang <suchenz...@gmail.com> wrote: > Have you tried partitionBy? >
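A hedged sketch of the partitionBy suggestion (Spark 1.5-era DataFrame writer); the column and table names are illustrative, and df is assumed to contain the partition column:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def writePartitioned(df: DataFrame): Unit =
      df.write
        .format("parquet")
        .mode(SaveMode.Append)
        .partitionBy("eventDate")   // written as a directory-level partition column
        .saveAsTable("events_partitioned")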

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
every time. Is this a known issue? Is there a workaround? Regards, Bryan Jeffrey On Wed, Oct 28, 2015 at 3:13 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Susan, > > I did give that a shot -- I'm seeing a number of oddities: > > (1) 'Partition By' appears to only accept

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Bryan Jeffrey
MetadataTypedColumnsetSerDe | | InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat | | OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat | This seems like a pretty big bug associated with persistent tables. Am I missing a step somewhere? Thank you, Bryan Jeffrey On Wed, Oct 28, 2015 at 4:10

Re: Error Compiling Spark 1.4.1 w/ Scala 2.11 & Hive Support

2015-10-26 Thread Bryan Jeffrey
All, The error resolved to a bad version of jline pulled from Maven. The jline version is defined as 'scala.version' -- the 2.11 version does not exist in Maven. Instead the following dependency should be used:

    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>jline</artifactId>
      <version>2.11.0-M3</version>
    </dependency>

Regards, Bryan Jeffrey

Error Compiling Spark 1.4.1 w/ Scala 2.11 & Hive Support

2015-10-26 Thread Bryan Jeffrey
All, I'm seeing the following error compiling Spark 1.4.1 w/ Scala 2.11 & Hive support. Any ideas? mvn -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive -Phive-thriftserver package [INFO] Spark Project Parent POM .. SUCCESS [4.124s] [INFO] Spark Launcher

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
master spark://10.0.0.4:7077 --packages com.datastax.spark:spark-cassandra-connector_2.11:1.5.0-M1 --hiveconf "spark.cores.max=2" --hiveconf "spark.executor.memory=2g" Do I perhaps need to include an additional library to do the default conversion? Regards, Bryan Jeffrey On Th

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Yes, I do - I found your example of doing that later in your slides. Thank you for your help! On Thu, Nov 12, 2015 at 12:20 PM, Mohammed Guller <moham...@glassbeam.com> wrote: > Did you mean Hive or Spark SQL JDBC/ODBC server? > > > > Mohammed > > > > *From:*

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
OPTIONS ( keyspace "c2", table "detectionresult" ); Error: java.io.IOException: Failed to open native connection to Cassandra at {10.0.0.4}:9042 (state=,code=0) This seems to connect to localhost regardless of the value I set spark.cassandra.connection.host to. Regards,

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Mohammed, That is great. It looks like a perfect scenario. Would I be able to make the created DF queryable over the Hive JDBC/ODBC server? Regards, Bryan Jeffrey On Wed, Nov 11, 2015 at 9:34 PM, Mohammed Guller <moham...@glassbeam.com> wrote: > Short answer: yes. > > > >

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Answer: In beeline run the following: SET spark.cassandra.connection.host="10.0.0.10" On Thu, Nov 12, 2015 at 1:13 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Mohammed, > > While you're willing to answer questions, is there a trick to getting the > H

SparkSQL implicit conversion on insert

2015-11-02 Thread Bryan Jeffrey
conversion prior to insertion? Regards, Bryan Jeffrey

Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Bryan Jeffrey
SparkConf().set("spark.driver.allowMultipleContexts", "true").setAppName(appName).setMaster(master) new StreamingContext(conf, Seconds(seconds)) } Regards, Bryan Jeffrey On Wed, Nov 4, 2015 at 9:49 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Are you trying to sp
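The snippet above, completed as a small test helper (appName and master are supplied by the test harness):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createStreamingContext(appName: String, master: String, batchSeconds: Long): StreamingContext = {
      // allowMultipleContexts relaxes the one-context-per-JVM check, which is
      // mainly useful so unit tests can create and stop contexts independently.
      val conf = new SparkConf()
        .set("spark.driver.allowMultipleContexts", "true")
        .setAppName(appName)
        .setMaster(master)
      new StreamingContext(conf, Seconds(batchSeconds))
    }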

Spark Dynamic Partitioning Bug

2015-11-05 Thread Bryan Jeffrey
the manually calculated fields are correct. However, the dynamically calculated (string) partition for idAndSource is a random field from within my case class. I've duplicated this with several other classes and have seen the same result (I use this example because it's very simple). Any idea if this is a known bug? Is there a workaround? Regards, Bryan Jeffrey

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-30 Thread Bryan Jeffrey
Deenar, This worked perfectly - I moved to SQL Server and things are working well. Regards, Bryan Jeffrey On Thu, Oct 29, 2015 at 8:14 AM, Deenar Toraskar <deenar.toras...@gmail.com> wrote: > Hi Bryan > > For your use case you don't need to have multiple metastores. The defa

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Bryan Jeffrey
Looking at the method calls, the function that is called appears to be the same. I was hoping an example might shed some light on the issue. Regards, Bryan Jeffrey On Thu, Oct 8, 2015 at 7:04 AM, Aniket Bhatnagar <aniket.bhatna...@gmail.com > wrote: > Here is an example: > > val

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Bryan Jeffrey
= initialRDD) > counts.print() > > Thanks, > Aniket > > > On Thu, Oct 8, 2015 at 5:48 PM Bryan Jeffrey <bryan.jeff...@gmail.com> > wrote: > >> Aniket, >> >> Thank you for the example - but that's not quite what I'm looking for. >
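A minimal sketch of the overload discussed in this thread (Spark 1.3+): updateStateByKey seeded with an initial RDD. The stream and seed data are illustrative:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    def seededCounts(ssc: StreamingContext, words: DStream[String]): DStream[(String, Int)] = {
      // Keys not yet seen in the stream start from these state values.
      val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 1), ("world", 1)))

      val updateFunc = (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0))

      // The initial-RDD overload also takes an explicit partitioner.
      words.map((_, 1)).updateStateByKey(
        updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), initialRDD)
    }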

Java vs. Scala for Spark

2015-09-08 Thread Bryan Jeffrey
for Spark dev in an enterprise environment? What was the outcome? Regards, Bryan Jeffrey

Re: Java vs. Scala for Spark

2015-09-08 Thread Bryan Jeffrey
Thank you for the quick responses. It's useful to have some insight from folks already extensively using Spark. Regards, Bryan Jeffrey On Tue, Sep 8, 2015 at 10:28 AM, Sean Owen <so...@cloudera.com> wrote: > Why would Scala vs Java performance be different Ted? Relatively &

Getting Started with Spark

2015-09-08 Thread Bryan Jeffrey
Hello. We're getting started with Spark Streaming. We're working to build some unit/acceptance testing around functions that consume DStreams. The current method for creating DStreams is to populate the data by creating an InputDStream: val input = Array(TestDataFactory.CreateEvent(123
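One common way to feed known data into a DStream under test is ssc.queueStream; a hedged sketch, with TestEvent standing in for the factory-created events above:

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.StreamingContext

    case class TestEvent(id: Int)

    def buildTestStream(ssc: StreamingContext): Unit = {
      val queue = mutable.Queue[RDD[TestEvent]]()
      queue += ssc.sparkContext.parallelize(Seq(TestEvent(123), TestEvent(456)))

      // Each queued RDD is consumed as one batch (oneAtATime is true by default).
      val input = ssc.queueStream(queue)
      input.count().print()
    }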

Suggested Method for Execution of Periodic Actions

2015-09-16 Thread Bryan Jeffrey
counts to a database. Is there a built in mechanism or established pattern to execute periodic jobs in spark streaming? Regards, Bryan Jeffrey

Re: Weird worker usage

2015-09-28 Thread Bryan Jeffrey
Nukunj, No, I'm not calling set w/ master at all. This ended up being a foolish configuration problem with my slaves file. Regards, Bryan Jeffrey On Fri, Sep 25, 2015 at 11:20 PM, N B <nb.nos...@gmail.com> wrote: > Bryan, > > By any chance, are you calling SparkConf.s

Problems with Local Checkpoints

2015-09-09 Thread Bryan Jeffrey
seen something similar? Regards, Bryan Jeffrey

Yarn Shutting Down Spark Processing

2015-09-22 Thread Bryan Jeffrey
need to change to allow Yarn to initialize Spark streaming (vs. batch) jobs? Thank you, Bryan Jeffrey

Re: Weird worker usage

2015-09-25 Thread Bryan Jeffrey
parkcheckpoint --broker kafkaBroker:9092 --topic test --numStreams 9 --threadParallelism 9 Even when I put a long-running job in the queue, none of the other nodes are anything but idle. Am I missing something obvious? Regards, Bryan Jeffrey On Fri, Sep 25, 2015 at 8:28 AM, Akhil Das <a

Re: Weird worker usage

2015-09-25 Thread Bryan Jeffrey
45 INFO SparkContext: Running Spark version 1.4.1 15/09/25 16:45:45 INFO SparkContext: Spark configuration: spark.app.name=MainClass spark.default.parallelism=6 spark.driver.supervise=true spark.jars=file:/tmp/OinkSpark-1.0-SNAPSHOT-jar-with-dependencies.jar spark.logConf=true spark.master=local[*] spark.rpc.askTimeout=10 spark.streaming.receiver.maxRate=500 As you can see, despite -Dmaster=spark://sparkserver:7077, the streaming context still registers the master as local[*]. Any idea why? Thank you, Bryan Jeffrey

Re: SparkStreaming variable scope

2015-12-09 Thread Bryan Cutler
rowid from your code is a variable in the driver, so it will be evaluated once and then only the value is sent to words.map. You probably want to have rowid be a lambda itself, so that it will get the value at the time it is evaluated. For example if I have the following: >>> data =

Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Bryan Cutler
collect().toString()); total += rdd.count(); } } MyFunc f = new MyFunc(); inputStream.foreachRDD(f); // f.total will have the count of all RDDs Hope that helps some! -bryan On Wed, Dec 16, 2015 at 8:37 AM, Bryan Cutler <cutl...@gmail.com> wrote: > Hi Andy, > >

Re: ALS mllib.recommendation vs ml.recommendation

2015-12-15 Thread Bryan Cutler
Hi Roberto, 1. How do they differ in terms of performance? They both use alternating least squares matrix factorization, the main difference is ml.recommendation.ALS uses DataFrames as input which has built-in optimizations and should give better performance 2. Am I correct to assume
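A brief sketch of the DataFrame-based API referenced above; ratings is assumed to be a DataFrame with user, item, and rating columns:

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.sql.DataFrame

    def train(ratings: DataFrame) = {
      val als = new ALS()
        .setUserCol("user")
        .setItemCol("item")
        .setRatingCol("rating")
        .setRank(10)
        .setMaxIter(10)
      als.fit(ratings)   // returns an ALSModel; transform() adds a "prediction" column
    }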

Re: Hive error after update from 1.4.1 to 1.5.2

2015-12-16 Thread Bryan Jeffrey
I had a bunch of library dependencies that were still using Scala 2.10 versions. I updated them to 2.11 and everything has worked fine since. On Wed, Dec 16, 2015 at 3:12 AM, Ashwin Sai Shankar <ashan...@netflix.com> wrote: > Hi Bryan, > I see the same issue with 1.5.2, can you pls

Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Bryan Cutler
Hi Andy, Regarding the foreachRDD return value, this JIRA, which will be in 1.6, should take care of that (https://issues.apache.org/jira/browse/SPARK-4557) and make things a little simpler. On Dec 15, 2015 6:55 PM, "Andy Davidson" wrote: > I am writing a JUnit test

Re: SparkContext SyntaxError: invalid syntax

2016-01-08 Thread Bryan Cutler
Hi Andrew, I know that older versions of Spark could not run PySpark on YARN in cluster mode. I'm not sure if that is fixed in 1.6.0 though. Can you try setting deploy-mode option to "client" when calling spark-submit? Bryan On Thu, Jan 7, 2016 at 2:39 PM, weineran <

Re: error writing to stdout

2016-01-06 Thread Bryan Cutler
This is a known issue https://issues.apache.org/jira/browse/SPARK-9844. As Noorul said, it is probably safe to ignore as the executor process is already destroyed at this point. On Mon, Dec 21, 2015 at 8:54 PM, Noorul Islam K M wrote: > carlilek

Re: Working with RDD from Java

2015-11-17 Thread Bryan Cutler
/scala/org/apache/spark/mllib/clustering/LDAModel.scala#L350 -bryan On Tue, Nov 17, 2015 at 3:06 AM, frula00 <i...@crossing-technologies.com> wrote: > Hi, > I'm working in Java, with Spark 1.3.1 - I am trying to extract data from > the

Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Hello. I'm seeing an error creating a Hive Context moving from Spark 1.4.1 to 1.5.2. Has anyone seen this issue? I'm invoking the following: new HiveContext(sc) // sc is a Spark Context I am seeing the following error: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding

Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
The 1.5.2 Spark was compiled using the following options: mvn -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Phive -Phive-thriftserver clean package Regards, Bryan Jeffrey On Fri, Nov 20, 2015 at 2:13 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Hello. > > I'm

Re: Hive error after update from 1.4.1 to 1.5.2

2015-11-20 Thread Bryan Jeffrey
Nevermind. I had a library dependency that still had the old Spark version. On Fri, Nov 20, 2015 at 2:14 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > The 1.5.2 Spark was compiled using the following options: mvn > -Dhadoop.version=2.6.1 -Dscala-2.11 -DskipTests -Pyarn -Ph

DateTime Support - Hive Parquet

2015-11-23 Thread Bryan Jeffrey
with 1.5.2 - however, I am still seeing the associated errors. Is there a bug I can follow to determine when DateTime will be supported for Parquet? Regards, Bryan Jeffrey

Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
Hello. I am looking at the option of moving RDD based operations to Dataset based operations. We are calling 'reduceByKey' on some pair RDDs we have. What would the equivalent be in the Dataset interface - I do not see a simple reduceByKey replacement. Regards, Bryan Jeffrey
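For reference, the closest Dataset equivalent is groupByKey followed by reduceGroups; a minimal sketch (Spark 2.0) with illustrative data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DatasetReduce").getOrCreate()
    import spark.implicits._

    val pairs = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

    // groupByKey + reduceGroups plays the role of RDD reduceByKey.
    val reduced = pairs
      .groupByKey(_._1)
      .reduceGroups((x, y) => (x._1, x._2 + y._2))
      .map { case (key, (_, sum)) => (key, sum) }

    reduced.show()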

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
It would also be nice if there was a better example of joining two Datasets. I am looking at the documentation here: http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems a little bit sparse - is there a better documentation source? Regards, Bryan Jeffrey On Tue, Jun 7, 2016

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
All, Thank you for the replies. It seems as though the Dataset API is still far behind the RDD API. This is unfortunate as the Dataset API potentially provides a number of performance benefits. I will move to using it in a more limited set of cases for the moment. Thank you! Bryan Jeffrey

Re: Specify node where driver should run

2016-06-06 Thread Bryan Cutler
In that mode, it will run on the application master, whichever node that is, as specified in your YARN conf. On Jun 5, 2016 4:54 PM, "Saiph Kappa" wrote: > Hi, > > In yarn-cluster mode, is there any way to specify on which node I want the > driver to run? > > Thanks. >

Re: Specify node where driver should run

2016-06-06 Thread Bryan Cutler
plication master should run in the yarn > conf? I haven't found any useful information regarding that. > > Thanks. > > On Mon, Jun 6, 2016 at 4:52 PM, Bryan Cutler <cutl...@gmail.com> wrote: > >> In that mode, it will run on the application master, whichever node th

Re: Kafka Exceptions

2016-06-13 Thread Bryan Jeffrey
Cody, We already set the maxRetries. We're still seeing the issue - when the leader is shifted, for example, it does not appear that the direct stream reader correctly handles this. We're running 1.6.1. Bryan Jeffrey On Mon, Jun 13, 2016 at 10:37 AM, Cody Koeninger <c...@koeninger.org> wrote:

Kafka Exceptions

2016-06-13 Thread Bryan Jeffrey
you, Bryan Jeffrey

Re: Multinomial regression with spark.ml version of LogisticRegression

2016-05-29 Thread Bryan Cutler
This is currently being worked on, planned for 2.1 I believe https://issues.apache.org/jira/browse/SPARK-7159 On May 28, 2016 9:31 PM, "Stephen Boesch" wrote: > Thanks Phuong But the point of my post is how to achieve without using > the deprecated the mllib pacakge. The

Re: LogisticRegression.scala ERROR, require(Predef.scala)

2016-06-23 Thread Bryan Cutler
The stack trace you provided seems to hint that you are calling "predict" on an RDD with Vectors that are not the same size as the number of features in your trained model, they should be equal. If that's not the issue, it would be easier to troubleshoot if you could share your code and possibly
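A hedged sketch of that check for the RDD-based API; the model type and filtering strategy are illustrative:

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    def safePredict(model: LogisticRegressionModel, data: RDD[Vector]): RDD[Double] = {
      val expected = model.numFeatures
      // Vectors whose size differs from the trained feature count are what
      // trigger the requirement failure in the stack trace.
      model.predict(data.filter(_.size == expected))
    }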

Re: SparkContext SyntaxError: invalid syntax

2016-01-13 Thread Bryan Cutler
spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10 That is a good sign that local jobs and Java examples work, probably just a small configuration issue :) Bryan On Wed, Jan 13, 2016 at

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Bryan Cutler
Glad you got it going! It wasn't very obvious what needed to be set; maybe it is worth explicitly stating this in the docs since it seems to have come up a couple times before too. Bryan On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner < andrewweiner2...@u.northwestern.edu> wrote: >

Re: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Bryan Cutler
If you are able to just train the RandomForestClassificationModel from ML directly instead of training the old model and converting, then that would be the way to go. On Thu, Jan 14, 2016 at 2:21 PM, <rachana.srivast...@thomsonreuters.com> wrote: > Thanks so much Bryan for your

Re: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Bryan Cutler
solved as part of this JIRA https://issues.apache.org/jira/browse/SPARK-12183 Bryan On Thu, Jan 14, 2016 at 8:12 AM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > Tried using 1.6 version of Spark that takes numberOfFeatures fifth > argument in the API but s

Re: Spark with .NET

2016-02-09 Thread Bryan Jeffrey
Arko, Check this out: https://github.com/Microsoft/SparkCLR This is a Microsoft authored C# language binding for Spark. Regards, Bryan Jeffrey On Tue, Feb 9, 2016 at 3:13 PM, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Doesn't seem to be supported, but

Re: Access batch statistics in Spark Streaming

2016-02-08 Thread Bryan Jeffrey
From within a Spark job you can use a Periodic Listener: ssc.addStreamingListener(PeriodicStatisticsListener(Seconds(60))) class PeriodicStatisticsListener(timePeriod: Duration) extends StreamingListener { private val logger = LoggerFactory.getLogger("Application") override def
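A completed sketch of that listener, with logging standing in for whatever periodic side effect (for example, a database write) the job needs:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class BatchStatsListener extends StreamingListener {
      // Fires once per completed batch with timing and record counts.
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(s"batch ${info.batchTime}: ${info.numRecords} records, " +
          s"processing delay ${info.processingDelay.getOrElse(-1L)} ms")
      }
    }

    // ssc.addStreamingListener(new BatchStatsListener())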

Error w/ Invertable ReduceByKeyAndWindow

2016-02-01 Thread Bryan Jeffrey
I am sure we're doing consistent hashing. The 'reduceAdd' function is adding to a map. The 'inverseReduceFunction' is subtracting from the map. The filter function is removing items where the number of entries in the map is zero. Has anyone seen this error before? Regards, Bryan Jeffrey
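A hedged sketch of the invertible form in question, with simple integer counts standing in for the map-valued state described above (checkpointing must be enabled for this overload):

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    def windowedCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
      pairs.reduceByKeyAndWindow(
        (a: Int, b: Int) => a + b,            // values entering the window
        (a: Int, b: Int) => a - b,            // must exactly undo the reduce for values leaving
        Seconds(300),
        Seconds(10),
        4,                                    // numPartitions
        (kv: (String, Int)) => kv._2 != 0)    // drop keys whose count returns to zero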

Re: Error w/ Invertable ReduceByKeyAndWindow

2016-02-01 Thread Bryan Jeffrey
Excuse me - I should have mentioned: I am running Spark 1.4.1, Scala 2.11. I am running in streaming mode receiving data from Kafka. Regards, Bryan Jeffrey On Mon, Feb 1, 2016 at 9:19 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Hello. > > I have a reduceByKeyAnd

Re: Spark Streaming - processing/transforming DStreams using a custom Receiver

2016-02-25 Thread Bryan Cutler
Using flatMap on a string will treat it as a sequence, which is why you are getting an RDD of char. I think you want to just do a map instead, like this: val timestamps = stream.map(event => event.getCreatedAt.toString) On Feb 25, 2016 8:27 AM, "Dominik Safaric" wrote:

Re: LDA topic Modeling spark + python

2016-02-29 Thread Bryan Cutler
ek.mis...@xerox.com> wrote: > Hello Bryan, > > > > Thank you for the update on Jira. I took your code and tried with mine. > But I get an error with the vector being created. Please see my code below > and suggest me. > > My input file has some conte

Re: LDA topic Modeling spark + python

2016-02-25 Thread Bryan Cutler
I'm not exactly sure how you would like to setup your LDA model, but I noticed there was no Python example for LDA in Spark. I created this issue to add it https://issues.apache.org/jira/browse/SPARK-13500. Keep an eye on this if it could be of help. bryan On Wed, Feb 24, 2016 at 8:34 PM

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread Bryan Cutler
Could you elaborate where the issue is? You say calling model.latestModel.clusterCenters.foreach(println) doesn't show an updated model, but that is just a single statement to print the centers once. Also, is there any reason you don't predict on the test data like this?

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread Bryan Cutler
Can you share more of your code to reproduce this issue? The model should be updated with each batch, but can't tell what is happening from what you posted so far. On Fri, Feb 19, 2016 at 10:40 AM, krishna ramachandran <ram...@s1776.com> wrote: > Hi Bryan > Agreed. It is a sing

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread Bryan Cutler
>> [4, 1, 3083.2778025] >> [2, 4, 6226.40232139] >> [1, 2, 785.84266] >> [5, 1, 6706.05424139] >> and monitor. Please let me know if I missed something. >> Krishna

Re: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Bryan Jeffrey
Prateek, I believe that one task is created per Cassandra partition. How is your data partitioned? Regards, Bryan Jeffrey On Thu, Mar 10, 2016 at 10:36 AM, Prateek . <prat...@aricent.com> wrote: > Hi, > > > > I have a Spark Batch job for reading timeseries data from

Re: Get output of the ALS algorithm.

2016-03-15 Thread Bryan Cutler
/scala/org/apache/spark/examples/mllib/RecommendationExample.scala#L62 On Fri, Mar 11, 2016 at 8:18 PM, Shishir Anshuman <shishiranshu...@gmail.com > wrote: > The model produced after training. > > On Fri, Mar 11, 2016 at 10:29 PM, Bryan Cutler <cutl...@gmail.com> wrote: > &

Re: Get output of the ALS algorithm.

2016-03-11 Thread Bryan Cutler
Are you trying to save predictions on a dataset to a file, or the model produced after training with ALS? On Thu, Mar 10, 2016 at 7:57 PM, Shishir Anshuman wrote: > hello, > > I am new to Apache Spark and would like to get the Recommendation output > of the ALS

Re: OOM Exception in my spark streaming application

2016-03-14 Thread Bryan Jeffrey
Steve & Adam, I would be interested in hearing the outcome here as well. I am seeing some similar issues in my 1.4.1 pipeline, using stateful functions (reduceByKeyAndWindow and updateStateByKey). Regards, Bryan Jeffrey On Mon, Mar 14, 2016 at 6:45 AM, Steve Loughran <ste...@hortonwo

Re: ERROR ArrayBuffer(java.nio.channels.ClosedChannelException

2016-03-19 Thread Bryan Jeffrey
Cody et al., I am seeing a similar error. I've increased the number of retries. Once I've got a job up and running I'm seeing it retry correctly. However, I am having trouble getting the job started - the number of retries does not seem to help with startup behavior. Thoughts? Regards, Bryan

Suggested Method to Write to Kafka

2016-03-01 Thread Bryan Jeffrey
Hello. Is there a suggested method and/or some example code to write results from a Spark streaming job back to Kafka? I'm using Scala and Spark 1.4.1. Regards, Bryan Jeffrey

Issues with Long Running Streaming Application

2016-04-25 Thread Bryan Jeffrey
, Bryan Jeffrey

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Bryan Jeffrey
ns = kafkaWritePartitions) detectionWriter.write(dataToWriteToKafka) Hope that helps! Bryan Jeffrey On Thu, Apr 21, 2016 at 2:08 PM, Alexander Gallego <agall...@concord.io> wrote: > Thanks Ted. > > KafkaWordCount (producer) does not operate on a DStream[T
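A hedged sketch of the usual pattern behind that snippet: KafkaProducer is not serializable, so hold it in a lazily initialized per-JVM singleton rather than capturing it in a closure. The broker and topic are illustrative:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    object KafkaSink {
      // Created once per executor JVM on first use; never shipped from the driver.
      lazy val producer: KafkaProducer[String, String] = {
        val props = new Properties()
        props.put("bootstrap.servers", "kafkaBroker:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        new KafkaProducer[String, String](props)
      }
    }

    def writeToKafka(results: DStream[String]): Unit =
      results.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          records.foreach(r => KafkaSink.producer.send(new ProducerRecord[String, String]("output-topic", r)))
        }
      }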

Streaming application slows over time

2016-05-09 Thread Bryan Jeffrey
for or known bugs in similar instances? Regards, Bryan Jeffrey

Spark 2.0

2016-07-25 Thread Bryan Jeffrey
be willing to go fix it myself). Should I just create a ticket? Thank you, Bryan Jeffrey

Re: Spark 2.0 - JavaAFTSurvivalRegressionExample doesn't work

2016-07-28 Thread Bryan Cutler
That's the correct fix. I have this done along with a few other Java examples that still use the old MLlib Vectors in this PR that's waiting for review: https://github.com/apache/spark/pull/14308 On Jul 28, 2016 5:14 AM, "Robert Goodman" wrote: > I changed import in the sample

Re: Programmatic use of UDFs from Java

2016-07-22 Thread Bryan Cutler
Everett, I had the same question today and came across this old thread. Not sure if there has been any more recent work to support this. http://apache-spark-developers-list.1001551.n3.nabble.com/Using-UDFs-in-Java-without-registration-td12497.html On Thu, Jul 21, 2016 at 10:10 AM, Everett

Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
ML has a DataFrame-based API, while MLlib is RDD-based and will be deprecated as of Spark 2.0. On Thu, Jul 21, 2016 at 10:41 PM, VG <vlin...@gmail.com> wrote: > Why do we have these 2 packages ... ml and mlib? > What is the difference in these > > > > On Fri, Jul 22, 2016 a

Re: MLlib, Java, and DataFrame

2016-07-21 Thread Bryan Cutler
Hi JG, If you didn't know this, Spark MLlib has 2 APIs, one of which uses DataFrames. Take a look at this example https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java This example uses a Dataset, which is

Event Log Compression

2016-07-26 Thread Bryan Jeffrey
? Thank you, Bryan Jeffrey

Re: Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Bryan Cutler
The algorithm update is just broken into 2 steps: trainOn - to learn/update the cluster centers, and predictOn - predicts cluster assignment on data The StreamingKMeansExample you reference breaks up data into training and test because you might want to score the predictions. If you don't care
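A minimal sketch of that two-step flow; the streams and dimensionality are illustrative:

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.streaming.dstream.DStream

    def cluster(trainingStream: DStream[Vector], testStream: DStream[Vector]): Unit = {
      val model = new StreamingKMeans()
        .setK(3)
        .setDecayFactor(1.0)
        .setRandomCenters(2, 0.0)   // dim = 2, initial weight = 0.0

      model.trainOn(trainingStream)          // update cluster centers each batch
      model.predictOn(testStream).print()    // assign clusters; omit if not scoring
    }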

Re: Grid Search using Spark MLLib Pipelines

2016-08-12 Thread Bryan Cutler
You will need to cast bestModel to include the MLWritable trait. The class Model does not mix it in by default. For instance: cvModel.bestModel.asInstanceOf[MLWritable].save("/my/path") Alternatively, you could save the CV model directly, which takes care of this cvModel.save("/my/path") On

Re: spark-submit local and Akka startup timeouts

2016-07-18 Thread Bryan Cutler
Hi Rory, for starters what version of Spark are you using? I believe that in a 1.5.? release (I don't know which one off the top of my head) there was an addition that would also display the config property when a timeout happened. That might help some if you are able to upgrade. On Jul 18,

Re: spark-submit local and Akka startup timeouts

2016-07-19 Thread Bryan Cutler
you might have luck trying a more recent version of Spark, such as 1.6.2 or even 2.0.0 (soon to be released) which no longer uses Akka and the ActorSystem. Hope that helps! On Tue, Jul 19, 2016 at 2:29 AM, Rory Waite <rwa...@sdl.com> wrote: > Sorry Bryan, I should have mentioned tha
