Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Chris Fregly
cloud-based or on-premise environment in minutes. we support aws, google cloud, and azure - basically anywhere that runs docker. any time zone works. we're completely global, with free 24x7 support for everyone in the community. thanks! hope this is useful. Chris Fregly Research Scientist @

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-18 Thread Chris Fregly
> use the Pivot Table feature of DataFrames, which is available since Spark 1.6. But the Spark on our current cluster is version 1.5. Can we install Spark 2.0 on the master node to work around this? Thanks!

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
aware of the conditions placed on the package. If you find that the general public are downloading such test packages, then remove them. > On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly <ch...@fregly.com> wrote: > this is a valid question.

Re: Spark 2.0.1 / 2.1.0 on Maven

2016-08-09 Thread Chris Fregly
2.0.0 release, specifically on Maven + Java, what steps should we take to upgrade? I can't find the newer versions on Maven Central. Thank you! Jestin

Re: ml models distribution

2016-07-22 Thread Chris Fregly
eant "Machine Learning". I thought is a >> > quite spread acronym. Sorry for the possible confusion. >> > >> > >> > -- >> > Sergio Fernández >> > Partner Technology Manager >> > Redlink GmbH >> > m: +43 6602747925 >> > e: sergio.fernan...@redlink.co >> > w: http://redlink.co >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> > -- *Chris Fregly* Research Scientist @ PipelineIO San Francisco, CA pipeline.io advancedspark.com

Re: [Spark 2.0.0] Structured Stream on Kafka

2016-06-14 Thread Chris Fregly
wrote: > Heya folks, > Just wondering if there is any doc regarding using Kafka directly from the reader.stream? Has it been integrated already (I mean the source)? > Sorry if the answer is RTFM (but

Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-12 Thread Chris Fregly
https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766 > On Sat, Jun 11, 2016 at 8:04 AM, Deepak Goel <deic...@gmail.com> wrote: > Hey, Namaskara~Nalama~Guten Tag~Bonjour. I am a newbie to Machine Learning (MLlib and other libraries on Spark). Which would be the best book to learn from? Thanks, Deepak

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Chris Fregly
jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar
jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar
jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar

at org.elasticsearch.hadoop.util.Version.<clinit>(Version.java:73)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

.. still tracking this down, but I was wondering if there is something obvious I'm doing wrong. I'm going to take out elasticsearch-hadoop-2.3.2.jar and try again. Lots of trial and error here :-/ Kevin

Re: GraphX Java API

2016-05-30 Thread Chris Fregly

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Chris Fregly
> that you start when you start the application with spark-shell or spark-submit on the host where the cluster manager is running. > The Driver node runs on the same host that the cluster manager

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Chris Fregly
24 hours you suggest). > On Tue, May 17, 2016 at 1:36 AM, Todd <bit1...@163.com> wrote: > Hi, we have a requirement to do count(distinct) in a processing batch against all the streaming data (e.g., the last 24 hours' data), that

Re: Any NLP lib could be used on spark?

2016-04-20 Thread Chris Fregly
this took me a bit to get working, but I finally got it up and running with the package that Burak pointed out. here are some relevant links to my project that should give you some clues:

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Chris Fregly
perhaps renaming to Spark ML would actually clear up code and documentation confusion? +1 for rename. > On Apr 5, 2016, at 7:00 PM, Reynold Xin wrote: > +1. This is a no-brainer IMO. > On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley

Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
with production-ready Kafka Streams, so I can try this out myself - and hopefully remove a lot of extra plumbing. On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly <ch...@fregly.com> wrote: > this is a very common pattern, yes. > > note that in Netflix's case, they're currently pushing al

Re: Spark for Log Analytics

2016-03-31 Thread Chris Fregly
and then sink the processed logs to both Elasticsearch and > Kafka, so that Spark Streaming can pick data from Kafka for the complex use cases, while Logstash filters can be used for the simpler use cases. > I was wondering if someone has already done this evaluation and could p

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
enough for analytical queries > that the OP wants; and MySQL is great but not scalable. Probably something like VectorWise, HANA, or Vertica would work well, but those are mostly not free solutions. Druid could work too if the use case is right. > Anyways, great discussion!

Re: [MLlib - ALS] Merging two Models?

2016-03-10 Thread Chris Fregly
Unfortunately, I'm fairly ignorant as to the internal mechanics of ALS > itself. Is what I'm asking possible? > Thank you, > Colin

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
currently tracked by the DAGScheduler, which > will be apportioning the Tasks from those concurrent Jobs across the > available Executor cores. > On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com> wrote: > Good stuff, Evan. Looks like this is utilizing t

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
http://apache-spark-user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html

Re: Using netlib-java in Spark 1.6 on linux

2016-03-04 Thread Chris Fregly
http://apache-spark-user-list.1001560.n3.nabble.com/Using-netlib-java-in-Spark-1-6-on-linux-tp26386p26392.html

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Chris Fregly
> Pardon the dumb thumb typos :) > On Feb 28, 2016, at 1:48 PM, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote: > Hi Gurus, > I'd appreciate it if you could recommend me a good book on Spark or documentatio

Re: Evaluating spark streaming use case

2016-02-21 Thread Chris Fregly
good catch on the cleaner.ttl @jatin- when you say "memory-persisted RDD", what do you mean exactly? and how much data are you talking about? remember that spark can evict these memory-persisted RDDs at any time. they can be recovered from Kafka, but this is not a good situation to be in.

Re: Communication between two spark streaming Job

2016-02-19 Thread Chris Fregly
if you need update notifications, you could introduce ZooKeeper (eek!) or a Kafka queue between the jobs. I've seen internal Kafka queues (relative to external spark streaming queues) used for this type of incremental update use case. think of the updates as transaction logs. > On Feb 19,

Re: Allowing parallelism in spark local mode

2016-02-12 Thread Chris Fregly
the time in a meaningful way. > Am I missing something obvious? > thanks, Yael

Re: Do we need to enabled Tungsten sort in Spark 1.6?

2016-01-08 Thread Chris Fregly
> it is Tungsten, no? Please guide. > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-we-need-to-enabled-Tungsten-sort-in-Spark-1-6-tp25923.html

Re: Date Time Regression as Feature

2016-01-08 Thread Chris Fregly
could not find any example of value prediction where the features had > dates in them. > Thanks > Jorge Machado

Re: how to extend java transformer from Scala UnaryTransformer ?

2016-01-02 Thread Chris Fregly
<List<String>, List<String>>() {
    public List<String> apply(List<String> words) {
        for (String word : words) {
            logger.error("AEDWIP input word: {}", word);
        }
        return words;
    }
};

return f;
}

@Override
public DataType outputDataType() {
    return DataTypes.createArrayType(DataTypes.StringType, true);
}
}
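for comparison, a minimal sketch of the same transformer written directly in Scala against UnaryTransformer (the class name, logging, and uid string here are made up; assumes the spark.ml developer API circa 1.5/1.6):

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

    class WordLogger(override val uid: String)
        extends UnaryTransformer[Seq[String], Seq[String], WordLogger] {

      def this() = this(Identifiable.randomUID("wordLogger"))

      // log each word, then pass the tokens through unchanged
      override protected def createTransformFunc: Seq[String] => Seq[String] = { words =>
        words.foreach(word => println(s"input word: $word"))
        words
      }

      override protected def outputDataType: DataType = ArrayType(StringType, true)

      override def copy(extra: ParamMap): WordLogger = defaultCopy(extra)
    }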

Re: How to specify the numFeatures in HashingTF

2016-01-02 Thread Chris Fregly
> Hi, there is a parameter in HashingTF called "numFeatures". I was wondering what is the best way to set the value of this parameter. In the use case of text categorization, do you need to know in advance the number of w
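a quick sketch with the spark.ml HashingTF (column names are hypothetical): you don't need the exact vocabulary size in advance - just pick a power of two comfortably larger than it to keep hash collisions rare:

    import org.apache.spark.ml.feature.HashingTF

    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1 << 18)  // 262,144 buckets; more buckets = fewer collisions, bigger vectors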

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-30 Thread Chris Fregly
> I have an RDD that has been processed and persisted to memory-only. I want to be able to run a count (actually "countApproxDistinct") after filtering by a value that is unknown at compile time (specified by the query). > I've experimented with
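a minimal sketch of that pattern (the field and variable names are hypothetical):

    // filter by the value supplied at query time, then approximate the distinct
    // count with the built-in HyperLogLog-based countApproxDistinct
    val approxCount = cachedRDD
      .filter(record => record.category == queryValue)
      .countApproxDistinct(relativeSD = 0.01)  // ~1% relative accuracy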

Re: Reply: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-30 Thread Chris Fregly
@Jim- I'm wondering if those docs are outdated, as it's my understanding (please correct me if I'm wrong) that we should never be seeing OOMs, since 1.5/Tungsten not only improved (reduced) the memory footprint of our data, but also introduced better task-level - and even key-level - external spilling

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
are the credentials visible from each Worker node to all the Executor JVMs on each Worker? > On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev wrote: > Dear Spark community, > I faced the following issue when trying to access data on S3a; my code is
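one way to make them visible everywhere, sketched under the assumption that you set credentials programmatically (bucket and path are made up):

    // settings on the SparkContext's Hadoop configuration are serialized out
    // to the executors, unlike env vars exported only on the driver node
    sc.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
    sc.hadoopConfiguration.set("fs.s3a.secret.key", secretAccessKey)

    val df = sqlContext.read.parquet("s3a://my-bucket/some/path/")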

Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
them? > Thank you, > Konstantin Kudryavtsev > On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly <ch...@fregly.com> wrote: > are the credentials visible from each Worker node to all the Executor JVMs on each Worker? > On Dec 30, 2015, at 12:45 PM, K

Re: Using Experminal Spark Features

2015-12-30 Thread Chris Fregly
Has anyone used > this approach yet, and if so, what has your experience been with using it? If > it helps, we'd be looking to implement it using Scala. Secondly, in general, > what has people's experience been with using experimental features in Spark? > Cheers,

Re: SparkSQL Hive orc snappy table

2015-12-30 Thread Chris Fregly
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:927)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:836)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:702)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I will be glad for any help on that matter. Regards, Dawid Wysakowicz

Re: Problem with WINDOW functions?

2015-12-29 Thread Chris Fregly
on quick glance, it appears that you're calling collect() in there, which is bringing a huge amount of data down to the single Driver. this is why, when you allocated more memory to the Driver, a different error emerged - most definitely related to stop-the-world GC causing the node to

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Chris Fregly
to use the new API > once it's available. > On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot <divya.htco...@gmail.com> wrote: > Hi, > I am a newbie to Spark and a bit confused about RDDs and DataFrames in > Spark.

Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Chris Fregly
which version of spark is this? is there any chance that a single key - or set of keys - has a large number of values relative to the other keys (aka skew)? if so, spark 1.5 *should* fix this issue with the new tungsten stuff, although I still had some issues with 1.5.1 in a similar

Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Chris Fregly
results.col("labelIndex"), results.col("prediction"), results.col("words"));

exploreDF.show(10

Re: fishing for help!

2015-12-25 Thread Chris Fregly
ff etc etc > On 21 December 2015 at 14:53, Eran Witkon <eranwit...@gmail.com> wrote: > Hi, > I know it is a wide question, but can you think of reasons why a pyspark > job which runs from server 1 usi

Re: Memory allocation for Broadcast values

2015-12-25 Thread Chris Fregly
every executor, or is > there something other than driver and executor memory I can size?

Re: error while defining custom schema in Spark 1.5.0

2015-12-25 Thread Chris Fregly
org.apache.spark.sql.types.StructType(fields: Seq[org.apache.spark.sql.types.StructField]) cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)

val customSchema = StructType( StructField("year", IntegerType, true), StructField("make", StringType, true), StructField("model", StringType, true), StructField("comment", StringType, true), StructField("blank", StringType, true), StructField("blank", StringType, true))

Would really appreciate it if somebody could share an example which works with Spark 1.4 or Spark 1.5.0. Thanks, Divya
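the compile error says StructType's constructor takes a single Seq of fields, not bare varargs - wrapping the fields in a Seq (or Array) should fix it; a sketch:

    import org.apache.spark.sql.types._

    val customSchema = StructType(Seq(
      StructField("year", IntegerType, true),
      StructField("make", StringType, true),
      StructField("model", StringType, true),
      StructField("comment", StringType, true),
      StructField("blank", StringType, true)))  // note: the original declares "blank" twice,
                                                // which is probably unintended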

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly
http://apache-spark-user-list.1001560.n3.nabble.com/Tips-for-Spark-s-Random-Forest-slow-performance-tp25766.html

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-25 Thread Chris Fregly
> increasing the open file limit (this is done at the OS level). In some application use cases it is actually a legitimate need. > If that doesn't help, make sure you close any unused files and streams in your code. It will also be easier to help diagnose the issue if you send an error-reproducing snippet. > Regards, > Vijay Gharge

Re: Fat jar can't find jdbc

2015-12-25 Thread Chris Fregly
this is the problem I run into when I'm working with spark-shell and I forget to include my mysql-connector jar file on the classpath. > I've tried: > • Using different versions of mysql-connector-java in my build.sbt file

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Chris Fregly
sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource");
df.cache();
df.printSchema();   <-- prints the schema perfectly fine!
df.show();          <-- works perfectly fine (shows a table with 20 lines)!
df.registerTempTable("table");
df.select("select * from table limit 5").show();   <-- gives a weird exception

Exception is: AnalysisException: cannot resolve 'select * from table limit 5' given input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS

I can do a collect on a dataframe, but cannot select any specific columns, with either "select * from table" or "select VER, CREATED from table". I use Spark 1.5.2. The same code perfectly works through Zeppelin 0.5.5. Thanks. Be well! Jean Morozov
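the root cause is that DataFrame.select() expects column names/expressions, not a SQL statement - SQL text goes through the SQLContext; a sketch:

    // run SQL against the registered temp table
    df.registerTempTable("table")
    sqlContext.sql("select * from table limit 5").show()

    // or stay in the DataFrame API with column names
    df.select("VER", "CREATED").show(5)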

Re: Spark SQL 1.5.2 missing JDBC driver for PostgreSQL?

2015-12-25 Thread Chris Fregly
ons" >>>> ); >>>> When I run this against our PostgreSQL server, I get the following >>>> error. >>>> >>>> Error: java.sql.SQLException: No suitable driver found for >>>> jdbc:postgresql:/// >>>> (state=,cod

Re: Getting estimates and standard error using ml.LinearRegression

2015-12-25 Thread Chris Fregly
coefficient. PFB the code snippet:

val lr = new LinearRegression()
lr.setMaxIter(10)
  .setRegParam(0.01)
  .setFitIntercept(true)
val model = lr.fit(test)
val estimates = model.summary

Thanks and Regards, A
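for the standard errors specifically, a sketch assuming Spark 1.6+, where the training summary exposes R-like statistics when the "normal" solver is used:

    import org.apache.spark.ml.regression.LinearRegression

    val lr = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
      .setFitIntercept(true)
      .setSolver("normal")  // std errors/t-values/p-values require the normal-equation solver

    val model = lr.fit(test)
    println(model.coefficients)
    println(model.summary.coefficientStandardErrors.mkString(", "))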

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Chris Fregly
On Fri, Dec 25, 2015 at 2:17 PM, Chris Fregly <ch...@fregly.com> wrote: > I assume by "The same code perfectly works through Zeppelin 0.5.5" that you're using the %sql interpreter with your regular SQL SELECT statement, correct? > If so, the Zeppelin interpreter is

Re: Tips for Spark's Random Forest slow performance

2015-12-25 Thread Chris Fregly
this problem in the old mllib API, so > it might be something in the new ml that I'm missing. > I will dig deeper into the problem after the holidays. > 2015-12-25 16:26 GMT+01:00 Chris Fregly <ch...@fregly.com>: > so it looks like you're increasing num trees by 5x and yo

Re: Kafka - streaming from multiple topics

2015-12-20 Thread Chris Fregly
separating out your code into separate streaming jobs - especially when there are no dependencies between the jobs - is almost always the best route. it's easier to combine atoms (fusion) than to split them (fission). I recommend splitting out jobs along batch window, stream window, and

Re: spark 1.5.2 memory leak? reading JSON

2015-12-20 Thread Chris Fregly
hey Eran, I run into this all the time with Json. the problem is likely that your Json is "too pretty" and extends beyond a single line, which trips up the Json reader. my solution is usually to de-pretty the Json - either manually or through an ETL step - by stripping all white space before
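a sketch of the ETL-step version, assuming Spark 1.5 and one json document per file (paths are made up):

    // read.json expects exactly one complete json object per line, so collapse
    // each pretty-printed document onto a single line before handing it over
    val oneDocPerLine = sc.wholeTextFiles("hdfs:///raw/*.json")        // (path, fullContents)
      .map { case (_, doc) => doc.replaceAll("\\s*\\n\\s*", " ") }     // strip newlines + indentation

    val df = sqlContext.read.json(oneDocPerLine)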

Re: Hive error when starting up spark-shell in 1.5.2

2015-12-20 Thread Chris Fregly
hopping on a plane, but check the hive-site.xml that's in your spark/conf directory (or should be, anyway). I believe you can change the root path through this mechanism. if not, this should give you more info to google on. let me know, as this comes up a fair amount.

Re: How to do map join in Spark SQL

2015-12-20 Thread Chris Fregly
this type of broadcast should be handled by Spark SQL/DataFrames automatically. this is the primary cost-based, physical-plan query optimization that the Spark SQL Catalyst optimizer supports. in Spark 1.5 and before, you can trigger this optimization by properly setting the
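concretely, for Spark 1.5 it looks something like this (table and column names are hypothetical):

    // any table smaller than this threshold (in bytes) is broadcast to all workers
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)
    val joined = largeDF.join(smallDF, largeDF("id") === smallDF("id"))

    // or force the hint explicitly, regardless of estimated size
    import org.apache.spark.sql.functions.broadcast
    val joined2 = largeDF.join(broadcast(smallDF), largeDF("id") === smallDF("id"))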

Re: Pyspark SQL Join Failure

2015-12-20 Thread Chris Fregly
how does Spark SQL/DataFrame know that train_users_2.csv has a field named "id" or anything else domain specific? is there a header? if so, does sc.textFile() know about this header? I'd suggest using the Databricks spark-csv package for reading csv data. there is an option in there to
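a sketch of the spark-csv route (shown in Scala; the pyspark call is analogous, and the package version is an assumption):

    // launch with: --packages com.databricks:spark-csv_2.10:1.3.0
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")       // first line supplies column names such as "id"
      .option("inferSchema", "true")
      .load("train_users_2.csv")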

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Chris Fregly
have you tried to union the 2 streams per the KinesisWordCountASL example https://github.com/apache/spark/blob/branch-1.3/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L120 where 2 streams (against the same Kinesis stream in this case) are created
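the relevant shape of that example, sketched for the 1.3-era API (stream name, endpoint, and shard count are placeholders):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Duration
    import org.apache.spark.streaming.kinesis.KinesisUtils

    // one receiver per shard, then union them into a single DStream
    val numShards = 2
    val kinesisStreams = (0 until numShards).map { _ =>
      KinesisUtils.createStream(ssc, "myStream", "https://kinesis.us-east-1.amazonaws.com",
        Duration(2000), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
    }
    val unioned = ssc.union(kinesisStreams)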

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Chris Fregly
may be doing weird overwriting of the checkpoint information of both Kinesis streams into the same DynamoDB table. Either way, this is going to be fixed in Spark 1.4. On Thu, May 14, 2015 at 4:10 PM, Chris Fregly ch...@fregly.com wrote: have you tried to union the 2 streams per

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Chris Fregly
hey mike- as you pointed out here from my docs, changing the stream name is sometimes problematic due to the way the Kinesis Client Library manages leases and checkpoints, etc in DynamoDB. I noticed this directly while developing the Kinesis connector which is why I highlighted the issue

Re: Spark Streaming S3 Performance Implications

2015-03-21 Thread Chris Fregly
hey mike! you'll definitely want to increase your parallelism by adding more shards to the stream - as well as spinning up 1 receiver per shard and unioning all the shards per the KinesisWordCount example that is included with the kinesis streaming package.  you'll need more cores (cluster) or

Re: Connection pool in workers

2015-03-01 Thread Chris Fregly
hey AKM! this is a very common problem. the streaming programming guide addresses this issue here, actually: http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#design-patterns-for-using-foreachrdd the tl;dr is this: 1) you want to use foreachPartition() to operate on a whole
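the core of that pattern, as sketched in the programming guide (ConnectionPool here is a placeholder for whatever lazily-initialized, static pool your sink library provides):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // one connection borrowed per partition, not per record (and never on the driver)
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach(record => connection.send(record))
        ConnectionPool.returnConnection(connection)  // return to the pool for reuse
      }
    }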

Re: Pushing data from AWS Kinesis - Spark Streaming - AWS Redshift

2015-03-01 Thread Chris Fregly
Hey Mike- Great to see you're using the AWS stack to its fullest! I've already created the Kinesis-Spark Streaming connector with examples, documentation, tests, and everything. You'll need to build Spark from source with the -Pkinesis-asl profile, otherwise they won't be included in the build.

Re: Out of memory with Spark Streaming

2014-10-30 Thread Chris Fregly
curious about why you're only seeing 50 records max per batch. how many receivers are you running? what is the rate that you're putting data onto the stream? per the default AWS kinesis configuration, the producer can do 1000 PUTs per second with max 50k bytes per PUT and max 1mb per second per

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-10-29 Thread Chris Fregly
Chris Fregly ch...@fregly.com wrote: this would be awesome. did a jira get created for this? I searched, but didn't find one. thanks! -chris On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Thanks a lot Xiangrui. This will help

Re: Shared variable in Spark Streaming

2014-09-05 Thread Chris Fregly
good question, soumitra. it's a bit confusing. to break TD's code down a bit: dstream.count() is a transformation operation (returns a new DStream), executes lazily, runs in the cluster on the underlying RDDs that come through in that batch, and returns a new DStream with a single element
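in other words (a sketch):

    // count() is itself a transformation: a DStream of single-element RDDs,
    // one count per batch. an output operation is still needed to trigger it.
    val countsPerBatch = dstream.count()
    countsPerBatch.foreachRDD(rdd => println(s"records in this batch: ${rdd.first()}"))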

Re: spark RDD join Error

2014-09-04 Thread Chris Fregly
specifically, you're picking up the following implicit: import org.apache.spark.SparkContext.rddToPairRDDFunctions (in case you're a wildcard-phobe like me) On Thu, Sep 4, 2014 at 5:15 PM, Veeranagouda Mukkanagoudar veera...@gmail.com wrote: Thanks a lot, that fixed the issue :) On Thu,
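a tiny sketch of the fix (for pre-1.3 Spark, where the implicit isn't picked up automatically):

    // without this import, join() isn't available on RDD[(K, V)]
    import org.apache.spark.SparkContext.rddToPairRDDFunctions

    val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
    val right = sc.parallelize(Seq((1, "x"), (2, "y")))
    val joined = left.join(right)  // RDD[(Int, (String, String))]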

Re: saveAsSequenceFile for DStream

2014-08-30 Thread Chris Fregly
a couple things to add here: 1) you can import the org.apache.spark.streaming.dstream.PairDStreamFunctions implicit, which adds a whole ton of functionality to DStream itself. this lets you work at the DStream level versus digging into the underlying RDDs. 2) you can use ssc.fileStream(directory)
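and if you do drop to the RDD level, a sketch of writing one sequence file directory per batch (output path is made up; assumes the pre-1.3 implicit imports):

    import org.apache.spark.SparkContext._                 // implicits for saveAsSequenceFile
    import org.apache.spark.streaming.StreamingContext._   // PairDStreamFunctions implicits

    pairDStream.foreachRDD { (rdd, time) =>
      // one directory per batch interval, keyed by the batch timestamp
      rdd.saveAsSequenceFile(s"hdfs:///out/words-${time.milliseconds}")
    }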

Re: data locality

2014-08-30 Thread Chris Fregly
you can view the Locality Level of each task within a stage by using the Spark Web UI under the Stages tab. levels are as follows (in order of decreasing desirability): 1) PROCESS_LOCAL - data was found directly in the executor JVM 2) NODE_LOCAL - data was found on the same node as the executor

Re: Low Level Kafka Consumer for Spark

2014-08-28 Thread Chris Fregly
@bharat- overall, i've noticed a lot of confusion about how Spark Streaming scales - as well as how it handles failover and checkpointing, but we can discuss that separately. there's actually 2 dimensions to scaling here: receiving and processing. *Receiving* receiving can be scaled out by

Re: Kinesis receiver spark streaming partition

2014-08-28 Thread Chris Fregly
great question, wei. this is very important to understand from a performance perspective. and this extends beyond kinesis - it's for any streaming source that supports shards/partitions. i need to do a little research into the internals to confirm my theory. lemme get back to you! -chris

Re: Parsing Json object definition spanning multiple lines

2014-08-26 Thread Chris Fregly
i've seen this done using mapPartitions() where each partition represents a single, multi-line json file. you can rip through each partition (json file) and parse the json doc as a whole. this assumes you use sc.textFile(path/*.json) or equivalent to load in multiple files at once. each json

Re: Spark-Streaming collect/take functionality.

2014-08-26 Thread Chris Fregly
good suggestion, td. and i believe the optimization that jon.burns is referring to - from the big data mini course - is a step earlier: the sorting mechanism that produces sortedCounts. you can use mapPartitions() to get a top k locally on each partition, then shuffle only (k * # of partitions)
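a sketch of that two-phase top-k, assuming an RDD of (word, count) pairs named counts:

    val k = 10
    val topK = counts
      .mapPartitions { iter =>
        // local top k inside each partition; only k records per partition survive
        iter.toSeq.sortBy(-_._2).take(k).iterator
      }
      .collect()          // k * numPartitions records, not the whole dataset
      .sortBy(-_._2)      // final merge on the driver
      .take(k)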

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Chris Fregly
great work, Dibyendu. looks like this would be a popular contribution. expanding on bharat's question a bit: what happens if you submit multiple receivers to the cluster by creating and unioning multiple DStreams as in the kinesis example here:

Re: Spark on Yarn: Connecting to Existing Instance

2014-08-21 Thread Chris Fregly
perhaps the author is referring to Spark Streaming applications? they're examples of long-running applications. the application/domain-level protocol still needs to be implemented by you, as sandy pointed out. On Wed, Jul 9, 2014 at 11:03 AM, John Omernik j...@omernik.com wrote: So how do

Re: Spark Streaming - What does Spark Streaming checkpoint?

2014-08-21 Thread Chris Fregly
The StreamingContext can be recreated from a checkpoint file, indeed. check out the following Spark Streaming source files for details: StreamingContext, Checkpoint, DStream, DStreamCheckpoint, and DStreamGraph. On Wed, Jul 9, 2014 at 6:11 PM, Yan Fang yanfang...@gmail.com wrote: Hi guys,

Re: Task's Scheduler Delay in web ui

2014-08-19 Thread Chris Fregly
Scheduling Delay is the time required to assign a task to an available resource. if you're seeing large scheduler delays, this likely means that other jobs/tasks are using up all of the resources. here's some more info on how to set up Fair Scheduling versus the default FIFO Scheduler:
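the short version, sketched (the pool name is made up; the conf is then used to build the SparkContext):

    import org.apache.spark.SparkConf

    // switch the scheduler within the application from FIFO (default) to FAIR
    val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")

    // optionally, route jobs submitted from this thread into a named pool
    // defined in the file pointed to by spark.scheduler.allocation.file
    sc.setLocalProperty("spark.scheduler.pool", "production")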

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-08-19 Thread Chris Fregly
this would be awesome. did a jira get created for this? I searched, but didn't find one. thanks! -chris On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: Thanks a lot Xiangrui. This will help. On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com

Re: slower worker node in the cluster

2014-08-19 Thread Chris Fregly
perhaps creating Fair Scheduler Pools might help? there's no way to pin certain nodes to a pool, but you can specify minShares (CPUs). not sure if that would help, but it's worth looking into. On Tue, Jul 8, 2014 at 7:37 PM, haopu hw...@qilinsoft.com wrote: In a standalone cluster, is there way

Re: Spark Streaming source from Amazon Kinesis

2014-07-22 Thread Chris Fregly
i took this over from parviz. i recently submitted a new PR for Kinesis Spark Streaming support: https://github.com/apache/spark/pull/1434 others have tested it with good success, so give it a whirl! waiting for it to be reviewed/merged. please put any feedback into the PR directly. thanks!

Re: multiple passes in mapPartitions

2014-07-01 Thread Chris Fregly
also, multiple calls to mapPartitions() will be pipelined by the spark execution engine into a single stage, so the overhead is minimal. On Fri, Jun 13, 2014 at 9:28 PM, zhen z...@latrobe.edu.au wrote: Thank you for your suggestion. We will try it out and see how it performs. We think the

Re: Fw: How Spark Choose Worker Nodes for respective HDFS block

2014-07-01 Thread Chris Fregly
yes, spark attempts to achieve data locality (PROCESS_LOCAL or NODE_LOCAL) where possible, just like MapReduce. it's a best practice to co-locate your Spark Workers on the same nodes as your HDFS Data Nodes for just this reason. this is achieved through the RDD.preferredLocations() interface

Re: Selecting first ten values in a RDD/partition

2014-06-29 Thread Chris Fregly
as brian g alluded to earlier, you can use DStream.mapPartitions() to return the partition-local top 10 for each partition. once you collect the results from all the partitions, you can do a global top 10 merge sort across all partitions. this leads to a much, much smaller dataset to be shuffled

Re: Reconnect to an application/RDD

2014-06-29 Thread Chris Fregly
Tachyon is another option - this is the OFF_HEAP StorageLevel specified when persisting RDDs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.storage.StorageLevel or just use HDFS. this requires subsequent Applications/SparkContexts to reload the data from disk, of

Re: spark streaming question

2014-05-04 Thread Chris Fregly
great questions, weide. in addition, i'd also like to hear more about how to horizontally scale a spark-streaming cluster. i've gone through the samples (standalone mode) and read the documentation, but it's still not clear to me how to scale this puppy out under high load. i assume i add more

Re: Multiple Streams with Spark Streaming

2014-05-03 Thread Chris Fregly
if you want to use true Spark Streaming (not the same as Hadoop Streaming/Piping, as Mayur pointed out), you can use the DStream.union() method as described in the following docs: http://spark.apache.org/docs/0.9.1/streaming-custom-receivers.html

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-03 Thread Chris Fregly
not sure if this directly addresses your issue, peter, but it's worth mentioning a handy AWS EMR utility called s3distcp that can upload a single HDFS file - in parallel - to a single, concatenated S3 file once all the partitions are uploaded. kinda cool. here's some info:

Re: help me

2014-05-03 Thread Chris Fregly
as Mayur indicated, it's odd that you are seeing better performance from a less-local configuration. however, the non-deterministic behavior that you describe is likely caused by GC pauses in your JVM process. take note of the *spark.locality.wait* configuration parameter described here:

Re: function state lost when next RDD is processed

2014-04-13 Thread Chris Fregly
or how about the updateStateByKey() operation? https://spark.apache.org/docs/0.9.0/streaming-programming-guide.html the StatefulNetworkWordCount example demonstrates how to keep state across RDDs. On Mar 28, 2014, at 8:44 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Are you referring to
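the heart of that example, sketched (words is assumed to be a DStream[String]; the checkpoint path is made up):

    // updateStateByKey keeps a running value per key across batches;
    // it requires a checkpoint directory to be set on the context
    ssc.checkpoint("hdfs:///checkpoints")

    val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
      Some(newValues.sum + runningCount.getOrElse(0))

    val runningCounts = words.map(word => (word, 1)).updateStateByKey[Int](updateFunc)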

Re: network wordcount example

2014-03-31 Thread Chris Fregly
@eric- i saw this exact issue recently while working on the KinesisWordCount. are you passing local[2] to your example as the MASTER arg, versus just local or local[1]? you need at least 2. it's documented as n>1 in the scala source docs - which is easy to mistake for n=1. i just ran the
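for example (app name and batch interval are arbitrary):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // at least 2 local threads: one is consumed by the receiver,
    // leaving the rest to actually process the received batches
    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))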