cloud-based or on-premise environment
in minutes. we support AWS, Google Cloud, and Azure - basically anywhere that
runs Docker.
any time zone works. we're completely global with free 24x7 support for
everyone in the community.
thanks! hope this is useful.
Chris Fregly
Research Scientist @
use the Pivot Table feature of DataFrame, which is available
>>> since Spark 1.6. But the Spark version on the current cluster is 1.5. Can we
>>> install Spark 2.0 on the master node to work around this?
>>>
>>> Thanks!
>>>
>>
>>
>> --
>
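for reference, here's a minimal sketch of the DataFrame pivot API that became
available in Spark 1.6 (the DataFrame df and the column names "country",
"year", and "sales" are made-up placeholders, not from the original thread):

import org.apache.spark.sql.functions._

// groupBy().pivot().agg() turns the distinct values of "year" into columns,
// aggregating "sales" for each (country, year) pair
val pivoted = df
  .groupBy("country")
  .pivot("year")
  .agg(sum("sales"))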
are of the conditions placed on
>> the package. If you find that the general public are downloading such test
>> packages, then remove them.
>>
>
> On Tue, Aug 9, 2016 at 11:32 AM, Chris Fregly <ch...@fregly.com> wrote:
>
>> this is a valid question.
2.0.0 release,
>>> specifically on Maven + Java, what steps should we take to upgrade? I can't
>>> find the newer versions on Maven central.
>>>
>>> Thank you!
>>> Jestin
>>>
>>
>>
>
--
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
pipeline.io
advancedspark.com
eant "Machine Learning". I thought is a
>> > quite spread acronym. Sorry for the possible confusion.
>> >
>> >
>> > --
>> > Sergio Fernández
>> > Partner Technology Manager
>> > Redlink GmbH
>> > m: +43 6602747925
>> > e: sergio.fernan...@redlink.co
>> > w: http://redlink.co
>>
>>
>>
>
--
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
pipeline.io
advancedspark.com
wrote:
>> > Heya folks,
>> >
>> > Just wondering if there are any docs regarding using Kafka directly
>> > from the reader.stream?
>> > Has it been integrated already (I mean the source)?
>> >
>> > Sorry if the answer is RTFM (but
sr_1_3?ie=UTF8=1465657706=8-3=spark+mllib
>>>>>
>>>>>
>>>>> https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766/ref=sr_1_2?ie=UTF8=1465657706=8-2=spark+mllib
>>>>>
>>>>>
>>>>> On Sat, Jun 11, 2016 at 8:04 AM, Deepak Goel <deic...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hey
>>>>>>
>>>>>> Namaskara~Nalama~Guten Tag~Bonjour
>>>>>>
>>>>>> I am a newbie to Machine Learning (MLlib and other libraries on Spark)
>>>>>>
>>>>>> Which would be the best book to learn up?
>>>>>>
>>>>>> Thanks
>>>>>> Deepak
>>>>>>--
>>>>>> Keigu
>>>>>>
>>>>>> Deepak
>>>>>> 73500 12833
>>>>>> www.simtree.net, dee...@simtree.net
>>>>>> deic...@gmail.com
>>>>>>
>>>>>> LinkedIn: www.linkedin.com/in/deicool
>>>>>> Skype: thumsupdeicool
>>>>>> Google talk: deicool
>>>>>> Blog: http://loveandfearless.wordpress.com
>>>>>> Facebook: http://www.facebook.com/deicool
>>>>>>
>>>>>> "Contribute to the world, environment and more :
>>>>>> http://www.gridrepublic.org
>>>>>> "
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
--
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
http://pipeline.io
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar
>>>>>>
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar
>>>>>>
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar
>>>>>>
>>>>>> at org.elasticsearch.hadoop.util.Version.(Version.java:73)
>>>>>> at
>>>>>> org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
>>>>>> at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
>>>>>> at
>>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>> at
>>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>>> at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>> .. still tracking this down but was wondering if there is something
>>>>>> obvious I'm doing wrong. I'm going to take out
>>>>>> elasticsearch-hadoop-2.3.2.jar and try again.
>>>>>>
>>>>>> Lots of trial and error here :-/
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>>> --
>>>>>>
>>>>>> We’re hiring if you know of any awesome Java Devops or Linux
>>>>>> Operations Engineers!
>>>>>>
>>>>>> Founder/CEO Spinn3r.com
>>>>>> Location: *San Francisco, CA*
>>>>>> blog: http://burtonator.wordpress.com
>>>>>> … or check out my Google+ profile
>>>>>> <https://plus.google.com/102718274791889610666/posts>
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>>>> Engineers!
>>>>
>>>> Founder/CEO Spinn3r.com
>>>> Location: *San Francisco, CA*
>>>> blog: http://burtonator.wordpress.com
>>>> … or check out my Google+ profile
>>>> <https://plus.google.com/102718274791889610666/posts>
>>>>
>>>>
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
--
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
http://pipeline.io
>
>
>
>
>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
>
>
>
--
*Chris Fregly*
Research Scientist @ Flux Capacitor AI
"Bringing AI Back to the Future!"
San Francisco, CA
http://fluxcapacitor.ai
t you start when you start the application with spark-shell or
>>>>>>>>> spark-submit in the host where the cluster manager is running.
>>>>>>>>>
>>>>>>>>> The Driver node runs on the same host that the cluster manager
24 hours you
>> suggest).
>>
>> On Tue, May 17, 2016 at 1:36 AM, Todd <bit1...@163.com> wrote:
>>
>>> Hi,
>>> We have a requirement to do count(distinct) in a processing batch
>>> against all the streaming data (e.g., last 24 hours' data), that
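a rough sketch of one way to express this with a windowed DStream, assuming ids
is a DStream[String] of the values to be counted (note that a 24-hour window
means Spark has to retain a full day of data, so checkpointing and memory
sizing become the real concerns):

import org.apache.spark.streaming.Minutes

val distinctPerWindow = ids
  .window(Minutes(24 * 60), Minutes(5))   // 24-hour window, recomputed every 5 minutes
  .transform(rdd => rdd.distinct())       // de-duplicate within the window
  .count()                                // DStream[Long] with one count per slide
distinctPerWindow.print()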
this took me a bit to get working, but I finally got it up and running with
the package that Burak pointed out.
here are some relevant links to my project that should give you some clues:
perhaps renaming to Spark ML would actually clear up code and documentation
confusion?
+1 for rename
> On Apr 5, 2016, at 7:00 PM, Reynold Xin wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
>> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley
with production-ready Kafka Streams,
so I can try this out myself - and hopefully remove a lot of extra plumbing.
On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly <ch...@fregly.com> wrote:
> this is a very common pattern, yes.
>
> note that in Netflix's case, they're currently pushing al
and then sink the processed logs to both Elasticsearch and
> Kafka. So that Spark Streaming can pick up data from Kafka for the complex use
> cases, while Logstash filters can be used for the simpler use cases.
>
> I was wondering if someone has already done this evaluation and could
> p
gh for analytical queries
> that the OP wants; and MySQL is great but not scalable. Probably
> something like VectorWise, HANA, Vertica would work well, but those
> are mostly not free solutions. Druid could work too if the use case
> is right.
>
> Anyways, great discussion!
ately, I'm fairly ignorant as to the internal mechanics of ALS
> > itself. Is what I'm asking possible?
> >
> > Thank you,
> > Colin
>
>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
currently tracked by the DAGScheduler, which
> will be apportioning the Tasks from those concurrent Jobs across the
> available Executor cores.
>
> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com> wrote:
>
>> Good stuff, Evan. Looks like this is utilizing t
user-list.1001560.n3.nabble.com/Can-we-use-spark-inside-a-web-service-tp26426p26451.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
apache-spark-user-list.1001560.n3.nabble.com/Using-netlib-java-in-Spark-1-6-on-linux-tp26386p26392.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
>> Pardon the dumb thumb typos :)
>>
>> On Feb 28, 2016, at 1:48 PM, Ashok Kumar <ashok34...@yahoo.com.INVALID
>> <ashok34...@yahoo.com.invalid>> wrote:
>>
>> Hi Gurus,
>>
>> Appreciate if you recommend me a good book on Spark or documentatio
good catch on the cleaner.ttl
@jatin- when you say "memory-persisted RDD", what do you mean exactly? and
how much data are you talking about? remember that spark can evict these
memory-persisted RDDs at any time. they can be recovered from Kafka, but this
is not a good situation to be in.
if you need update notifications, you could introduce ZooKeeper (eek!) or a
Kafka queue between the jobs.
I've seen internal Kafka queues (relative to external spark streaming queues)
used for this type of incremental update use case.
think of the updates as transaction logs.
> On Feb 19,
he time in a meaningful way.
>
> Am I missing something obvious?
> thanks, Yael
>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
it is
>>>> Tungsten no? Please guide.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Do-we-need-to-enabled-Tungsten-sort-in-Spark-1-6-tp25923.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>>
>>>>
>>>
>>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
d not find any example with value prediction where features had
> dates in it.
>
> Thanks
>
> Jorge Machado
>
> Jorge Machado
> jo...@jmachado.me
>
>
t, List>() {
>     public List apply(List words) {
>         for (String word : words) {
>             logger.error("AEDWIP input word: {}", word);
>         }
>         return words;
>     }
> };
>
> return f;
> }
>
> @Override
> public DataType outputDataType() {
>     return DataTypes.createArrayType(DataTypes.StringType, true);
> }
> }
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
>> Hi,
>>>
>>> There is a parameter in the HashingTF called "numFeatures". I was
>>> wondering what is the best way to set the value of this parameter. In the
>>> use case of text categorization, do you need to know in advance the number
>>> of w
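as a point of reference, numFeatures just fixes the length of the hashed
feature vector, so you don't need the exact vocabulary size in advance - a
minimal sketch, assuming the spark.ml HashingTF and a "words" input column:

import org.apache.spark.ml.feature.HashingTF

// pick a power of two comfortably larger than the expected vocabulary
// to keep hash collisions rare
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)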
have an RDD that has been processed and persisted
>> to memory-only. I want to be able to run a count (actually
>> "countApproxDistinct") after filtering by a value that is unknown at compile
>> time (specified by the query).
>> >
>> > I've experimented with
@Jim-
I'm wondering if those docs are outdated, as it's my understanding (please
correct me if I'm wrong) that we should never be seeing OOMs, as 1.5/Tungsten not
only improved (reduced) the memory footprint of our data, but also introduced
better task-level - and even key-level - external spilling
are the credentials visible from each Worker node to all the Executor JVMs on
each Worker?
> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev
> wrote:
>
> Dear Spark community,
>
> I faced the following issue when trying to access data on S3a; my code is
hem?
>
> Thank you,
> Konstantin Kudryavtsev
>
> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly <ch...@fregly.com> wrote:
>
>> are the credentials visible from each Worker node to all the Executor
>> JVMs on each Worker?
>>
>> On Dec 30, 2015, at 12:45 PM, K
Has anyone used
> this approach yet and if so what has your experience been with using it? If
> it helps we'd be looking to implement it using Scala. Secondly, in general
> what has people's experience been with using experimental features in Spark?
>
>
>
> Cheers,
>
>
>
rmat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:927)
>> at
>> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:836)
>> at
>> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:702)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> I will be glad for any help on that matter.
>>
>> Regards
>> Dawid Wysakowicz
>>
>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
on quick glance, it appears that you're calling collect() in there, which is
bringing a huge amount of data down to the single Driver. this is why,
when you allocated more memory to the Driver, a different error emerges - most
definitely related to stop-the-world GC causing the node to
to use the new API
>> once it's available.
>>
>>
>> On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I am a newbie to Spark and a bit confused about RDDs and DataFrames in
>>> Spark.
>
which version of spark is this?
is there any chance that a single key - or set of keys - has a large number
of values relative to the other keys (a.k.a. skew)?
if so, spark 1.5 *should* fix this issue with the new tungsten stuff, although
I had some issues still with 1.5.1 in a similar
results.col("labelIndex"),
>>
>>
>> results.col("prediction"),
>>
>>
>> results.col("words"));
>>
>> exploreDF.show(10)
ff etc etc
>>>
>>>
>>> On 21 December 2015 at 14:53, Eran Witkon <eranwit...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I know it is a wide question but can you think of reasons why a pyspark
>>>> job which runs on from server 1 usi
ery executor, or is
>> there something other than driver and executor memory I can size?
>>
>>
>
e.spark.sql.types.StructType
>
> (fields:
> Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
> cannot be applied to (org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField,
> org.apache.spark.sql.types.StructField)
>val customSchema = StructType( StructField("year", IntegerType,
> true), StructField("make", StringType, true) ,StructField("model",
> StringType, true) , StructField("comment", StringType, true) ,
> StructField("blank", StringType, true),StructField("blank", StringType,
> true))
> ^
>Would really appreciate if somebody share the example which works with
> Spark 1.4 or Spark 1.5.0
>
> Thanks,
> Divya
>
> ^
>
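the compile error above is because StructType's constructor takes a single
Seq[StructField] rather than the fields as separate arguments - a minimal
sketch of the fix (works on Spark 1.4/1.5):

import org.apache.spark.sql.types._

// wrap the fields in an Array (or Seq) so StructType sees one argument
val customSchema = StructType(Array(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true),
  StructField("blank", StringType, true)
))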
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
rk-user-list.1001560.n3.nabble.com/Tips-for-Spark-s-Random-Forest-slow-performance-tp25766.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>>>>> increasing the open file limit (this is done on an os level). In some
>>>>> application use-cases it is actually a legitimate need.
>>>>>
>>>>> If that doesn't help, make sure you close any unused files and streams
>>>>> in your code. It will also be easier to help diagnose the issue if you
>>>>> send
>>>>> an error-reproducing snippet.
>>>>>
>>>>
>>>>
>>
>
> --
> Regards,
> Vijay Gharge
>
>
>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
this is the
>>>> problem I run into when I'm working with spark-shell and I forget to
>>>> include my mysql-connector jar file on my classpath.
>>>> >
>>>> > I've tried:
>>>> > • Using different versions of mysql-connector-java in my
>>>> build.sbt file
>>>> >
sqlc.load(filename,
>>>>>> "com.epam.parso.spark.ds.DefaultSource");
>>>>>> df.cache();
>>>>>> df.printSchema(); <-- prints the schema perfectly fine!
>>>>>>
>>>>>> df.show(); <-- Works perfectly fine (shows table
>>>>>> with 20 lines)!
>>>>>> df.registerTempTable("table");
>>>>>> df.select("select * from table limit 5").show(); <-- gives weird
>>>>>> exception
>>>>>>
>>>>>> Exception is:
>>>>>>
>>>>>> AnalysisException: cannot resolve 'select * from table limit 5' given
>>>>>> input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS
>>>>>>
>>>>>> I can do a collect on a dataframe, but cannot select any specific
>>>>>> columns either "select * from table" or "select VER, CREATED from table".
>>>>>>
>>>>>> I use spark 1.5.2.
>>>>>> The same code perfectly works through Zeppelin 0.5.5.
>>>>>>
>>>>>> Thanks.
>>>>>> --
>>>>>> Be well!
>>>>>> Jean Morozov
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
ons"
>>>> );
>>>> When I run this against our PostgreSQL server, I get the following
>>>> error.
>>>>
>>>> Error: java.sql.SQLException: No suitable driver found for
>>>> jdbc:postgresql:///
>>>> (state=,cod
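a minimal sketch of how this is often resolved, assuming Spark 1.4+ and
placeholder host/database names - the PostgreSQL JDBC jar also has to be on
the driver and executor classpath (e.g. via spark-submit --jars):

val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/dbname")   // placeholder host/db
  .option("dbtable", "public.mytable")                     // placeholder table
  .option("driver", "org.postgresql.Driver")               // name the driver class explicitly
  .load()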
ficient
>
> PFB the code snippet
>
> val lr = new LinearRegression()
> lr.setMaxIter(10)
> .setRegParam(0.01)
> .setFitIntercept(true)
> val model= lr.fit(test)
> val estimates = model.summary
>
>
> --
> Thanks and Regards
> A
, Dec 25, 2015 at 2:17 PM, Chris Fregly <ch...@fregly.com> wrote:
> I assume by "The same code perfectly works through Zeppelin 0.5.5" that
> you're using the %sql interpreter with your regular SQL SELECT statement,
> correct?
>
> If so, the Zeppelin interpreter is
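a minimal sketch of the distinction, using a placeholder table name:
DataFrame.select() expects column names, while full SQL statements go through
the SQLContext:

df.registerTempTable("events")                          // "events" is a placeholder name
sqlContext.sql("select * from events limit 5").show()   // full SQL goes through the SQLContext
df.select("VER", "CREATED").show()                      // select() itself takes column names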
is problem in the old mllib API so
> it might be something in the new ml that I'm missing.
> I will dig deeper into the problem after holidays.
>
> 2015-12-25 16:26 GMT+01:00 Chris Fregly <ch...@fregly.com>:
> > so it looks like you're increasing num trees by 5x and yo
separating out your code into separate streaming jobs - especially when there
are no dependencies between the jobs - is almost always the best route. it's
easier to combine atoms (fusion) than to split them (fission).
I recommend splitting out jobs along batch window, stream window, and
hey Eran, I run into this all the time with Json.
the problem is likely that your Json is "too pretty" and extends beyond a
single line, which trips up the Json reader.
my solution is usually to de-pretty the Json - either manually or through an
ETL step - by stripping all white space before
hopping on a plane, but check the hive-site.xml that's in your spark/conf
directory (or should be, anyway). I believe you can change the root path thru
this mechanism.
if not, this should give you more info to google on.
let me know as this comes up a fair amount.
> On Dec 19, 2015, at 4:58 PM,
this type of broadcast should be handled by Spark SQL/DataFrames automatically.
this is the primary cost-based, physical-plan query optimization that the Spark
SQL Catalyst optimizer supports.
in Spark 1.5 and before, you can trigger this optimization by properly setting
the
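presumably this refers to spark.sql.autoBroadcastJoinThreshold - a minimal
sketch, assuming a SQLContext is in scope; tables smaller than the threshold
(in bytes) are broadcast to every executor for the join:

// 100 MB threshold: anything smaller gets broadcast instead of shuffled
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)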
how does Spark SQL/DataFrame know that train_users_2.csv has a field named,
"id" or anything else domain specific? is there a header? if so, does
sc.textFile() know about this header?
I'd suggest using the Databricks spark-csv package for reading csv data. there
is an option in there to
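a minimal sketch with the Databricks spark-csv package on Spark 1.x, assuming
the file has a header row (which is how the reader learns about a column like
"id"):

val users = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // take column names from the first line
  .option("inferSchema", "true")   // optionally infer column types as well
  .load("train_users_2.csv")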
have you tried to union the 2 streams per the KinesisWordCountASL example
https://github.com/apache/spark/blob/branch-1.3/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L120
where
2 streams (against the same Kinesis stream in this case) are created
may be doing weird
overwriting checkpoint information of both Kinesis streams into the same
DynamoDB table. Either ways, this is going to be fixed in Spark 1.4.
On Thu, May 14, 2015 at 4:10 PM, Chris Fregly ch...@fregly.com wrote:
have you tried to union the 2 streams per
hey mike-
as you pointed out here from my docs, changing the stream name is sometimes
problematic due to the way the Kinesis Client Library manages leases and
checkpoints, etc in DynamoDB.
I noticed this directly while developing the Kinesis connector which is why I
highlighted the issue
hey mike!
you'll definitely want to increase your parallelism by adding more shards to
the stream - as well as spinning up 1 receiver per shard and unioning all the
shards per the KinesisWordCount example that is included with the kinesis
streaming package.
you'll need more cores (cluster) or
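a rough sketch of the one-receiver-per-shard-plus-union pattern, loosely
following KinesisWordCountASL and assuming a StreamingContext ssc; the app
name, stream name, endpoint, and region below are placeholders:

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils

val numShards = 4
val shardStreams = (0 until numShards).map { _ =>
  // one receiver per shard; each receiver occupies a core on the cluster
  KinesisUtils.createStream(ssc, "myKinesisApp", "myStream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Seconds(2), StorageLevel.MEMORY_AND_DISK_2)
}
val unionedStream = ssc.union(shardStreams)   // union into a single DStream for processing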
hey AKM!
this is a very common problem. the streaming programming guide addresses
this issue here, actually:
http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
the tl;dr is this:
1) you want to use foreachPartition() to operate on a whole
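the gist of that pattern, as a minimal sketch - ConnectionPool here is a
placeholder for whatever pooling/client code you already have, not a Spark API:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // open (or borrow) one connection per partition instead of one per record
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)   // return it for reuse by later batches
  }
}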
Hey Mike-
Great to see you're using the AWS stack to its fullest!
I've already created the Kinesis-Spark Streaming connector with examples,
documentation, test, and everything. You'll need to build Spark from
source with the -Pkinesis-asl profile, otherwise they won't be included in
the build.
curious about why you're only seeing 50 records max per batch.
how many receivers are you running? what is the rate that you're putting
data onto the stream?
per the default AWS kinesis configuration, the producer can do 1000 PUTs
per second with max 50k bytes per PUT and max 1mb per second per
, Chris Fregly ch...@fregly.com wrote:
this would be awesome. did a jira get created for this? I searched, but
didn't find one.
thanks!
-chris
On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani
rahulbhojwani2...@gmail.com
wrote:
Thanks a lot Xiangrui. This will help
good question, soumitra. it's a bit confusing.
to break TD's code down a bit:
dstream.count() is a transformation operation (returns a new DStream),
executes lazily, runs in the cluster on the underlying RDDs that come
through in that batch, and returns a new DStream with a single element
specifically, you're picking up the following implicit:
import org.apache.spark.SparkContext.rddToPairRDDFunctions
(in case you're a wildcard-phobe like me)
On Thu, Sep 4, 2014 at 5:15 PM, Veeranagouda Mukkanagoudar
veera...@gmail.com wrote:
Thanks a lot, that fixed the issue :)
On Thu,
couple things to add here:
1) you can import the
org.apache.spark.streaming.dstream.PairDStreamFunctions implicit which adds
a whole ton of functionality to DStream itself. this lets you work at the
DStream level versus digging into the underlying RDDs.
2) you can use ssc.fileStream(directory)
you can view the Locality Level of each task within a stage by using the
Spark Web UI under the Stages tab.
levels are as follows (in order of decreasing desirability):
1) PROCESS_LOCAL - data was found directly in the executor JVM
2) NODE_LOCAL - data was found on the same node as the executor
3) NO_PREF - data has no locality preference
4) RACK_LOCAL - data was found on a different node in the same rack
5) ANY - data was found elsewhere on the cluster
@bharat-
overall, i've noticed a lot of confusion about how Spark Streaming scales -
as well as how it handles failover and checkpointing, but we can discuss
that separately.
there's actually 2 dimensions to scaling here: receiving and processing.
*Receiving*
receiving can be scaled out by
great question, wei. this is very important to understand from a
performance perspective. and this extends is beyond kinesis - it's for any
streaming source that supports shards/partitions.
i need to do a little research into the internals to confirm my theory.
lemme get back to you!
-chris
i've seen this done using mapPartitions() where each partition represents a
single, multi-line json file. you can rip through each partition (json
file) and parse the json doc as a whole.
this assumes you use sc.textFile(path/*.json) or equivalent to load in
multiple files at once. each json
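a rough sketch of that approach, assuming each input file is a single
multi-line json doc and maps to exactly one partition:

val docs = sc.textFile("path/*.json")
  .mapPartitions { lines =>
    // reassemble the pretty-printed document so a json parser can see it whole
    Iterator(lines.mkString("\n"))
  }
// docs is now an RDD[String] with one complete json document per element,
// ready for whatever json parser you prefer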
good suggestion, td.
and i believe the optimization that jon.burns is referring to - from the
big data mini course - is a step earlier: the sorting mechanism that
produces sortedCounts.
you can use mapPartitions() to get a top k locally on each partition, then
shuffle only (k * # of partitions)
great work, Dibyendu. looks like this would be a popular contribution.
expanding on bharat's question a bit:
what happens if you submit multiple receivers to the cluster by creating
and unioning multiple DStreams as in the kinesis example here:
perhaps the author is referring to Spark Streaming applications? they're
examples of long-running applications.
the application/domain-level protocol still needs to be implemented
yourself, as sandy pointed out.
On Wed, Jul 9, 2014 at 11:03 AM, John Omernik j...@omernik.com wrote:
So how do
The StreamingContext can be recreated from a checkpoint file, indeed.
check out the following Spark Streaming source files for details:
StreamingContext, Checkpoint, DStream, DStreamCheckpoint, and DStreamGraph.
On Wed, Jul 9, 2014 at 6:11 PM, Yan Fang yanfang...@gmail.com wrote:
Hi guys,
Scheduling Delay is the time required to assign a task to an available
resource.
if you're seeing large scheduler delays, this likely means that other
jobs/tasks are using up all of the resources.
here's some more info on how to setup Fair Scheduling versus the default
FIFO Scheduler:
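a minimal sketch of switching to fair scheduling, assuming a fairscheduler.xml
allocation file (the path and pool name are placeholders):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

// jobs submitted from this thread after the call below go to the named pool
sc.setLocalProperty("spark.scheduler.pool", "production")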
this would be awesome. did a jira get created for this? I searched, but
didn't find one.
thanks!
-chris
On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com
wrote:
Thanks a lot Xiangrui. This will help.
On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng men...@gmail.com
perhaps creating Fair Scheduler Pools might help? there's no way to pin
certain nodes to a pool, but you can specify minShares (CPUs). not sure
if that would help, but worth looking into.
On Tue, Jul 8, 2014 at 7:37 PM, haopu hw...@qilinsoft.com wrote:
In a standalone cluster, is there way
i took this over from parviz.
i recently submitted a new PR for Kinesis Spark Streaming support:
https://github.com/apache/spark/pull/1434
others have tested it with good success, so give it a whirl!
waiting for it to be reviewed/merged. please put any feedback into the PR
directly.
thanks!
also, multiple calls to mapPartitions() will be pipelined by the spark
execution engine into a single stage, so the overhead is minimal.
On Fri, Jun 13, 2014 at 9:28 PM, zhen z...@latrobe.edu.au wrote:
Thank you for your suggestion. We will try it out and see how it performs.
We
think the
yes, spark attempts to achieve data locality (PROCESS_LOCAL or NODE_LOCAL)
where possible just like MapReduce. it's a best practice to co-locate your
Spark Workers on the same nodes as your HDFS Data Nodes for just this
reason.
this is achieved through the RDD.preferredLocations() interface
as brian g alluded to earlier, you can use DStream.mapPartitions() to
return the partition-local top 10 for each partition. once you collect
the results from all the partitions, you can do a global top 10 merge sort
across all partitions.
this leads to a much much-smaller dataset to be shuffled
Tachyon is another option - this is the off heap StorageLevel specified
when persisting RDDs:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.storage.StorageLevel
or just use HDFS. this requires subsequent Applications/SparkContext's to
reload the data from disk, of
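for completeness, a one-line sketch of the off-heap option mentioned above
(in Spark 1.x this StorageLevel is backed by Tachyon):

import org.apache.spark.storage.StorageLevel
rdd.persist(StorageLevel.OFF_HEAP)   // data lives off-heap in Tachyon rather than in the executor JVM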
great questions, weide. in addition, i'd also like to hear more about how
to horizontally scale a spark-streaming cluster.
i've gone through the samples (standalone mode) and read the documentation,
but it's still not clear to me how to scale this puppy out under high load.
i assume i add more
if you want to use true Spark Streaming (not the same as Hadoop
Streaming/Piping, as Mayur pointed out), you can use the DStream.union()
method as described in the following docs:
http://spark.apache.org/docs/0.9.1/streaming-custom-receivers.html
not sure if this directly addresses your issue, peter, but it's worth
mentioning a handy AWS EMR utility called s3distcp that can upload a single
HDFS file - in parallel - to a single, concatenated S3 file once all the
partitions are uploaded. kinda cool.
here's some info:
as Mayur indicated, it's odd that you are seeing better performance from a
less-local configuration. however, the non-deterministic behavior that you
describe is likely caused by GC pauses in your JVM process.
take note of the *spark.locality.wait* configuration parameter described
here:
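a minimal sketch of tuning it, assuming the default of 3 seconds is too long
for your workload (the value is read as milliseconds when no unit is given):

import org.apache.spark.SparkConf

// wait at most 1 second for a process/node-local slot before falling back
// to a less-local executor
val conf = new SparkConf().set("spark.locality.wait", "1000")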
or how about the UpdateStateByKey() operation?
https://spark.apache.org/docs/0.9.0/streaming-programming-guide.html
the StatefulNetworkWordCount example demonstrates how to keep state across RDDs.
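a minimal sketch in the spirit of StatefulNetworkWordCount, assuming wordCounts
is a DStream[(String, Int)] and that ssc.checkpoint(...) has been set
(updateStateByKey requires checkpointing):

val runningCounts = wordCounts.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  // fold this batch's counts into the running total carried across batches
  Some(newValues.sum + state.getOrElse(0))
}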
On Mar 28, 2014, at 8:44 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Are you referring to
@eric-
i saw this exact issue recently while working on the KinesisWordCount.
are you passing local[2] to your example as the MASTER arg versus just
local or local[1]?
you need at least 2. it's documented as n>1 in the scala source docs -
which is easy to mistake for n=1.
i just ran the