Thank you Yanbo,
It looks like this is available only in version 1.6.
Can you tell me how/when I can download version 1.6?
Thanks and Regards,
Vishnu Viswanath,
On Wed, Dec 2, 2015 at 4:37 AM, Yanbo Liang wrote:
> You can set "handleInvalid" to "skip", which helps you skip
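For reference, a minimal Scala sketch of what this looks like in 1.6, assuming the parameter in question is StringIndexer's handleInvalid (the DataFrame df and column names are made up):

import org.apache.spark.ml.feature.StringIndexer

// df is a hypothetical DataFrame with a string column "category".
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip") // rows whose label was not seen during fit are dropped instead of failing
val indexed = indexer.fit(df).transform(df)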
Again, just to be clear, silently throwing away data because your system
isn't working right is not the same as "recover from any Kafka leader
changes and offset out of ranges issue".
On Tue, Dec 1, 2015 at 11:27 PM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:
> Hi, if you
Here's what I set in a shell script to start the notebook:
export PYSPARK_PYTHON=~/anaconda/bin/python
export PYSPARK_DRIVER_PYTHON=~/anaconda/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
If you want to use HiveContext w/CDH:
export HADOOP_CONF_DIR=/etc/hive/conf
Then just run
Hi
I'm using PCA through the python interface for spark, as per the
instructions on this page:
https://spark.apache.org/docs/1.5.1/ml-features.html#pca
It works fine for learning the parameters and transforming the data.
However, I'm unable to find a way to retrieve the learnt PCA parameters. I
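One possible workaround, sketched in Scala under the assumption that the RDD-based mllib API is acceptable: RowMatrix.computePrincipalComponents returns the principal-components matrix directly (the input vectors below are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical RDD of feature vectors.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(2.0, 1.0, 0.0),
  Vectors.dense(4.0, 3.0, 1.0)))

val mat = new RowMatrix(rows)
// The learnt parameters: a local matrix whose columns are the top-k principal components.
val pc = mat.computePrincipalComponents(2)
println(pc)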
Hi,
I have a Spark job with many transformations (sequence of maps and
mapPartitions) and only one action in the end (DataFrame.write()). The
transformations return an RDD, so I need to create a DataFrame.
To be able to use sqlContext.createDataFrame() I need to know the schema of
the Row but for
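A minimal Scala sketch of one way to do this, assuming the schema is known up front (the column names, types, and output path are placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical RDD[Row] produced by the chain of maps/mapPartitions.
val rowRDD = sc.parallelize(Seq(Row("a", 1), Row("b", 2)))

// Declare the schema explicitly, then build the DataFrame and write it out.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("count", IntegerType, nullable = false)))

val df = sqlContext.createDataFrame(rowRDD, schema)
df.write.parquet("/tmp/example-output") // hypothetical output path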
Hi All,
Is there any Pub-Sub for JMS provided by Spark out of the box, like Kafka?
Thanks.
Regards,
Sam
Using parallelize() on a dataset I'm only seeing two tasks rather than
the number of cores in the Mesos cluster. This is with Spark 1.5.1 and
using the Mesos coarse-grained scheduler.
Running pyspark in a console seems to show that it's taking a while
before the Mesos executors come online
Do you think it's a security issue if EMR is started in a VPC with a subnet
having Auto-assign Public IP: Yes?
You can remove all inbound rules having 0.0.0.0/0 as the source in the master and
slave security groups.
That way, the master and slave boxes will be accessible only to users who are on the VPN.
On Wed, Dec 2, 2015
On Tue, Dec 1, 2015 at 12:45 PM, Charles Allen
wrote:
> Is there a way to pass configuration file resources to be resolvable through
> the classloader?
Not in general. If you're using YARN, you can cheat and use
"spark.yarn.dist.files" which will place those files
EMR was a pain to configure on a private VPC last I tried. Has anyone had
success with that? I found spark-ec2 easier to use with private networking,
but also agree that I would use it for prod.
-Dana
On Dec 1, 2015 12:29 PM, "Alexander Pivovarov" wrote:
> 1. Emr 4.2.0 has
Hi All,
I have the following use case for Spark Streaming -
There are 2 streams of data, say FlightBookings and Ticket.
For each ticket, I need to associate it with the relevant Booking info. There
are distinct applications for Booking and Ticket. The Booking streaming
application processes the
you might also coalesce to 1 (or some small number) before writing to avoid
creating a lot of files in that partition if you know that there is not a
ton of data.
On Wed, Dec 2, 2015 at 12:59 AM, Rishi Mishra wrote:
> As long as all your data is being inserted by Spark ,
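A minimal sketch of that suggestion, assuming an existing DataFrame df and a made-up partition directory:

import org.apache.spark.sql.SaveMode

// Collapse to a single partition so the write produces one file instead of many small ones.
df.coalesce(1)
  .write
  .mode(SaveMode.Append)
  .parquet("/data/table/date=2015-12-02") // hypothetical partition path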
Hi Ted and Felix,
From: Ted Yu
Date: Sunday, November 29, 2015 at 10:37 AM
To: Andrew Davidson
Cc: Felix Cheung , "user @spark"
Subject: Re: possible bug spark/python/pyspark/rdd.py
Hi Dana,
Yes, we got VPC + EMR working, but I'm not the person who deploys it. It is
related to the subnet, as Alex points out.
Just to add another point: spark-ec2 is nice to keep and improve
because it allows users to run any version of Spark (a nightly build, for
example). EMR does not allow you
On Tue, Dec 1, 2015 at 9:43 PM, Anfernee Xu wrote:
> But I have a single server(JVM) that is creating SparkContext, are you
> saying Spark supports multiple SparkContext in the same JVM? Could you
> please clarify on this?
I'm confused. Nothing you said so far requires
Use the direct stream. You can put multiple topics in a single stream, and
differentiate them on a per-partition basis using the offset range.
On Wed, Dec 2, 2015 at 2:13 PM, dutrow wrote:
> I found the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2388
>
> It
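A minimal sketch of that approach, assuming an existing StreamingContext ssc; the broker addresses and topic names are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("bookings", "tickets")

// One direct stream covering several topics.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // The i-th partition of the RDD corresponds to the i-th offset range,
  // which carries the topic name.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, iter) =>
    iter.map { case (_, value) => (offsetRanges(i).topic, value) }
  }.collect().foreach(println)
}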
My need is similar; I have 10+ topics and don't want to dedicate 10 cores to
processing all of them. Like yourself and others, the (String, String) pair
that comes out of the DStream has (null, StringData...) values instead of
(topic name, StringData...)
Did anyone ever find a way around this
Sigh... I want to use the direct stream and have recently brought in Redis
to persist the offsets, but I really like and need to have realtime metrics
on the GUI, so I'm hoping to have Direct and Receiver stream both working.
On Wed, Dec 2, 2015 at 3:17 PM Cody Koeninger
I found the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2388
It was marked as invalid.
Data:
+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|0.13271745268556925|[-0.2006809895664...|
|0.23956421080605234|[-0.0938342314459...|
|0.47464690691431843|[0.14124846466227...|
|
I still have to propagate the file into the directory somehow, and also
that's marked as only for legacy jobs (deprecated?), so no, I have not
experimented with it yet.
On Wed, Dec 2, 2015 at 12:53 AM Rishi Mishra wrote:
> Did you try to use
As long as all your data is being inserted by Spark , hence using the same
hash partitioner, what Fengdong mentioned should work.
On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu
wrote:
> Hi
> you can try:
>
> if your table under location “/test/table/“ on HDFS
> and has
Do you want to load multiple tables by using SQL? JdbcRelation currently
can only load a single table; it doesn't accept SQL as the loading command.
On Wed, Dec 2, 2015 at 4:33 PM, censj wrote:
> hi Fengdong Yu:
> I want to use sqlContext.read.format('jdbc').options( ... ).load()
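For reference, a minimal sketch of a single-table JDBC load; the URL, table name, and driver class below are all placeholders:

// A single-table JDBC load through the DataFrame reader.
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://db-host:3306/sales",
  "dbtable" -> "tickets",
  "driver"  -> "com.mysql.jdbc.Driver")).load()
df.show()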
Mich, did you run this locally or on EC2 (I use EC2)? Is this problem
universal or specific to, say, EC2? Many thanks
From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Wednesday, December 02, 2015 5:01 PM
To: Lin, Hao; user@spark.apache.org
Subject: RE: starting spark-shell throws
I actually don't have the folder /tmp/hive created on my master node; is that a
problem?
From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: Wednesday, December 02, 2015 5:40 PM
To: Lin, Hao; user@spark.apache.org
Subject: RE: starting spark-shell throws /tmp/hive on HDFS should be writable
Please disregard the "window" functions...it turns out that was development
code. Everything else is correct.
val rawLEFT: DStream[String] = ssc.textFileStream(dirLEFT).window(Seconds(30))
val rawRIGHT: DStream[String] = ssc.textFileStream(dirRIGHT).window(Seconds(30))
should be
val
Hi all,
I have an app streaming from S3 (textFileStream), and recently I've observed
an increasing delay and a long time to list files:
INFO dstream.FileInputDStream: Finding new files took 394160 ms
...
INFO scheduler.JobScheduler: Total delay: 404.796 s for time 144910020
ms (execution: 10.154
The consumer I mentioned does not silently throw away data. If the offset is
out of range, it starts from the earliest offset, and that is the correct way
to recover from this error.
Dibyendu
On Dec 2, 2015 9:56 PM, "Cody Koeninger" wrote:
> Again, just to be clear, silently throwing
You may try to set Hadoop conf "parquet.enable.summary-metadata" to
false to disable writing Parquet summary files (_metadata and
_common_metadata).
By default Parquet writes the summary files by collecting footers of all
part-files in the dataset while committing the job. Spark also follows
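A minimal sketch of setting that conf before writing, assuming an existing SparkContext sc and DataFrame df (the output path is made up):

// Disable Parquet summary files (_metadata and _common_metadata) for subsequent writes.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
df.write.parquet("hdfs:///data/events") // hypothetical output path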
I believe that what differentiates reliable systems is that individual
components fail fast when their preconditions aren't met, and other
components are responsible for monitoring them.
If a user of the direct stream thinks that your approach of restarting and
ignoring data loss is the
Yeah, that's the example from the link I just posted.
-Sahil
On Thu, Dec 3, 2015 at 11:41 AM, Akhil Das
wrote:
> Something like this?
>
> val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
> df.select("name",
>
Hello,
I have been trying to understand the LDA topic modeling example provided here:
https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda.
In the example, they load word count vectors from a text file that contains
these word counts and then they output
Something like this?
val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
It will save the name and favorite_color columns to a Parquet file. You can
read more information over here
If you haven't unsubscribed already, shoot an email to
user-unsubscr...@spark.apache.org Also read more here
http://spark.apache.org/community.html
Thanks
Best Regards
On Thu, Nov 26, 2015 at 7:51 AM, ngocan211 . wrote:
>
>
Did you go through the executor logs completely? A "Futures timed out"
exception mostly occurs when one of the tasks/jobs spends way too much time
and fails to respond; this happens when there's a GC pause or memory
overhead.
Thanks
Best Regards
On Tue, Dec 1, 2015 at 12:09 AM, Spark Newbie
This is very interesting.
Thanks!!!
On Thu, Dec 3, 2015 at 8:28 AM, Sudhanshu Janghel <
sudhanshu.jang...@cloudwick.com> wrote:
> Hi,
>
> Here is a doc that I had created for my team. This has steps along with
> snapshots of how to setup debugging in spark using IntelliJ locally.
>
>
>
No, silently restarting from the earliest offset in the case of offset out
of range exceptions during a streaming job is not the "correct way of
recovery".
If you do that, your users will be losing data without knowing why. It's
more like a "way of ignoring the problem without actually
PTAL:
http://stackoverflow.com/questions/29213404/how-to-split-an-rdd-into-multiple-smaller-rdds-given-a-max-number-of-rows-per
-Sahil
On Thu, Dec 3, 2015 at 9:18 AM, Ram VISWANADHA <
ram.viswana...@dailymotion.com> wrote:
> Yes. That did not help.
>
> Best Regards,
> Ram
> From: Ted Yu
Did you see: http://spark.apache.org/docs/latest/sql-programming-guide.html
-Sahil
On Thu, Dec 3, 2015 at 11:35 AM, fightf...@163.com
wrote:
> HI,
> How could I save the spark sql cli running queries results and write the
> results to some local file ?
> Is there any
+1 looks like a bug
I think referencing trades() twice in multiplication is broken,
scala> trades.select(trades("quantity")*trades("quantity")).show
+---------------------+
|(quantity * quantity)|
+---------------------+
|                 null|
|                 null|
scala>
Very interested in that topic too, thanks Cheng for the direction!
We'll give it a try as well.
On 3 December 2015 at 01:40, Cheng Lian wrote:
> You may try to set Hadoop conf "parquet.enable.summary-metadata" to false
> to disable writing Parquet summary files
Yes. That did not help.
Best Regards,
Ram
From: Ted Yu >
Date: Wednesday, December 2, 2015 at 3:25 PM
To: Ram VISWANADHA
>
Cc: user
Are you reading a CSV file? If so, you can use spark-csv, which supports
skipping the header:
http://spark-packages.org/package/databricks/spark-csv
On Thu, Dec 3, 2015 at 10:52 AM, Divya Gehlot
wrote:
> Hi,
> I am a newbie to Spark and Scala.
> One of my requirements is to read and
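A minimal sketch of the spark-csv route, with a made-up input path; the package has to be on the classpath:

// Read CSV files, treating the first line of each file as a header.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // skip/consume the header line
  .option("inferSchema", "true") // optionally infer column types
  .load("hdfs:///data/input/*.csv") // hypothetical input path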
Oops 3 mins late. :)
Thanks
Best Regards
On Thu, Dec 3, 2015 at 11:49 AM, Sahil Sareen wrote:
> Yeah, Thats the example from the link I just posted.
>
> -Sahil
>
> On Thu, Dec 3, 2015 at 11:41 AM, Akhil Das
> wrote:
>
>> Something like this?
>>
You could use "filter" to eliminate headers from your text file RDD while
going over each line.
-Sahil
On Thu, Dec 3, 2015 at 9:37 AM, Jeff Zhang wrote:
> Are you reading a CSV file? If so, you can use spark-csv, which supports
> skipping the header:
>
>
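A minimal sketch of the filter approach, assuming every file starts with the same header line (the path is a placeholder):

// Drop the header line from a plain-text RDD.
val lines = sc.textFile("hdfs:///data/input.csv")
val header = lines.first()
val data = lines.filter(_ != header)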
Hi,
Here is a doc that I had created for my team. This has steps along with
snapshots of how to setup debugging in spark using IntelliJ locally.
https://docs.google.com/a/cloudwick.com/document/d/13kYPbmK61di0f_XxxJ-wLP5TSZRGMHE6bcTBjzXD0nA/edit?usp=sharing
Kind Regards,
Sudhanshu
On Thu, Dec
Thanks Gourav! I will refer to Google for this.
Regards,
Vijay Gharge
On Thu, Dec 3, 2015 at 1:26 PM, Gourav Sengupta
wrote:
> Vijay,
>
> please Google for AWS lambda + S3 there are several used cases available.
> Lambda are event based triggers and are executed when
Hi,
I am a newbie to Spark and Scala.
One of my requirements is to read and process multiple text files with
headers using the DataFrame API.
How can I skip headers when processing data with the DataFrame API?
Thanks in advance.
Regards,
Divya
Well, even if you set retention correctly and increase speed, OffsetOutOfRange
can still occur depending on how your downstream processing behaves. And if that
happens, there is no other way to recover the old messages. So the best bet here,
from the streaming job's point of view, is to start from the earliest offset rather
Hello Gourav,
Can you please elaborate on the "trigger" part?
Any reference link would be really useful!
On Thursday 3 December 2015, Gourav Sengupta
wrote:
> Hi,
>
> And so you have the money to keep a SPARK cluster up and running? The way
> I make it work is test the
Not quite sure what's happening, but it's not an issue with multiplication, I
guess, as the following query worked for me:
trades.select(trades("price")*9.5).show
+-------------+
|(price * 9.5)|
+-------------+
|        199.5|
|        228.0|
|        190.0|
|        199.5|
|        190.0|
|
Hi,
How can I save the results of queries run from the Spark SQL CLI and write
them to a local file?
Is there any available command?
Thanks,
Sun.
fightf...@163.com
This doc will get you started
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
Thanks
Best Regards
On Sun, Nov 29, 2015 at 9:48 PM, Masf wrote:
> Hi
>
> Is it possible to debug spark locally with IntelliJ or another
Have you seen this thread ?
http://search-hadoop.com/m/q3RTtvmsYMv0tKh2=Re+Upgrading+Spark+in+EC2+clusters
On Wed, Dec 2, 2015 at 2:39 PM, Andy Davidson wrote:
> I am using spark-1.5.1-bin-hadoop2.6. I used
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a
JavaRDD.saveAsTextFile is taking a long time to succeed. There are 10 tasks;
the first 9 complete in a reasonable time, but the last task is taking a long
time to complete. The last task contains most of the records, around 90%
of the total. Is there any way to
Have you tried calling coalesce() before saveAsTextFile ?
Cheers
On Wed, Dec 2, 2015 at 3:15 PM, Ram VISWANADHA <
ram.viswana...@dailymotion.com> wrote:
> JavaRDD.saveAsTextFile is taking a long time to succeed. There are 10
> tasks, the first 9 complete in a reasonable time but the last task
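A minimal Scala sketch of the two knobs involved, with made-up paths and partition counts; coalesce merges partitions without a shuffle, while repartition shuffles and can spread out a skewed partition:

val rdd = sc.textFile("hdfs:///input/data") // hypothetical input
rdd.coalesce(4).saveAsTextFile("hdfs:///output/fewer-files")    // fewer output files, no shuffle
rdd.repartition(16).saveAsTextFile("hdfs:///output/rebalanced") // full shuffle, evens out skew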
Hi,
And so you have the money to keep a SPARK cluster up and running? The way I
make it work is to test the code on a local system with a localised Spark
installation, then create a data pipeline triggered by Lambda which starts a
SPARK cluster, processes the data via SPARK steps, and then terminates
I'm new to Apache Spark and an absolute beginner. I'm playing around with
Spark Streaming (API version 1.5.1) in Java and want to implement a
prototype which uses HyperLogLog to estimate distinct elements. I use the
stream-lib from clearspring (https://github.com/addthis/stream-lib).
I planned
Hi folks,
You're probably busy, but any update on this? :)
On 16 November 2015 at 16:04, Adrien Mogenet <
adrien.moge...@contentsquare.com> wrote:
> Name: Content Square
> URL: http://www.contentsquare.com
>
> Description:
> We use Spark to regularly read raw data, convert them into Parquet,
Oh, right! I think it was user@ at the time I wrote my first message but
it's clear now!
Thanks Sean,
On 2 December 2015 at 11:56, Sean Owen wrote:
> Same, not sure if anyone handles this particularly but I'll do it.
> This should go to dev@; I think we just put a note on
Same, not sure if anyone handles this particularly but I'll do it.
This should go to dev@; I think we just put a note on that wiki.
On Wed, Dec 2, 2015 at 10:53 AM, Adrien Mogenet
wrote:
> Hi folks,
>
> You're probably busy, but any update on this? :)
>
>
> On
I meant there is no streaming tab at all. It looks like I need version 1.6
Patcharee
On 02 Dec 2015 11:34, Steve Loughran wrote:
The history UI doesn't update itself for live apps (SPARK-7889), though I'm
working on it.
Are you trying to view a running streaming job?
On 2 Dec 2015, at
Does anyone have a pointer to Jupyter configuration with pyspark? The current
material on the IPython notebook is out of date, and Jupyter ignores IPython
profiles.
Thank you,
I am using spark-1.5.1-bin-hadoop2.6. I used
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster. Any idea how I
can upgrade to 1.5.2 prebuilt binary?
Also if I choose to build the binary, how would I upgrade my cluster?
Kind regards
Andy
The referenced link seems to be w.r.t. Hive on Spark which is still in its
own branch of Hive.
FYI
On Tue, Dec 1, 2015 at 11:23 PM, 张炜 wrote:
> Hello Ted and all,
> We are using Hive 1.2.1 and Spark 1.5.1
> I also noticed that there are other users reporting this
Have you taken a look
at streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java,
especially testUpdateStateByKeyWithInitial()?
Cheers
On Wed, Dec 2, 2015 at 2:54 AM, JayKay wrote:
> I'm new to Apache Spark and an absolute beginner. I'm playing around
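For the Scala side, a minimal sketch of updateStateByKey with an initial state RDD, assuming the goal is to carry state (for example a sketch object) across batches; the stream, keys, and counts below are made up:

import org.apache.spark.HashPartitioner

// events is a hypothetical DStream[(String, Long)].
val initialRDD = ssc.sparkContext.parallelize(Seq(("seen-before", 5L)))

val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
  (values, state) => Some(values.sum + state.getOrElse(0L))

val counts = events.updateStateByKey(
  updateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialRDD)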
The pyspark app stdout/err log shows this oddity.
Traceback (most recent call last):
File "/root/spark/notebooks/ingest/XXX.py", line 86, in
print pdfRDD.collect()[:5]
File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 773,
in collect
File
You can get 1.6.0-RC1 from
http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
currently, but it's not the last release version.
2015-12-02 23:57 GMT+08:00 Vishnu Viswanath :
> Thank you Yanbo,
>
> It looks like this is available in 1.6 version
Hi all,
Wondering if someone can provide some insight into why this pyspark app is
just hanging. Here is the output.
...
15/12/03 01:47:05 INFO TaskSetManager: Starting task 21.0 in stage 0.0
(TID 21, 10.65.143.174, PROCESS_LOCAL, 1794787 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task
Thank you.
On Wed, Dec 2, 2015 at 8:12 PM, Yanbo Liang wrote:
> You can get 1.6.0-RC1 from
> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
> currently, but it's not the last release version.
>
> 2015-12-02 23:57 GMT+08:00 Vishnu Viswanath