Hi,
Does anybody already use version 1.4.1 in production? Any problems?
thanks in advance
Not in the URI, but you can specify it in the Hadoop configuration:
<property>
  <name>fs.s3a.endpoint</name>
  <description>AWS S3 endpoint to connect to. An up-to-date list is
    provided in the AWS Documentation: regions and endpoints. Without this
    property, the standard region (s3.amazonaws.com) is assumed.</description>
</property>
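If you prefer setting it from code instead of core-site.xml, a minimal sketch on the SparkContext's Hadoop configuration (the region endpoint below is only an example value):

// minimal sketch; the endpoint value is only an example
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-eu-west-1.amazonaws.com")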
OK, got some head start:
pull the git source to 14719b93ff4ea7c3234a9389621be3c97fa278b9 (the first
release, so that I could at least build it),
then build it according to README.md,
then get Eclipse set up with Scala IDE,
then create a new Scala project, set the project directory to be
Also, one peculiar difference vs. Hadoop MR is that the partition/split/part
of an RDD is as much an operation as it is data, since an RDD is associated
with a transformation and a lineage of all its ancestor RDDs. So when the
partition is transferred to a new executor/worker (potentially on another
Hi everyone
I have tried to achieve hierarchy-based (index mode) top-N creation
using a Spark query. It takes more time when I execute the following query:
SELECT SUM(`adventurepersoncontacts`.`contactid`) AS
`adventurepersoncontacts_contactid`,
`adventurepersoncontacts`.`fullname` AS
Just make sure there is no firewall/network blocking the requests, as it's
complaining about a timeout.
Thanks
Best Regards
On Mon, Jul 20, 2015 at 1:14 AM, ankit tyagi ankittyagi.mn...@gmail.com
wrote:
Just to add more information. I have checked the status of this file, not
a single block is
Jorn meant something like this:
val filteredStream = twitterStream.transform { rdd =>
  val newRDD = scc.sc.textFile("/this/file/will/be/updated/frequently").map(x => (x, 1))
  rdd.join(newRDD)
}
newRDD will work like a filter when you do the join.
Thanks
Best Regards
On Sun, Jul 19, 2015 at 9:32
Has there been any progress on this? I am in the same boat.
I posted a similar question to Stack Exchange.
http://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again
Hi Cody,
Thanks for your help. It seems there's something wrong with some messages
within my Kafka topics then. I don't understand how I can get bigger or
incomplete messages, since I use the default configuration to accept only 1 MB
messages in my Kafka topic. If you have any other information or
Update: I have managed to use df.rdd to complete the ES integration, but I
would prefer df.write. Is it possible or upcoming?
On 18 Jul 2015 23:19, ayan guha guha.a...@gmail.com wrote:
Hi
I am trying to use a DataFrame and save it to Elasticsearch using the new
Hadoop API (because I am using Python). Can anyone
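For reference, on the Scala side the elasticsearch-hadoop connector exposes a DataFrame data source; assuming that jar is on the classpath, a hedged sketch looks roughly like this (the index/type name is a placeholder, and I haven't checked the Python path):

// hedged sketch, assuming the elasticsearch-hadoop (elasticsearch-spark) jar is on the classpath;
// "myindex/mytype" is only a placeholder
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.resource", "myindex/mytype")
  .save()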
Hi,
My data is constructed from a lot of small files, which results in a lot of
partitions per RDD.
Is there some way to locally repartition the RDD without shuffling, so that
all of the partitions that reside on a specific node will become X
partitions on the same node?
Thank you.
Daniel
Hi all!
I have a MatrixFactorizationModel object. If I try to recommend products to a
single user right after constructing the model through ALS.train(...), it takes
300 ms (for my data and hardware). But if I save the model to disk and load it back,
the recommendation takes almost 2000 ms. Also
Hi,
I am new to Apache Spark. I am trying to parse nested json using pyspark.
Here is the code by which I am trying to parse Json.
I am using Apache Spark 1.2.0 version of cloudera CDH 5.3.2.
lines = sc.textFile(inputFile)
import json
def func(x):
    json_str = json.loads(x)
    if json_str['label']:
Hi, thank you for your answer, but I was talking about function references.
I want to transform an RDD using a function consisting of multiple
transforms.
For example
def transformFunc1(rdd: RDD[Int]): RDD[Int] = {
}
val rdd2 = transformFunc1(rdd1)...
here I am using a reference, I think, but I
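If the goal is to pass around one function built from several RDD-to-RDD steps, a minimal sketch of composing them by reference (transformFunc2 and both bodies are made up for illustration):

import org.apache.spark.rdd.RDD

// minimal sketch; transformFunc2 and the bodies are hypothetical
def transformFunc1(rdd: RDD[Int]): RDD[Int] = rdd.map(_ + 1)
def transformFunc2(rdd: RDD[Int]): RDD[Int] = rdd.filter(_ % 2 == 0)

val composed: RDD[Int] => RDD[Int] = (transformFunc1 _).andThen(transformFunc2 _)
val rdd2 = composed(rdd1)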
Hello,
I'm trying to run LDA on a relatively large dataset (100-200 GB in size), but
with no luck so far.
At first I made sure that the executors have enough memory with respect to
the vocabulary size and number of topics.
After that I ran LDA with the default EMLDAOptimizer, but learning failed after
Hi
1. I am using Spark Streaming 1.3 for reading from a Kafka queue and pushing
events to an external source.
I passed 20 executors in my job, but it is showing only 6 in the executors tab.
When I used the high-level consumer with streaming 1.2, it showed 20 executors. My cluster
is a 10-node YARN cluster with each node
Thanks Sujee :)
Hi Daniel,
Take a look at .coalesce()
I’ve seen good results by coalescing to num executors * 10, but I’m still
trying to figure out the
optimal number of partitions per executor.
To get the number of executors: sc.getConf.getInt("spark.executor.instances", -1)
Cheers,
Doug
On Jul 20, 2015,
Hi All,
I am working with Spark 1.4 in a Windows environment. I have to set the event log
directory so that I can reopen the Spark UI after the application has finished.
But I am not able to set spark.eventLog.dir; it gives an error in the Windows
environment.
The configuration is:
<entry key="spark.eventLog.enabled">
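For reference, a minimal sketch of setting this programmatically (the Windows path is only an example, the directory must already exist, and the same keys can also go in spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

// minimal sketch; the local path below is only an example
val conf = new SparkConf()
  .setAppName("event-log-example")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:///C:/tmp/spark-events")
val sc = new SparkContext(conf)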
Does Apache Spark support RabbitMQ? I have messages on RabbitMQ and I want
to process them using Apache Spark Streaming. Does it scale?
Regards
Jeetendra
There is one package available on the spark-packages site,
http://spark-packages.org/package/Stratio/RabbitMQ-Receiver
The source is here:
https://github.com/Stratio/RabbitMQ-Receiver
Not sure whether that meets your needs or not.
-Todd
On Mon, Jul 20, 2015 at 8:52 AM, Jeetendra Gangele
Is coalesce not applicable to a Kafka stream? How can I do coalesce on a
Kafka direct stream? It's not there in the API.
Will calling repartition on the direct stream, with the number of executors as
numPartitions, improve performance?
In 1.3, do tasks get launched for partitions which are empty? Does the driver
make
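For what it's worth, DStream does expose repartition, so a hedged sketch of the approach you describe (the partition count is illustrative, and repartition does shuffle each batch):

// hedged sketch; directKafkaStream is your existing direct stream, 20 is only an example count
val repartitioned = directKafkaStream.repartition(20)
repartitioned.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // process only non-empty batches here
  }
}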
Hi,
I am having trouble with Joda Time in a Spark application and have seen by now that I
am not the only one (it generally seems to have to do with serialization and
internal caches of the Joda Time objects).
Is there a known best practice to work around these issues?
Jan
Thanks Todd,
I'm not sure whether somebody has used it or not. Can somebody confirm whether
this integrates nicely with Spark Streaming?
On 20 July 2015 at 18:43, Todd Nist tsind...@gmail.com wrote:
There is one package available on the spark-packages site,
I'd try logging the offsets for each message, see where problems start,
then try using the console consumer starting at those offsets and see if
you can reproduce the problem.
On Mon, Jul 20, 2015 at 2:15 AM, Nicolas Phung nicolas.ph...@gmail.com
wrote:
Hi Cody,
Thanks for you help. It seems
I'm running a Spark cluster and I'd like to access the Spark UI from
outside the LAN. The problem is all the links are to internal IP addresses.
Is there any way to configure hostnames for each of the hosts in the cluster
and use those for the links?
Hi community,
I have written a Spark k-means app. Now I run it on a cluster.
My job starts, and at iteration nine or ten the process stops.
In the Spark dashboard it is shown as running the whole time, but nothing
happens, no exceptions.
My setting is the following:
1000 input points
k=10
maxIterations=30
a tree
Thanks Doug,
coalesce might invoke a shuffle as well.
I don't think what I'm suggesting is a feature but it definitely should be.
Daniel
On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog d...@balog.net wrote:
Hi Daniel,
Take a look at .coalesce()
I’ve seen good results by coalescing to num
Could you try SQLContext.read.json()?
On Mon, Jul 20, 2015 at 9:06 AM, Davies Liu dav...@databricks.com wrote:
Before using the json file as text file, can you make sure that each
json string can fit in one line? Because textFile() will split the
file by '\n'
On Mon, Jul 20, 2015 at 3:26 AM,
Before using the json file as text file, can you make sure that each
json string can fit in one line? Because textFile() will split the
file by '\n'
On Mon, Jul 20, 2015 at 3:26 AM, Ajay ajay0...@gmail.com wrote:
Hi,
I am new to Apache Spark. I am trying to parse nested json using pyspark.
The following query on the MovieLens dataset is sorting by the count of
ratings for a movie. It looks like the results are ordered by partition?
scala> val results = sqlContext.sql("select movies.title, movierates.maxr,
movierates.minr, movierates.cntu from (SELECT ratings.product,
Hey Jan,
Can you provide more details on the serialization and cache issues.
If you are looking for datetime functionality with spark-sql please
consider: https://github.com/SparklineData/spark-datetime It provides a
simple way to combine joda datetime expressions with spark sql.
regards,
Hi, I am trying to find the correct way to use the Spark Streaming API
streamingContext.fileStream(String, ClassK, ClassV, ClassF).
I tried to find an example but could not find one anywhere in the Spark
documentation. I have to stream files in HDFS which are of a custom Hadoop
format.
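In Scala the generic form takes the key, value and input-format types as type parameters; a minimal sketch with the standard new-API TextInputFormat (swap in your custom format class; the HDFS path is only an example):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// minimal sketch; substitute your own key/value types and custom InputFormat
val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///data/incoming")
val lines = stream.map { case (_, text) => text.toString }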
I had a similar issue with Spark 1.3.
After migrating to Spark 1.4 and using sqlContext.read.json, it worked well.
I think you can look at the DataFrame select and explode options to read the
nested JSON elements, arrays, etc.
Thanks.
On Mon, Jul 20, 2015 at 11:07 AM, Davies Liu dav...@databricks.com
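For reference, a hedged sketch of that in Scala (the path and field names are made up; array columns can then be flattened with explode):

// hedged sketch; path and field names are placeholders
val df = sqlContext.read.json("/path/to/input.json")
df.printSchema()
val selected = df.select("label", "payload.nested_field")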
When attempting to write a DataFrame to SQL Server that contains
java.sql.Timestamp or java.lang.Boolean objects, I get errors about the query
that is formed being invalid. Specifically, java.sql.Timestamp objects try to
be written as the Timestamp type, which is not appropriate for date/time
Yeah, in the function you supply for the messageHandler parameter to
createDirectStream, catch the exception and do whatever makes sense for
your application.
On Mon, Jul 20, 2015 at 11:58 AM, Nicolas Phung nicolas.ph...@gmail.com
wrote:
Hello,
Using the old Spark Streaming Kafka API, I got
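A hedged sketch of that shape, assuming string keys/values and that kafkaParams and fromOffsets are already set up as in your existing job; bad records are simply dropped:

import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// hedged sketch; ssc, kafkaParams and fromOffsets come from your existing setup
val handler = (mmd: MessageAndMetadata[String, String]) =>
  try { Some(mmd.message()) } catch { case e: Exception => None }

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Option[String]](
  ssc, kafkaParams, fromOffsets, handler)

val good = stream.flatMap(_.toSeq)   // keep only records that decoded cleanly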
I've searched high and low to use broadcast variables in R.
Is it possible at all? I don't see them mentioned in the SparkR API.
Or is there another way of using this feature?
I need to share a large amount of data between executors.
At the moment, I get warned about my task being too large.
I
Thanks Davies, that resolves the issue with Python.
I was using the Java/Scala DataFrame documentation
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrameWriter.html
and assuming that it was the same for PySpark
Great explanation.
Thanks guys!
Daniel
On 20 ביולי 2015, at 18:12, Silvio Fiorito silvio.fior...@granturing.com
wrote:
Hi Daniel,
Coalesce, by default will not cause a shuffle. The second parameter when set
to true will cause a full shuffle. This is actually what repartition does
Thanks, that is what I was looking for...
Any idea where I have to store and reference the corresponding
hadoop-aws-2.6.0.jar? I get:
java.io.IOException: No FileSystem for scheme: s3n
2015-07-20 8:33 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:
Not in the uri, but in the hadoop configuration
Hi Daniel,
Coalesce, by default will not cause a shuffle. The second parameter when set to
true will cause a full shuffle. This is actually what repartition does (calls
coalesce with shuffle=true).
It will attempt to keep colocated partitions together (as you describe) on the
same executor.
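A short sketch of the difference (partition counts are only example values):

val merged     = rdd.coalesce(100)                  // no shuffle; merges co-located partitions
val reshuffled = rdd.coalesce(100, shuffle = true)  // full shuffle
val same       = rdd.repartition(100)               // equivalent to the line above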
Cool, I tried that as well, and it doesn't seem different:
spark.yarn.jar seems set
(screenshot omitted)
This actually doesn't change the classpath, not sure if it should:
(screenshot omitted)
But the same netlib warning.
Thanks for the help!
- Arun
On Fri, Jul 17, 2015 at 3:18 PM, Sandy
Sorry for the confusion. What are the other issues?
On Mon, Jul 20, 2015 at 8:26 AM, Young, Matthew T
matthew.t.yo...@intel.com wrote:
Thanks Davies, that resolves the issue with Python.
I was using the Java/Scala DataFrame documentation
Hello,
Using the old Spark Streaming Kafka API, I got the following around the
same offset:
kafka.message.InvalidMessageException: Message is corrupt (stored crc =
3561357254, computed crc = 171652633)
at kafka.message.Message.ensureValid(Message.scala:166)
at
Can you post details on how to reproduce the NPE
On Mon, Jul 20, 2015 at 1:19 PM, algermissen1971 algermissen1...@icloud.com
wrote:
Hi Harish,
On 20 Jul 2015, at 20:37, Harish Butani rhbutani.sp...@gmail.com wrote:
Hey Jan,
Can you provide more details on the serialization and cache
I'm trying to keep track of some information in an RDD.flatMap() function
(using the Java API in 1.4.0). I have two longs in the function, and I am
incrementing them when appropriate and checking their values to determine
how many objects to output from the function. I'm not trying to read the
LDAOptimizer.scala:421 collects to the driver a numTopics-by-vocabSize matrix
of summary statistics. I suspect that this is what's causing the failure.
One thing you may try doing is decreasing the vocabulary size. One
possibility would be to use a HashingTF if you don't mind dimension
reduction via
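A hedged sketch of that idea (the 2^18 feature count and the tokenizedDocs RDD of token sequences are assumptions):

import org.apache.spark.mllib.feature.HashingTF

// hedged sketch; tokenizedDocs: RDD[Seq[String]] and the feature count are assumptions
val hashingTF = new HashingTF(numFeatures = 1 << 18)
val corpus = tokenizedDocs
  .map(tokens => hashingTF.transform(tokens))
  .zipWithIndex()
  .map { case (vec, id) => (id, vec) }   // (docId, termCountVector) pairs for LDA.run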
Thanks Dean…
I was building based on the information found in the Spark 1.4.1 documentation.
So I have to ask the following:
Shouldn’t the examples be updated to reflect Hadoop 2.6, or are the vendors’
distros not up to 2.6 and that’s why it’s still showing 2.4?
Also I’m trying to build with
On 20 Jul 2015, at 23:20, Harish Butani rhbutani.sp...@gmail.com wrote:
Can you post details on how to reproduce the NPE
Essentially it is like this:
I have a Scala case class that contains a Joda DateTime attribute, and instances
of this class are updated using updateStateByKey. When a
Hi Serge,
The broadcast function was made private when SparkR merged into Apache
Spark for the 1.4.0 release. You can still use broadcast by specifying the
private namespace though.
SparkR:::broadcast(sc, obj)
The RDD methods were considered very low-level, and the SparkR devs are
still
An ORDER BY needs to be on the outermost query otherwise subsequent
operations (such as the join) could reorder the tuples.
On Mon, Jul 20, 2015 at 9:25 AM, Carol McDonald cmcdon...@maprtech.com
wrote:
the following query on the Movielens dataset , is sorting by the count of
ratings for a
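A hedged sketch of that reordering, using the table names from the thread (the join condition and registered temp tables are assumptions):

// hedged sketch; movierates and movies are assumed to be registered temp tables
val results = sqlContext.sql("""
  SELECT movies.title, movierates.maxr, movierates.minr, movierates.cntu
  FROM movierates JOIN movies ON movierates.product = movies.product
  ORDER BY movierates.cntu DESC
""")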
hadoop-2.6 is supported (look for profile XML in the pom.xml file).
For Hive, add -Phive -Phive-thriftserver (see
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
for more details).
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
In master (as well as 1.4.1) I don't see a hive profile in pom.xml.
I do find a hive-provided profile, though.
FYI
On Mon, Jul 20, 2015 at 1:05 PM, Dean Wampler deanwamp...@gmail.com wrote:
hadoop-2.6 is supported (look for profile XML in the pom.xml file).
For Hive, add -Phive
Hi Harish,
On 20 Jul 2015, at 20:37, Harish Butani rhbutani.sp...@gmail.com wrote:
Hey Jan,
Can you provide more details on the serialization and cache issues.
My symptom is that I have a Joda DateTime on which I can call toString and
getMillis without problems, but when I call getYear I
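One workaround I have seen, offered only as a sketch rather than a best practice: keep the timestamp as epoch millis in the state class and rebuild the DateTime lazily, so Joda's internal chronology caches never travel through serialization.

import org.joda.time.DateTime

// sketch of the workaround: serialize only the Long, rebuild the DateTime on access
case class EventState(tsMillis: Long) {
  @transient lazy val ts: DateTime = new DateTime(tsMillis)
}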
Sorry,
Should have sent this to user…
However… it looks like the docs page may need some editing?
Thx
-Mike
Begin forwarded message:
From: Michael Segel msegel_had...@hotmail.com
Subject: Silly question about building Spark 1.4.1
Date: July 20, 2015 at 12:26:40 PM MST
To:
I figured this out after spelunking the UI code a little. The trick is to
set the SPARK_PUBLIC_DNS environmental variable to the public DNS name of
each server in the cluster, per node. I'm running in standalone mode, so it
was just a matter of adding the setting to spark-env.sh.
On Mon, Jul 20,
Thanks for the explanation.
If I understand this correctly, in this approach I would actually stream
everything from Twitter and perform the filtering in my application using
Spark. Isn't this too much overhead if my application is interested in
listening for a couple of hundred or a few thousand hashtags?
On
I'm new to Spark, any ideas would be much appreciated! Thanks
On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang jerrickho...@gmail.com
wrote:
Hi all,
I'm aware of the support for schema evolution via DataFrame API. Just
wondering what would be the best way to go about dealing with schema
I have Spark 1.4.1, running on a YARN cluster. When I do a pyspark,
in yarn-client mode:
pyspark --jars ~/dev/spark/lib/mysql-connector-java-5.1.36-bin.jar
--driver-class-path
~/dev/spark/lib/mysql-connector-java-5.1.36-bin.jar
and then do the equivalent of..
tbl =
Hi Jerry,
In fact, HashSet approach is what we took earlier. However, this did not
work with a Windowed DStream (i.e. if we provide a forward and inverse
reduce operation). The reason is that the inverse reduce tries to remove
values that may still exist elsewhere in the window and should not
All,
I really appreciate anyone's input on this. We have a very simple,
traditional OLAP query-processing use case. Our use case is as follows.
1. We have a customer sales order table data coming from RDBMs table.
2. There are many dimension columns in the sales order table. For each of
I responded to your question on SO. Let me know if this is what you wanted.
http://stackoverflow.com/a/31528274/2336943
Mohammed
-Original Message-
From: plazaster [mailto:michaelplaz...@gmail.com]
Sent: Sunday, July 19, 2015 11:38 PM
To: user@spark.apache.org
Subject: Re: Kmeans
Michael,
How would the Catalyst optimizer optimize this version?
df.filter(df("filter_field") === value).select("field1").show()
Would it still read all the columns in df or would it read only “filter_field”
and “field1” since only two columns are used (assuming other columns from df
are not used
Yes via: org.apache.spark.sql.catalyst.optimizer.ColumnPruning
See DefaultOptimizer.batches for list of logical rewrites.
You can see the optimized plan by printing: df.queryExecution.optimizedPlan
On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller moham...@glassbeam.com
wrote:
Michael,
How
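For reference, a small sketch of checking the pruning yourself (the parquet path and column names are made up):

// small sketch; path and column names are placeholders
val df = sqlContext.read.parquet("/tmp/example_table")
val q  = df.filter(df("filter_field") === "value").select("field1")
println(q.queryExecution.optimizedPlan)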
Thanks, Harish.
Mike – this would be a cleaner version for your use case:
df.filter(df("filter_field") === value).select("field1").show()
Mohammed
From: Harish Butani [mailto:rhbutani.sp...@gmail.com]
Sent: Monday, July 20, 2015 5:37 PM
To: Mohammed Guller
Cc: Michael Armbrust; Mike Trienis;
Definitely, thanks Mohammed.
On Mon, Jul 20, 2015 at 5:47 PM, Mohammed Guller moham...@glassbeam.com
wrote:
Thanks, Harish.
Mike – this would be a cleaner version for your use case:
df.filter(df("filter_field") === value).select("field1").show()
Mohammed
*From:* Harish Butani
Please see
http://spark.apache.org/docs/latest/programming-guide.html#local-vs-cluster-modes
Cheers
On Mon, Jul 20, 2015 at 3:21 PM, dlmar...@comcast.net wrote:
I’m trying to keep track of some information in a RDD.flatMap() function
(using Java API in 1.4.0). I have two longs in the
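If the two counters only need to be visible inside the task (for example to cap how many records each partition emits), one option, sketched here under that assumption, is to hold them as local variables inside mapPartitions instead of in the flatMap closure; for driver-visible totals, accumulators from the linked guide are the usual tool.

// hedged sketch; the 1000-record cap is only an example, and counts are per partition, not global
val limited = rdd.mapPartitions { iter =>
  var emitted = 0L
  iter.flatMap { record =>
    emitted += 1
    if (emitted <= 1000) Iterator(record) else Iterator.empty
  }
}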
Does Spark Streaming 1.3 launch a task for each partition offset range,
whether that range is 0 or not?
If yes, how can I enforce it not to launch tasks for empty RDDs? I am not able
to use coalesce on directKafkaStream.
Should we always enforce repartitioning before processing the direct stream?
use case
Hi there,
I would like to use Spark to access the data in MySQL. So first I tried to
run the program using:
spark-submit --class sparkwithscala.SqlApp --driver-class-path
/home/lib/mysql-connector-java-5.1.34.jar --master local[4] /home/myjar.jar
that returns me the correct results. Then I
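For reference, in 1.4 the DataFrame JDBC source looks roughly like this (URL, table and credentials are placeholders):

// hedged sketch; every connection detail below is a placeholder
val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://localhost:3306/mydb?user=USER&password=PASS",
  "dbtable" -> "my_table",
  "driver"  -> "com.mysql.jdbc.Driver"
)).load()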