1.4.1 in production

2015-07-20 Thread igor.berman
Hi, does somebody already use version 1.4.1 in production? Any problems? Thanks in advance.

Re: use S3-Compatible Storage with spark

2015-07-20 Thread Akhil Das
Not in the URI, but in the Hadoop configuration you can specify it: <property> <name>fs.s3a.endpoint</name> <description>AWS S3 endpoint to connect to. An up-to-date list is provided in the AWS Documentation: regions and endpoints. Without this property, the standard region (s3.amazonaws.com)
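
A minimal sketch of setting the same property from code rather than XML, assuming an S3-compatible provider (the endpoint, keys, and bucket below are placeholders):

    sc.hadoopConfiguration.set("fs.s3a.endpoint", "storage.example.com")   // provider endpoint
    sc.hadoopConfiguration.set("fs.s3a.access.key", "ACCESS_KEY")          // placeholder credentials
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "SECRET_KEY")
    val data = sc.textFile("s3a://my-bucket/path/")                        // illustrative bucket/path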

Re: how to start reading the spark source code?

2015-07-20 Thread Yang
OK, got some head start: pulled the git source at 14719b93ff4ea7c3234a9389621be3c97fa278b9 (the first release, so that I could at least build it), then built it according to README.md, then got Eclipse set up with the Scala IDE, then created a new Scala project, set the project directory to be

Re: how to start reading the spark source code?

2015-07-20 Thread Yang
Also, one peculiar difference vs. Hadoop MR is that the partition/split/part of an RDD is as much an operation as it is data, since an RDD is associated with a transformation and a lineage of all its ancestor RDDs. So when the partition is transferred to a new executor/worker (potentially on another

Hive Query(Top N)

2015-07-20 Thread Ravisankar Mani
Hi everyone, I have tried to achieve hierarchy-based (index mode) top-N creation using a Spark query. It takes more time when I execute the following query: Select SUM(`adventurepersoncontacts`.`contactid`) AS `adventurepersoncontacts_contactid`, `adventurepersoncontacts`.`fullname` AS

Re: Exception while triggering spark job from remote jvm

2015-07-20 Thread Akhil Das
Just make sure there is no firewall/network blocking the requests, as it's complaining about a timeout. Thanks Best Regards On Mon, Jul 20, 2015 at 1:14 AM, ankit tyagi ankittyagi.mn...@gmail.com wrote: Just to add more information. I have checked the status of this file; not a single block is

Re: How to restart Twitter spark stream

2015-07-20 Thread Akhil Das
Jorn meant something like this: val filteredStream = twitterStream.transform(rdd => { val newRDD = scc.sc.textFile("/this/file/will/be/updated/frequently").map(x => (x, 1)); rdd.join(newRDD) }) newRDD will work like a filter when you do the join. Thanks Best Regards On Sun, Jul 19, 2015 at 9:32
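
A fuller hedged sketch of the same pattern (twitter4j Status objects and the keyword file path are assumptions): re-reading the file inside transform() lets the "filter" change between batches without restarting the StreamingContext.

    val keywordFile = "/path/to/keywords.txt"   // hypothetical file, one keyword per line
    val keyed = twitterStream.flatMap(s => s.getHashtagEntities.map(h => (h.getText.toLowerCase, s)))
    val filtered = keyed.transform { rdd =>
      val keywords = rdd.sparkContext.textFile(keywordFile).map(k => (k.toLowerCase, ()))
      rdd.join(keywords).map { case (_, (status, _)) => status }   // the join acts as the filter
    }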

Re: Kmeans Labeled Point RDD

2015-07-20 Thread plazaster
Has there been any progress on this? I am in the same boat. I posted a similar question to Stack Exchange: http://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-20 Thread Nicolas Phung
Hi Cody, Thanks for your help. It seems there's something wrong with some messages within my Kafka topics then. I don't understand how I can get a bigger or incomplete message, since I use the default configuration to accept only 1 MB messages in my Kafka topic. If you have any other information or

Re: Using Dataframe write with newHdoopApi

2015-07-20 Thread ayan guha
Update: I have managed to use df.rdd to complete the ES integration, but I preferred df.write. Is it possible or upcoming? On 18 Jul 2015 23:19, ayan guha guha.a...@gmail.com wrote: Hi, I am trying to use a DF and save it to Elasticsearch using the newHadoopApi (because I am using Python). Can anyone

Local Repartition

2015-07-20 Thread Daniel Haviv
Hi, My data is constructed from a lot of small files, which results in a lot of partitions per RDD. Is there some way to locally repartition the RDD without shuffling, so that all of the partitions that reside on a specific node will become X partitions on the same node? Thank you. Daniel

Proper saving/loading of MatrixFactorizationModel

2015-07-20 Thread Petr Shestov
Hi all! I have a MatrixFactorizationModel object. If I try to recommend products to a single user right after constructing the model through ALS.train(...), it takes 300 ms (for my data and hardware). But if I save the model to disk and load it back, the recommendation takes almost 2000 ms. Also

PySpark Nested Json Parsing

2015-07-20 Thread Ajay
Hi, I am new to Apache Spark. I am trying to parse nested JSON using PySpark. Here is the code by which I am trying to parse the JSON. I am using Apache Spark 1.2.0 from Cloudera CDH 5.3.2. lines = sc.textFile(inputFile) import json def func(x): json_str = json.loads(x) if json_str['label']:

Re: Using reference for RDD is safe?

2015-07-20 Thread Mina
Hi, thank you for your answer, but I was talking about function references. I want to transform an RDD using a function consisting of multiple transforms. For example: def transformFunc1(rdd: RDD[Int]): RDD[Int] = { } val rdd2 = transformFunc1(rdd1)... Here I am using a reference, I think, but I

LDA on a large dataset

2015-07-20 Thread Peter Zvirinsky
Hello, I'm trying to run LDA on a relatively large dataset (100-200 GB), but with no luck so far. At first I made sure that the executors have enough memory with respect to the vocabulary size and number of topics. After that I ran LDA with the default EMLDAOptimizer, but learning failed after

spark streaming 1.3 issues

2015-07-20 Thread Shushant Arora
Hi, 1. I am using Spark Streaming 1.3 for reading from a Kafka queue and pushing events to an external source. I passed 20 executors to my job, but it is showing only 6 in the executors tab. When I used the high-level consumer with streaming 1.2, it showed 20 executors. My cluster is a 10-node YARN cluster with each node

Re: JdbcRDD and ClassTag issue

2015-07-20 Thread nitinkalra2000
Thanks Sujee :)

Re: Local Repartition

2015-07-20 Thread Doug Balog
Hi Daniel, Take a look at .coalesce(). I've seen good results by coalescing to num executors * 10, but I'm still trying to figure out the optimal number of partitions per executor. To get the number of executors: sc.getConf.getInt("spark.executor.instances", -1) Cheers, Doug On Jul 20, 2015,
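
A minimal sketch of that suggestion (rdd is assumed to be the many-partition RDD from the original question):

    val numExecutors = sc.getConf.getInt("spark.executor.instances", -1)
    val coalesced =
      if (numExecutors > 0) rdd.coalesce(numExecutors * 10)   // narrow dependency, no shuffle by default
      else rdd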

Apache Spark : spark.eventLog.dir on Windows Environment

2015-07-20 Thread nitinkalra2000
Hi All, I am working with Spark 1.4 on a Windows environment. I have to set the event log directory so that I can reopen the Spark UI after the application has finished. But I am not able to set eventLog.dir; it gives an error on Windows. The configuration is: <entry key="spark.eventLog.enabled">
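
One possible workaround (an assumption, not confirmed in this thread) is to give the event log directory as a file: URI with forward slashes; a hedged SparkConf sketch, with an illustrative directory that must already exist:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "file:///C:/tmp/spark-events")   // assumed local path
    val sc = new org.apache.spark.SparkContext(conf)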

Does Spark streaming support is there with RabbitMQ

2015-07-20 Thread Jeetendra Gangele
Does Apache Spark support RabbitMQ? I have messages on RabbitMQ and I want to process them using Apache Spark Streaming. Does it scale? Regards, Jeetendra

Re: Does Spark streaming support is there with RabbitMQ

2015-07-20 Thread Todd Nist
There is one package available on the spark-packages site: http://spark-packages.org/package/Stratio/RabbitMQ-Receiver The source is here: https://github.com/Stratio/RabbitMQ-Receiver Not sure whether that meets your needs or not. -Todd On Mon, Jul 20, 2015 at 8:52 AM, Jeetendra Gangele

Re: spark streaming 1.3 issues

2015-07-20 Thread Shushant Arora
Is coalesce not applicable to a Kafka stream? How do I do coalesce on a Kafka direct stream? It's not there in the API. Will calling repartition on the direct stream, with the number of executors as numPartitions, improve performance? In 1.3, do tasks get launched for partitions which are empty? Does the driver make

Joda Time best practice?

2015-07-20 Thread algermissen1971
Hi, I am having trouble with Joda Time in a Spark application and saw by now that I am not the only one (generally seems to have to do with serialization and internal caches of the Joda Time objects). Is there a known best practice to work around these issues? Jan

Re: Does Spark streaming support is there with RabbitMQ

2015-07-20 Thread Jeetendra Gangele
Thanks Todd, I'm not sure whether somebody has used it or not. Can somebody confirm whether this integrates nicely with Spark Streaming? On 20 July 2015 at 18:43, Todd Nist tsind...@gmail.com wrote: There is one package available on the spark-packages site,

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-20 Thread Cody Koeninger
I'd try logging the offsets for each message, see where problems start, then try using the console consumer starting at those offsets and see if you can reproduce the problem. On Mon, Jul 20, 2015 at 2:15 AM, Nicolas Phung nicolas.ph...@gmail.com wrote: Hi Cody, Thanks for you help. It seems

Web UI Links

2015-07-20 Thread Bob Corsaro
I'm running a Spark cluster and I'd like to access the Spark UI from outside the LAN. The problem is all the links are to internal IP addresses. Is there any way to configure hostnames for each of the hosts in the cluster and use those for the links?

k-means iteration not terminate

2015-07-20 Thread Pa Rö
Hi community, I have written a Spark k-means app. Now I run it on a cluster. My job starts, and at iteration nine or ten the process stops. In the Spark dashboard it is shown as running the whole time, but nothing happens and there are no exceptions. My settings are the following: 1000 input points, k=10, maxIteration=30, a tree

Re: Local Repartition

2015-07-20 Thread Daniel Haviv
Thanks Doug, but coalesce might invoke a shuffle as well. I don't think what I'm suggesting is a feature, but it definitely should be. Daniel On Mon, Jul 20, 2015 at 4:15 PM, Doug Balog d...@balog.net wrote: Hi Daniel, Take a look at .coalesce() I've seen good results by coalescing to num

Re: PySpark Nested Json Parsing

2015-07-20 Thread Davies Liu
Could you try SQLContext.read.json()? On Mon, Jul 20, 2015 at 9:06 AM, Davies Liu dav...@databricks.com wrote: Before using the json file as text file, can you make sure that each json string can fit in one line? Because textFile() will split the file by '\n' On Mon, Jul 20, 2015 at 3:26 AM,

Re: PySpark Nested Json Parsing

2015-07-20 Thread Davies Liu
Before using the json file as text file, can you make sure that each json string can fit in one line? Because textFile() will split the file by '\n' On Mon, Jul 20, 2015 at 3:26 AM, Ajay ajay0...@gmail.com wrote: Hi, I am new to Apache Spark. I am trying to parse nested json using pyspark.

dataframes sql order by not total ordering

2015-07-20 Thread Carol McDonald
The following query on the MovieLens dataset is sorting by the count of ratings for a movie. It looks like the results are ordered by partition? scala> val results = sqlContext.sql(select movies.title, movierates.maxr, movierates.minr, movierates.cntu from (SELECT ratings.product,

Re: Joda Time best practice?

2015-07-20 Thread Harish Butani
Hey Jan, Can you provide more details on the serialization and cache issues? If you are looking for datetime functionality with Spark SQL, please consider: https://github.com/SparklineData/spark-datetime It provides a simple way to combine Joda datetime expressions with Spark SQL. Regards,

What is the correct syntax of using Spark streamingContext.fileStream()?

2015-07-20 Thread unk1102
Hi, I am trying to find the correct way to use the Spark Streaming API streamingContext.fileStream(String, Class<K>, Class<V>, Class<F>). I tried to find an example but could not find one anywhere in the Spark documentation. I have to stream files in HDFS which are of a custom Hadoop format.
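
A hedged Scala example of the signature (the key, value, and input-format classes below are stand-ins; substitute the custom Hadoop format classes):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///path/to/incoming")
    val lines  = stream.map { case (_, text) => text.toString }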

Re: PySpark Nested Json Parsing

2015-07-20 Thread Naveen Madhire
I had a similar issue with Spark 1.3. After migrating to Spark 1.4 and using sqlContext.read.json it worked well. I think you can look at the DataFrame select and explode options to read the nested JSON elements, arrays, etc. Thanks. On Mon, Jul 20, 2015 at 11:07 AM, Davies Liu dav...@databricks.com
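
A hedged sketch of that approach in Scala (the JSON field names "label" and "items" are invented for illustration):

    import org.apache.spark.sql.functions.explode

    val df = sqlContext.read.json("hdfs:///path/to/input.json")   // expects one JSON object per line
    df.printSchema()

    // Flatten a nested array field, then select nested attributes with dotted column names.
    val items = df.select(df("label"), explode(df("items")).as("item"))
    items.select("label", "item.name", "item.value").show()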

RE: Spark and SQL Server

2015-07-20 Thread Young, Matthew T
When attempting to write a DataFrame to SQL Server that contains java.sql.Timestamp or java.lang.Boolean objects, I get errors about the query that is formed being invalid. Specifically, java.sql.Timestamp objects try to be written as the Timestamp type, which is not appropriate for date/time

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-20 Thread Cody Koeninger
Yeah, in the function you supply for the messageHandler parameter to createDirectStream, catch the exception and do whatever makes sense for your application. On Mon, Jul 20, 2015 at 11:58 AM, Nicolas Phung nicolas.ph...@gmail.com wrote: Hello, Using the old Spark Streaming Kafka API, I got
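
A minimal sketch of that idea (kafkaParams, fromOffsets, and the decision to drop bad records are assumptions, not the poster's exact code):

    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
      try {
        Some(mmd.message())            // normal case: return the decoded payload
      } catch {
        case e: Exception => None      // bad message: drop it (or log topic/partition/offset)
      }

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Option[String]](
      ssc, kafkaParams, fromOffsets, messageHandler)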

Broadcast variables in R

2015-07-20 Thread Serge Franchois
I've searched high and low for how to use broadcast variables in R. Is it possible at all? I don't see them mentioned in the SparkR API. Or is there another way of using this feature? I need to share a large amount of data between executors. At the moment, I get warned about my task being too large. I

RE: Spark and SQL Server

2015-07-20 Thread Young, Matthew T
Thanks Davies, that resolves the issue with Python. I was using the Java/Scala DataFrame documentation https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrameWriter.html and assuming that it was the same for PySpark

Re: Local Repartition

2015-07-20 Thread Daniel Haviv
Great explanation. Thanks guys! Daniel On 20 July 2015, at 18:12, Silvio Fiorito silvio.fior...@granturing.com wrote: Hi Daniel, Coalesce, by default will not cause a shuffle. The second parameter when set to true will cause a full shuffle. This is actually what repartition does

Re: use S3-Compatible Storage with spark

2015-07-20 Thread Schmirr Wurst
Thanks, that is what I was looking for... Any idea where I have to store and reference the corresponding hadoop-aws-2.6.0.jar? I get: java.io.IOException: No FileSystem for scheme: s3n 2015-07-20 8:33 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: Not in the uri, but in the hadoop configuration

Re: Local Repartition

2015-07-20 Thread Silvio Fiorito
Hi Daniel, Coalesce, by default, will not cause a shuffle. The second parameter, when set to true, will cause a full shuffle. This is actually what repartition does (it calls coalesce with shuffle=true). It will attempt to keep colocated partitions together (as you describe) on the same executor.
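
A small illustration of the difference (the partition counts are arbitrary):

    val narrowed = rdd.coalesce(50)                  // merges partitions locally, no shuffle
    val shuffled = rdd.coalesce(50, shuffle = true)  // full shuffle
    val same     = rdd.repartition(50)               // equivalent to coalesce(50, shuffle = true)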

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-20 Thread Arun Ahuja
Cool, I tried that as well, and it doesn't seem different: spark.yarn.jar seems set [image: Inline image 1]. This actually doesn't change the classpath; not sure if it should: [image: Inline image 3]. But same netlib warning. Thanks for the help! - Arun On Fri, Jul 17, 2015 at 3:18 PM, Sandy

Re: Spark and SQL Server

2015-07-20 Thread Davies Liu
Sorry for the confusion. What are the other issues? On Mon, Jul 20, 2015 at 8:26 AM, Young, Matthew T matthew.t.yo...@intel.com wrote: Thanks Davies, that resolves the issue with Python. I was using the Java/Scala DataFrame documentation

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-20 Thread Nicolas Phung
Hello, Using the old Spark Streaming Kafka API, I got the following around the same offset: kafka.message.InvalidMessageException: Message is corrupt (stored crc = 3561357254, computed crc = 171652633) at kafka.message.Message.ensureValid(Message.scala:166) at

Re: Joda Time best practice?

2015-07-20 Thread Harish Butani
Can you post details on how to reproduce the NPE? On Mon, Jul 20, 2015 at 1:19 PM, algermissen1971 algermissen1...@icloud.com wrote: Hi Harish, On 20 Jul 2015, at 20:37, Harish Butani rhbutani.sp...@gmail.com wrote: Hey Jan, Can you provide more details on the serialization and cache

Increment counter variable in RDD transformation function

2015-07-20 Thread dlmarion
I'm trying to keep track of some information in an RDD.flatMap() function (using the Java API in 1.4.0). I have two longs in the function, and I am incrementing them when appropriate, and checking their values to determine how many objects to output from the function. I'm not trying to read the

Re: LDA on a large dataset

2015-07-20 Thread Feynman Liang
LDAOptimizer.scala:421 collects to the driver a numTopics-by-vocabSize matrix of summary statistics. I suspect that this is what's causing the failure. One thing you may try doing is decreasing the vocabulary size. One possibility would be to use a HashingTF if you don't mind dimension reduction via
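
A hedged sketch of capping the vocabulary with HashingTF before LDA (docs is assumed to be an RDD of tokenized documents; the feature count and k are illustrative):

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.feature.HashingTF

    val hashingTF = new HashingTF(1 << 18)          // hashes terms into 2^18 buckets
    val tf = hashingTF.transform(docs)              // docs: RDD[Seq[String]]
    val corpus = tf.zipWithIndex.map(_.swap)        // LDA expects RDD[(Long, Vector)]
    val model = new LDA().setK(20).run(corpus)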

Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Thanks Dean… I was building based on the information found in the Spark 1.4.1 documentation. So I have to ask the following: shouldn't the examples be updated to reflect Hadoop 2.6, or are the vendors' distros not up to 2.6, and is that why it's still showing 2.4? Also I'm trying to build with

Re: Joda Time best practice?

2015-07-20 Thread algermissen1971
On 20 Jul 2015, at 23:20, Harish Butani rhbutani.sp...@gmail.com wrote: Can you post details on how to reproduce the NPE Essentially it is like this: I have a Scala case class that contains a Joda DateTime attribute, and instances of this class are updated using updateStateByKey. When a

Re: Broadcast variables in R

2015-07-20 Thread Eskilson,Aleksander
Hi Serge, The broadcast function was made private when SparkR merged into Apache Spark for the 1.4.0 release. You can still use broadcast by specifying the private namespace though. SparkR:::broadcast(sc, obj) The RDD methods were considered very low-level, and the SparkR devs are still

Re: dataframes sql order by not total ordering

2015-07-20 Thread Michael Armbrust
An ORDER BY needs to be on the outermost query; otherwise subsequent operations (such as the join) could reorder the tuples. On Mon, Jul 20, 2015 at 9:25 AM, Carol McDonald cmcdon...@maprtech.com wrote: the following query on the Movielens dataset , is sorting by the count of ratings for a
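
A hedged illustration (the table and column names follow the MovieLens thread above, but the exact schema is assumed): the ORDER BY is applied once, on the outermost query, after the join, so the final result is totally ordered.

    val results = sqlContext.sql("""
      SELECT movies.title, movierates.maxr, movierates.minr, movierates.cntu
      FROM movierates JOIN movies ON movierates.product = movies.movieid
      ORDER BY movierates.cntu DESC
    """)
    results.show()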

Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Dean Wampler
hadoop-2.6 is supported (look for profile XML in the pom.xml file). For Hive, add -Phive -Phive-thriftserver (See http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables) for more details. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Ted Yu
In master (as well as 1.4.1) I don't see a hive profile in pom.xml. I do find a hive-provided profile, though. FYI On Mon, Jul 20, 2015 at 1:05 PM, Dean Wampler deanwamp...@gmail.com wrote: hadoop-2.6 is supported (look for profile XML in the pom.xml file). For Hive, add -Phive

Re: Joda Time best practice?

2015-07-20 Thread algermissen1971
Hi Harish, On 20 Jul 2015, at 20:37, Harish Butani rhbutani.sp...@gmail.com wrote: Hey Jan, Can you provide more details on the serialization and cache issues. My symptom is that I have a Joda DateTime on which I can call toString and getMillis without problems, but when I call getYear I

Fwd: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Sorry, should have sent this to user… However… it looks like the docs page may need some editing? Thx -Mike Begin forwarded message: From: Michael Segel msegel_had...@hotmail.com Subject: Silly question about building Spark 1.4.1 Date: July 20, 2015 at 12:26:40 PM MST To:

Re: Web UI Links

2015-07-20 Thread Bob Corsaro
I figured this out after spelunking the UI code a little. The trick is to set the SPARK_PUBLIC_DNS environment variable to the public DNS name of each server in the cluster, per node. I'm running in standalone mode, so it was just a matter of adding the setting to spark-env.sh. On Mon, Jul 20,

Re: How to restart Twitter spark stream

2015-07-20 Thread Zoran Jeremic
Thanks for the explanation. If I understand this correctly, in this approach I would actually stream everything from Twitter and perform the filtering in my application using Spark. Isn't this too much overhead if my application is interested in listening for a couple of hundred or thousand hashtags? On

Re: Spark-hive parquet schema evolution

2015-07-20 Thread Jerrick Hoang
I'm new to Spark; any ideas would be much appreciated! Thanks On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm aware of the support for schema evolution via the DataFrame API. Just wondering what would be the best way to go about dealing with schema

Spark 1.4.1,MySQL and DataFrameReader.read.jdbc fun

2015-07-20 Thread Aaron
I have Spark 1.4.1, running on a YARN cluster. When I launch pyspark in yarn-client mode: pyspark --jars ~/dev/spark/lib/mysql-connector-java-5.1.36-bin.jar --driver-class-path ~/dev/spark/lib/mysql-connector-java-5.1.36-bin.jar and then do the equivalent of.. tbl =

Re: Counting distinct values for a key?

2015-07-20 Thread N B
Hi Jerry, In fact, the HashSet approach is what we took earlier. However, this did not work with a windowed DStream (i.e. if we provide a forward and inverse reduce operation). The reason is that the inverse reduce tries to remove values that may still exist elsewhere in the window and should not

Is SPARK is the right choice for traditional OLAP query processing?

2015-07-20 Thread renga.kannan
All, I really appreciate anyone's input on this. We have a very simple traditional OLAP query processing use case. Our use case is as follows. 1. We have customer sales order data coming from an RDBMS table. 2. There are many dimension columns in the sales order table. For each of

RE: Kmeans Labeled Point RDD

2015-07-20 Thread Mohammed Guller
I responded to your question on SO. Let me know if this is what you wanted. http://stackoverflow.com/a/31528274/2336943 Mohammed -Original Message- From: plazaster [mailto:michaelplaz...@gmail.com] Sent: Sunday, July 19, 2015 11:38 PM To: user@spark.apache.org Subject: Re: Kmeans

RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Michael, How would the Catalyst optimizer optimize this version? df.filter(df(filter_field) === value).select(field1).show() Would it still read all the columns in df or would it read only "filter_field" and "field1" since only two columns are used (assuming other columns from df are not used

Re: Data frames select and where clause dependency

2015-07-20 Thread Harish Butani
Yes, via org.apache.spark.sql.catalyst.optimizer.ColumnPruning. See DefaultOptimizer.batches for the list of logical rewrites. You can see the optimized plan by printing df.queryExecution.optimizedPlan. On Mon, Jul 20, 2015 at 5:22 PM, Mohammed Guller moham...@glassbeam.com wrote: Michael, How
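
A small hedged sketch of checking that (the Parquet path and column names are illustrative):

    val df = sqlContext.read.parquet("/path/to/table")
    val query = df.filter(df("filter_field") === "value").select("field1")

    // With column pruning, the optimized plan's Project should list only filter_field and field1,
    // so a columnar source such as Parquet reads just those columns.
    println(query.queryExecution.optimizedPlan)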

RE: Data frames select and where clause dependency

2015-07-20 Thread Mohammed Guller
Thanks, Harish. Mike – this would be a cleaner version for your use case: df.filter(df(filter_field) === value).select(field1).show() Mohammed From: Harish Butani [mailto:rhbutani.sp...@gmail.com] Sent: Monday, July 20, 2015 5:37 PM To: Mohammed Guller Cc: Michael Armbrust; Mike Trienis;

Re: Data frames select and where clause dependency

2015-07-20 Thread Mike Trienis
Definitely, thanks Mohammed. On Mon, Jul 20, 2015 at 5:47 PM, Mohammed Guller moham...@glassbeam.com wrote: Thanks, Harish. Mike – this would be a cleaner version for your use case: df.filter(df(filter_field) === value).select(field1).show() Mohammed *From:* Harish Butani

Re: Increment counter variable in RDD transformation function

2015-07-20 Thread Ted Yu
Please see http://spark.apache.org/docs/latest/programming-guide.html#local-vs-cluster-modes Cheers On Mon, Jul 20, 2015 at 3:21 PM, dlmar...@comcast.net wrote: I’m trying to keep track of some information in a RDD.flatMap() function (using Java API in 1.4.0). I have two longs in the
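
A minimal Scala sketch of what the linked section describes (the original question used the Java API; process() is a hypothetical per-record expansion): closure-captured local variables are copied to each executor, so their increments never reach the driver, whereas accumulators do.

    val emitted = sc.accumulator(0L, "objects emitted")   // driver-visible counter

    val out = data.flatMap { x =>
      val results = process(x)          // hypothetical expansion of one input record
      emitted += results.size           // accumulator update is safe inside the task
      results
    }
    out.count()                         // run an action first
    println(emitted.value)              // read the value only on the driver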

spark streaming 1.3 coalesce on kafkadirectstream

2015-07-20 Thread Shushant Arora
Does Spark Streaming 1.3 launch a task for each partition offset range, whether that is 0 or not? If yes, how can I enforce it not to launch tasks for empty RDDs? I'm not able to use coalesce on directKafkaStream. Shall we always enforce repartitioning before processing the direct stream? Use case
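
A hedged sketch of the two workarounds discussed in this thread (stream construction is omitted; numExecutors and push() are placeholders):

    val repartitioned = directKafkaStream.repartition(numExecutors)   // spread partitions across executors

    repartitioned.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {                       // skip batches with no new offsets
        rdd.foreachPartition { iter =>
          iter.foreach(record => push(record))    // push() stands in for the external sink
        }
      }
    }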

standalone to connect mysql

2015-07-20 Thread Jack Yang
Hi there, I would like to use Spark to access the data in MySQL. So firstly I tried to run the program using: spark-submit --class sparkwithscala.SqlApp --driver-class-path /home/lib/mysql-connector-java-5.1.34.jar --master local[4] /home/myjar.jar which returns the correct results. Then I
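
For reference, a hedged sketch of reading a MySQL table through the Spark 1.4 DataFrameReader once the connector jar is on both the driver and executor classpaths (URL, table, and credentials are placeholders):

    val props = new java.util.Properties()
    props.setProperty("user", "myuser")
    props.setProperty("password", "mypassword")

    val jdbcUrl = "jdbc:mysql://dbhost:3306/mydb"
    val df = sqlContext.read.jdbc(jdbcUrl, "sales_order", props)
    df.show()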