RDD Partitions issue

2015-10-15 Thread Renu Yadav
I am reading Parquet files from a directory that has 400 files of at most 180 MB each, so the number of partitions should be 400, since the split size is 256 MB in my case. But it is producing 787 partitions. Why is that? Please help. Thanks, Renu
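
A minimal sketch of checking and adjusting the partition count after the read (assuming Spark 1.5 with an existing SQLContext; the path is hypothetical):

    // Inspect how many partitions the Parquet read actually produced,
    // then coalesce down if the split heuristic over-partitioned the data.
    val df = sqlContext.read.parquet("hdfs:///data/parquet-dir")
    println(df.rdd.partitions.length)   // e.g. 787 in the case described above
    val repacked = df.coalesce(400)     // reduces partitions without a full shuffle
    println(repacked.rdd.partitions.length)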

Fwd: Get the previous state string

2015-10-15 Thread Yogesh Vyas
-- Forwarded message -- From: Yogesh Vyas Date: Thu, Oct 15, 2015 at 6:08 PM Subject: Get the previous state string To: user@spark.apache.org Hi, I am new to Spark and was trying to do some experiments with it. I had a JavaPairDStream RDD. I

Re: SPARK SQL Error

2015-10-15 Thread Giridhar Maddukuri
Hi Dilip, I tried this option also: spark-submit --master yarn --class org.spark.apache.CsvDataSource /home/cloudera/Desktop/TestMain.jar --files hdfs://quickstart.cloudera:8020/people_csv and am getting a similar error: Exception in thread "main" org.apache.spark.SparkException: Could not parse Master

Re: org.apache.spark.sql.AnalysisException with registerTempTable

2015-10-15 Thread Yusuf Can Gürkan
Also, I create the first table with this code: val landingDF = streamingJsonDF.selectExpr("get_json_object(json,'$.partner_id') partnerid", "from_unixtime(cast(get_json_object(json,'$.date') as int)+ 60 * 60 * 3) date",

Re: SPARK SQL Error

2015-10-15 Thread Dilip Biswal
Hi Giri, You are perhaps missing the "--files" option before the supplied hdfs file name ? spark-submit --master yarn --class org.spark.apache.CsvDataSource /home/cloudera/Desktop/TestMain.jar --files hdfs://quickstart.cloudera:8020/people_csv Please refer to Ritchard's comments on why the

org.apache.spark.sql.AnalysisException with registerTempTable

2015-10-15 Thread Yusuf Can Gürkan
Hello, I’m running some Spark SQL queries with the registerTempTable function. But I get the error below: org.apache.spark.sql.AnalysisException: resolved attribute(s) day#1680,year#1678,dt#1682,month#1679,hour#1681 missing from

Re: Best practices to handle corrupted records

2015-10-15 Thread Roberto Congiu
I came to a similar solution to a similar problem. I deal with a lot of CSV files from many different sources and they are often malformed. However, I just have success/failure. Maybe you should make SuccessWithWarnings a subclass of Success, or get rid of it altogether, making the warnings

Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-15 Thread Shixiong Zhu
Thanks for reporting it Terry. I submitted a PR to fix it: https://github.com/apache/spark/pull/9132 Best Regards, Shixiong Zhu 2015-10-15 2:39 GMT+08:00 Reynold Xin : > +dev list > > On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo wrote: > >> All, >> >>

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Eugen Cepoi
Hey, A quick update on other things that have been tested. When looking at the compiled code of the spark-streaming-kinesis-asl jar everything looks normal (there is a class that implements SyncMap and it is used inside the receiver). Starting a spark shell and using introspection to instantiate

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Phil Kallos
Not a dumb question, but yes I updated all of the library references to 1.5, including (even tried 1.5.1). // Versions.spark set elsewhere to "1.5.0" "org.apache.spark" %% "spark-streaming-kinesis-asl" % Versions.spark % "provided" I am experiencing the issue in my own spark project, but also

[no subject]

2015-10-15 Thread Lei Wu
Dear all, Like the design doc in SPARK-1 for Spark memory management, is there a design doc for Spark task scheduling details ? I'd really like to dive deep into the task scheduling module of Spark, thanks so much !

PMML export for LinearRegressionModel

2015-10-15 Thread Fazlan Nazeem
Hi I am trying to export a LinearRegressionModel in PMML format. According to the following resource[1] PMML export is supported for LinearRegressionModel. [1] https://spark.apache.org/docs/latest/mllib-pmml-model-export.html But there is *no* *toPMML* method in *LinearRegressionModel* class
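
For reference, a minimal sketch assuming the mllib (RDD-based) LinearRegressionModel, which is the class the linked page documents as exportable; the newer spark.ml class of the same name does not expose toPMML, which may explain the missing method:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // Toy data just to obtain a model; substitute the real training set.
    val data = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0)),
      LabeledPoint(2.0, Vectors.dense(2.0))))
    val model = LinearRegressionWithSGD.train(data, 10)

    // mllib's LinearRegressionModel mixes in PMMLExportable:
    println(model.toPMML())                     // PMML document as a String
    model.toPMML("/tmp/linear-regression.xml")  // or write it to a local path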

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Jean-Baptiste Onofré
Thanks for the update, Phil. I'm preparing an environment to reproduce it. I'll keep you posted. Thanks again, Regards JB On 10/15/2015 08:36 AM, Phil Kallos wrote: Not a dumb question, but yes I updated all of the library references to 1.5, including (even tried 1.5.1). // Versions.spark set

Re: SPARK SQL Error

2015-10-15 Thread Giri
Hi Ritchard, Thank you so much again for your input. This time I ran the command in the following way: spark-submit --master yarn --class org.spark.apache.CsvDataSource /home/cloudera/Desktop/TestMain.jar hdfs://quickstart.cloudera:8020/people_csv But I am facing the new error "Could not parse

RE: Running in cluster mode causes native library linking to fail

2015-10-15 Thread prajod.vettiyattil
Forwarding to the group, in case someone else has the same error. Just found out that I did not reply to the group in my original reply. From: Prajod S Vettiyattil (WT01 - BAS) Sent: 15 October 2015 11:45 To: 'Bernardo Vecchia Stein' Subject: RE: Running in cluster

Re: Spark on Mesos / Executor Memory

2015-10-15 Thread Bharath Ravi Kumar
Resending since user@mesos bounced earlier. My apologies. On Thu, Oct 15, 2015 at 12:19 PM, Bharath Ravi Kumar wrote: > (Reviving this thread since I ran into similar issues...) > > I'm running two spark jobs (in mesos fine grained mode), each belonging to > a different

How VectorIndexer works in Spark ML pipelines

2015-10-15 Thread VISHNU SUBRAMANIAN
Hi All, I am trying to use the VectorIndexer (FeatureExtraction) technique available from the Spark ML Pipelines. I ran the example in the documentation: val featureIndexer = new VectorIndexer() .setInputCol("features") .setOutputCol("indexedFeatures") .setMaxCategories(4) .fit(data)
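
For context, a minimal sketch of the same example with the fitted indexer then applied to the data (assuming a DataFrame named data that has a "features" vector column, as in the docs):

    import org.apache.spark.ml.feature.VectorIndexer

    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)                          // decides which feature columns are categorical

    // Features with at most 4 distinct values are treated as categorical and re-indexed;
    // the rest pass through unchanged as continuous values.
    val indexed = featureIndexer.transform(data)
    indexed.select("features", "indexedFeatures").show()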

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Eugen Cepoi
So running it using spark-submit doesn't change anything; it still works. When reading the code https://github.com/apache/spark/blob/branch-1.5/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L100 it looks like the receivers are definitely being ser/de. I think

Re: Spark on Mesos / Executor Memory

2015-10-15 Thread Bharath Ravi Kumar
(Reviving this thread since I ran into similar issues...) I'm running two spark jobs (in mesos fine grained mode), each belonging to a different mesos role, say low and high. The low:high mesos weights are 1:10. On expected lines, I see that the low priority job occupies cluster resources to the

Re: spark-shell :javap fails with complaint about JAVA_HOME, but it is set correctly

2015-10-15 Thread Shixiong Zhu
Scala 2.10 REPL javap doesn't support Java7 or Java8. It was fixed in Scala 2.11. See https://issues.scala-lang.org/browse/SI-4936 Best Regards, Shixiong Zhu 2015-10-15 4:19 GMT+08:00 Robert Dodier : > Hi, > > I am working with Spark 1.5.1 (official release), with

Design doc for Spark task scheduling

2015-10-15 Thread Lei Wu
Like the design doc in SPARK-1 for Spark memory management, is there a design doc for Spark task scheduling details ? I'd really like to dive deep into the task scheduling module of Spark, thanks so much ! Forgot the email title in previous mail, sorry for that.

How to specify the numFeatures in HashingTF

2015-10-15 Thread Jianguo Li
Hi, There is a parameter in HashingTF called "numFeatures". I was wondering what is the best way to set the value of this parameter. In the use case of text categorization, do you need to know in advance the number of words in your vocabulary? Or do you set it to be a large value, greater

Re: Best practices to handle corrupted records

2015-10-15 Thread Antonio Murgia
'Either' does not cover the case where the outcome was successful but generated warnings. I already looked into it and also at 'Try' from which I got inspired. Thanks for pointing it out anyway! #A.M. Il giorno 15 ott 2015, alle ore 16:19, Erwan ALLAIN
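
For illustration, a minimal sketch of the kind of result type being discussed (all names here are made up, not from the thread):

    // Outcome of parsing one record: clean success, success with warnings, or failure.
    sealed trait ParseResult[+A]
    case class ParseSuccess[A](value: A) extends ParseResult[A]
    case class SuccessWithWarnings[A](value: A, warnings: Seq[String]) extends ParseResult[A]
    case class ParseFailure(record: String, errors: Seq[String]) extends ParseResult[Nothing]

    def parse(line: String): ParseResult[Array[String]] = {
      val fields = line.split(",")
      if (fields.length < 3) ParseFailure(line, Seq("too few fields"))
      else if (fields.exists(_.trim.isEmpty)) SuccessWithWarnings(fields, Seq("empty field"))
      else ParseSuccess(fields)
    }

    // Downstream, keep the usable rows (with or without warnings) and count the rest:
    // val parsed = rdd.map(parse)
    // val usable = parsed.collect { case ParseSuccess(v) => v
    //                               case SuccessWithWarnings(v, _) => v }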

Re: How to specify the numFeatures in HashingTF

2015-10-15 Thread Nick Pentreath
Setting the numFeatures higher than the vocab size will tend to reduce the chance of hash collisions, but it's not strictly necessary - it becomes a memory / accuracy trade-off. Surprisingly, the impact on model performance of moderate hash collisions is often not significant. So it may
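
A minimal sketch of setting the parameter explicitly (assuming the spark.ml HashingTF and a DataFrame with a tokenized "words" column):

    import org.apache.spark.ml.feature.HashingTF

    // 1 << 18 = 262,144 buckets. Raising this lowers the chance of hash collisions
    // at the cost of larger (but sparse) feature vectors.
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(1 << 18)

    // val featurized = hashingTF.transform(wordsDataFrame)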

Re: Best practices to handle corrupted records

2015-10-15 Thread Erwan ALLAIN
What about http://www.scala-lang.org/api/2.9.3/scala/Either.html ? On Thu, Oct 15, 2015 at 2:57 PM, Roberto Congiu wrote: > I came to a similar solution to a similar problem. I deal with a lot of > CSV files from many different sources and they are often malformed. >

How to enable Spark mesos docker executor?

2015-10-15 Thread Klaus Ma
Hi team, I'm working on integration between Mesos & Spark. For now, I can start SlaveMesosDispatcher in a docker; and I would also like to run the Spark executor in a Mesos docker. I did the following configuration for it, but I got an error; any suggestion? Configuration: Spark:
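
The poster's actual configuration was cut off above; purely as a generic sketch, these are the Mesos/Docker-related properties usually involved (image name and path are hypothetical):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.mesos.executor.docker.image", "myregistry/spark:1.5.1") // image to run executors in
      .set("spark.mesos.executor.home", "/opt/spark")                     // Spark location inside that image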

Re: How to enable Spark mesos docker executor?

2015-10-15 Thread Timothy Chen
Hi Klaus, Sorry, not next to a computer, but it could possibly be a bug that it doesn't take SPARK_HOME as the base path. Currently the Spark image seems to set the working directory so that it works. I'll look at the code to verify, but it seems like it could be the case. If it's true feel free

[Spark ML] How to extends MLlib's optimization algorithm

2015-10-15 Thread Zhiliang Zhu
Dear All, I would like to use Spark ML to develop a project related to optimization algorithms; however, in Spark 1.4.1 it seems that under ml's optimizer there are only about 2 optimization algorithms. My project may need more kinds of optimization algorithms, so how would I use spark

Re: Spark streaming checkpoint against s3

2015-10-15 Thread Tian Zhang
So as long as jar is kept on s3 and available across different runs, then the s3 checkpoint is working.

RE: How to enable Spark mesos docker executor?

2015-10-15 Thread Klaus Ma
Hi Timothy, Thanks for your feedback, I logged https://issues.apache.org/jira/browse/SPARK-11143 to trace this issue. If any more suggestions, please let me know :). Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer Platform Symphony/DCOS Development & Support, STG, IBM GCG

Get list of Strings from its Previous State

2015-10-15 Thread Yogesh Vyas
Hi, I am new to Spark and was trying to do some experiments with it. I had a JavaPairDStream RDD. I want to get the list of strings from its previous state. For that I use the updateStateByKey function as follows: final Function2, Optional> updateFunc =
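
The snippet above is Java; a minimal Scala sketch of the same idea (accumulating a list of strings per key with updateStateByKey) may help clarify the shape of the update function:

    // pairs: DStream[(String, String)] -- key plus the new string seen in this batch.
    // The state is the list of strings accumulated so far for that key.
    val updated = pairs.updateStateByKey[List[String]] {
      (newValues: Seq[String], state: Option[List[String]]) =>
        Some(state.getOrElse(Nil) ++ newValues)
    }

    // updateStateByKey requires checkpointing, e.g.:
    // ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")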

RE: dataframes and numPartitions

2015-10-15 Thread Mohammed Guller
You may find the spark.sql.shuffle.partitions property useful. The default value is 200. Mohammed From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com] Sent: Wednesday, October 14, 2015 8:14 PM To: user Subject: dataframes and numPartitions A lot of RDD methods take a numPartitions
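
A minimal sketch of changing it at runtime (assuming an existing SQLContext named sqlContext):

    // Default is 200; this controls how many partitions DataFrame joins and
    // aggregations produce after a shuffle.
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")

    // Or set it at submit time:
    //   spark-submit --conf spark.sql.shuffle.partitions=64 ...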

RE: Spark SQL running totals

2015-10-15 Thread Stefan Panayotov
Thanks to all of you guys for the helpful suggestions. I'll try these first thing tomorrow morning. Stefan Panayotov Sent from my Windows Phone From: java8964 Sent: 10/15/2015 4:30 PM To: Michael

RE: Spark SQL running totals

2015-10-15 Thread java8964
My mistake. I didn't notice that "UNBOUNDED PRECEDING" is already supported. So a cumulative sum should work then. Thanks Yong From: java8...@hotmail.com To: mich...@databricks.com; deenar.toras...@gmail.com CC: spanayo...@msn.com; user@spark.apache.org Subject: RE: Spark SQL running totals Date: Thu, 15

RE: Spark SQL running totals

2015-10-15 Thread java8964
Not sure the window function can work for his case. If you do a "sum() over (partition by)", that will return a total sum per partition, instead of the cumulative sum wanted in this case. I saw there is a "cume_dist", but no "cume_sum". Do we really have a "cume_sum" in the Spark window functions, or

Re: Problem installing Sparck on Windows 8

2015-10-15 Thread Marco Mistroni
Hi, I tried to set this variable in my Windows env variables but got the same result. This is the result of calling set in my command prompt; have I amended it in the wrong place? kr marco .. USERDOMAIN=MarcoLaptop USERDOMAIN_ROAMINGPROFILE=MarcoLaptop USERNAME=marco USERPROFILE=C:\Users\marco

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Jean-Baptiste Onofré
Hi Phil, sorry I didn't have time to investigate yesterday (I was on a couple of other Apache projects ;)). I will try to do it today. I'll keep you posted. Regards JB On 10/16/2015 07:21 AM, Phil Kallos wrote: JB, To clarify, you are able to run the Amazon Kinesis example provided in the

Get the previous state string in Spark streaming

2015-10-15 Thread Chandra Mohan, Ananda Vel Murugan
One of my co-workers (Yogesh) was trying to get this posted to the Spark mailing list and it seems it did not get posted, so I am reposting it here. Please help. Hi, I am new to Spark and was trying to do some experiments with it. I had a JavaPairDStream RDD. I want to get the list

Is Feature Transformations supported by Spark export to PMML

2015-10-15 Thread Weiwei Zhang
Hi Folks, I am trying to find out whether Spark's export to PMML has support for feature transformations. I know in R, I need to specify local transformations and attributes using the "pmml" and "pmmlTransformation" libraries. The example I read for Spark simply applies the "toPMML" function and it

Re: s3a file system and spark deployment mode

2015-10-15 Thread Raghavendra Pandey
You can use the Spark 1.5.1 "without Hadoop" build together with Hadoop 2.7.1. Hadoop 2.7.1 is more mature for s3a access. You also need to add the Hadoop tools dir to the Hadoop classpath... Raghav On Oct 16, 2015 1:09 AM, "Scott Reynolds" wrote: > We do not use EMR. This is deployed on Amazon VMs
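
A minimal sketch of pointing Spark at the s3a filesystem from application code, assuming the hadoop-aws and aws-java-sdk jars are already on the classpath (with IAM instance roles, no access keys need to be configured):

    // fs.s3a.impl is a standard Hadoop 2.6+/2.7 property, not a Spark-specific one.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    // val lines = sc.textFile("s3a://my-bucket/path/to/data")   // hypothetical bucket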

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Phil Kallos
JB, To clarify, you are able to run the Amazon Kinesis example provided in the spark examples dir? bin/run-example streaming.KinesisWordCountASL [app name] [stream name] [endpoint url] ? If it helps, below are the steps I used to build spark mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests

Turn off logs in spark-sql shell

2015-10-15 Thread Muhammad Ahsan
Hello Everyone! I want to know how to turn off logging when starting the *spark-sql shell*, without changing the log4j configuration files. In a normal spark-shell I can use the following commands: import org.apache.log4j.Logger; import org.apache.log4j.Level
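
For reference, a cleaned-up version of the spark-shell snippet the poster refers to (run before the noisy queries):

    import org.apache.log4j.{Level, Logger}

    // Silence Spark's and Akka's own loggers without editing log4j.properties.
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)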

Application not found in Spark historyserver in yarn-client mode

2015-10-15 Thread Anfernee Xu
Sorry, I have to re-send it again as I did not get an answer. Here's the problem I'm facing: I'm using the Spark 1.5.0 release, and I have a standalone Java application which periodically submits Spark jobs to my YARN cluster. Btw, I'm not using 'spark-submit' or 'org.apache.spark.launcher' to submit

Re:

2015-10-15 Thread Dirceu Semighini Filho
Hi Anfernee, A subject in the email sometimes helps ;) Have you checked whether the link is sending you to a hostname that is not accessible from your workstation? Sometimes changing the hostname to the IP solves this kind of issue. 2015-10-15 13:34 GMT-03:00 Anfernee Xu : > Sorry, I

[no subject]

2015-10-15 Thread Anfernee Xu
Sorry, I have to re-send it again as I did not get an answer. Here's the problem I'm facing: I have a standalone Java application which periodically submits Spark jobs to my YARN cluster; btw I'm not using 'spark-submit' or 'org.apache.spark.launcher' to submit my jobs. These jobs are

word2vec cosineSimilarity

2015-10-15 Thread Arthur Chan
Hi, I am trying the word2vec sample from http://spark.apache.org/docs/latest/mllib-feature-extraction.html#example Following are my test results: scala> for((synonym, cosineSimilarity) <- synonyms) { | println(s"$synonym $cosineSimilarity") | } taiwan 2.0518918365726297 japan

Complex transformation on a dataframe column

2015-10-15 Thread Hao Wang
Hi, I have searched around but could not find a satisfying answer to this question: what is the best way to do a complex transformation on a dataframe column? For example, I have a dataframe with the following schema and a function that has pretty complex logic to format addresses. I would
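
One common approach is to wrap the complex logic in a UDF; a minimal sketch (the function body and column names are purely illustrative):

    import org.apache.spark.sql.functions.udf

    // Arbitrarily complex Scala logic lives in an ordinary function...
    def formatAddress(raw: String): String =
      raw.trim.toUpperCase.replaceAll("\\s+", " ")   // placeholder for the real formatting rules

    // ...and is applied column-wise through a UDF.
    val formatAddressUdf = udf(formatAddress _)
    val formatted = df.withColumn("address_formatted", formatAddressUdf(df("address")))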

Re: SQL Context error in 1.5.1 - any work around ?

2015-10-15 Thread Richard Hillegas
A crude workaround may be to run your spark shell with a sudo command. Hope this helps, Rick Hillegas Sourav Mazumder wrote on 10/15/2015 09:59:02 AM: > From: Sourav Mazumder > To: user > Date: 10/15/2015 09:59

SQL Context error in 1.5.1 - any work around ?

2015-10-15 Thread Sourav Mazumder
I keep on getting this error whenever I'm starting spark-shell: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx--. I cannot work with this if I need to do anything with sqlContext as that does not get created. I could see that a bug is raised for

Re: SPARK SQL Error

2015-10-15 Thread pnpritchard
Going back to your code, I see that you instantiate the spark context as: val sc = new SparkContext(args(0), "Csv loading example") which will set the master url to "args(0)" and app name to "Csv loading example". In your case, args(0) is "hdfs://quickstart.cloudera:8020/people_csv", which
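
A minimal sketch of the fix being suggested: let spark-submit's --master flag decide the master and treat the first program argument purely as a data path (class and variable names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object CsvDataSource {
      def main(args: Array[String]): Unit = {
        // No master URL here; "spark-submit --master yarn" supplies it.
        val conf = new SparkConf().setAppName("Csv loading example")
        val sc = new SparkContext(conf)

        val inputPath = args(0)   // e.g. hdfs://quickstart.cloudera:8020/people_csv
        println(sc.textFile(inputPath).count())
        sc.stop()
      }
    }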

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-15 Thread legolasluk
In case of a job being aborted, does the application fail? If not, then how do you make it fail? How should a streaming application recover without data loss? How does this apply to the zero-data-loss behavior for Kinesis receivers in Spark 1.5? Thanks, Ashish  Original message From:

Re: s3a file system and spark deployment mode

2015-10-15 Thread Spark Newbie
Are you using EMR? You can install Hadoop-2.6.0 along with Spark-1.5.1 in your EMR cluster. And that brings s3a jars to the worker nodes and it becomes available to your application. On Thu, Oct 15, 2015 at 11:04 AM, Scott Reynolds wrote: > List, > > Right now we build our

Re: s3a file system and spark deployment mode

2015-10-15 Thread Scott Reynolds
We do not use EMR. This is deployed on Amazon VMs We build Spark with Hadoop-2.6.0 but that does not include the s3a filesystem nor the Amazon AWS SDK On Thu, Oct 15, 2015 at 12:26 PM, Spark Newbie wrote: > Are you using EMR? > You can install Hadoop-2.6.0 along with

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Jean-Baptiste Onofré
Hi Phil, KinesisReceiver is part of extra. Just a dumb question: did you update all, including the Spark Kinesis extra containing the KinesisReceiver ? I checked on tag v1.5.0, and at line 175 of the KinesisReceiver, we see: blockIdToSeqNumRanges.clear() which is a: private val

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Jean-Baptiste Onofré
By correct, I mean: the map declaration looks good to me, so the ClassCastException is weird ;) I'm trying to reproduce the issue in order to investigate. Regards JB On 10/15/2015 08:03 AM, Jean-Baptiste Onofré wrote: Hi Phil, KinesisReceiver is part of extra. Just a dumb question: did you

Re: Spark UI consuming lots of memory

2015-10-15 Thread Nicholas Pritchard
Thanks for your help, most likely this is the memory leak you are fixing in https://issues.apache.org/jira/browse/SPARK-11126. -Nick On Mon, Oct 12, 2015 at 9:00 PM, Shixiong Zhu wrote: > In addition, you cannot turn off JobListener and SQLListener now... > > Best Regards, >

Spark 1.5.1 ThriftServer

2015-10-15 Thread Dirceu Semighini Filho
Hello, I'm trying to migrate to Scala 2.11 and I didn't find a spark-thriftserver jar for Scala 2.11 in the Maven repository. I could do a manual build (without tests) of Spark with the Thrift server in Scala 2.11. Some time ago the Thrift server build wasn't enabled by default, but I can find a 2.10 jar for

multiple pyspark instances simultaneously (same time)

2015-10-15 Thread jeff.sadow...@gmail.com
I am having issues trying to set up Spark to run jobs simultaneously. I thought I wanted FAIR scheduling? I used the templated fairscheduler.xml as is. When I start pyspark I see the 3 expected pools: production, test, and default. When I log in as a second user and run pyspark I see the expected

s3a file system and spark deployment mode

2015-10-15 Thread Scott Reynolds
List, Right now we build our Spark jobs with the s3a Hadoop client. We do this because our machines are only allowed to use IAM access to the S3 store. We can build our jars with the s3a filesystem and the AWS SDK just fine, and these jars run great in *client mode*. We would like to move from

Re: Spark SQL running totals

2015-10-15 Thread Deenar Toraskar
You can do a self join of the table with itself, with the join clause being a.col1 >= b.col1: select a.col1, a.col2, sum(b.col2) from tablea as a left outer join tablea as b on (a.col1 >= b.col1) group by a.col1, a.col2 I haven't tried it, but can't see why it can't work, though doing it in RDD might be

Spark SQL running totals

2015-10-15 Thread Stefan Panayotov
Hi, I need help with Spark SQL. I need to achieve something like the following. If I have data like:

    col_1  col_2
    1      10
    2      30
    3      15
    4      20
    5      25

I need to get col_3 to be the running total of the sum of the previous rows of col_2, e.g. col_1 col_2 col_3

Re: Spark SQL running totals

2015-10-15 Thread Michael Armbrust
Check out: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html On Thu, Oct 15, 2015 at 11:35 AM, Deenar Toraskar wrote: > you can do a self join of the table with itself with the join clause being > a.col1 >= b.col1 > > select

Re: Spark SQL running totals

2015-10-15 Thread Kristina Rogale Plazonic
You can do it and many other transformations very easily with window functions, see this blog post: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html In your case you would do (in Scala): import org.apache.spark.sql.expressions.Window import
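
A minimal sketch of the cumulative sum with the 1.5 DataFrame window API (column names follow the example above; in 1.5 window functions require a HiveContext, and the unbounded lower frame boundary is written as Long.MinValue):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // Running total of col_2 over all rows up to and including the current one, ordered by col_1.
    val w = Window.orderBy("col_1").rowsBetween(Long.MinValue, 0)
    val withRunningTotal = df.withColumn("col_3", sum("col_2").over(w))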

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-15 Thread Spark Newbie
What is the best way to fail the application when job gets aborted? On Wed, Oct 14, 2015 at 1:27 PM, Tathagata Das wrote: > When a job gets aborted, it means that the internal tasks were retried a > number of times before the system gave up. You can control the number >