Re: mllib on (key, Iterable[Vector])

2015-08-11 Thread Feynman Liang
You could try flatMapping, i.e. if you have data : RDD[(key, Iterable[Vector])] then data.flatMap(_._2) : RDD[Vector], which can be GMMed. If you want to first partition by url, I would first create multiple RDDs using `filter`, then run GMM on each of the filtered RDDs. On Tue, Aug 11, 2015
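A minimal sketch of both approaches, assuming MLlib's GaussianMixture and an arbitrary k; `data` here is the RDD[(key, Iterable[Vector])] from the question:

    import org.apache.spark.mllib.clustering.GaussianMixture
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Option 1: flatten all vectors and fit a single GMM (k = 3 is an arbitrary choice)
    val flattened: RDD[Vector] = data.flatMap(_._2)
    val globalModel = new GaussianMixture().setK(3).run(flattened)

    // Option 2: fit one GMM per key by filtering first
    val perKeyModels = data.keys.distinct().collect().map { k =>
      k -> new GaussianMixture().setK(3).run(data.filter(_._1 == k).flatMap(_._2))
    }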

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
I am also trying to understand how files are named when writing to Hadoop. For example, how does the saveAs method ensure that each executor generates unique files? On Tue, Aug 11, 2015 at 4:21 PM, ayan guha guha.a...@gmail.com wrote: partitioning - by itself - is a property of RDD. so essentially

Re: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Yu
Yan: Where can I find performance numbers for Astro (it's close to middle of August) ? Cheers On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote: Finally I can take a look at HBASE-14181 now. Unfortunately there is no design doc mentioned. Superficially it is very

Re: Sporadic Input validation failed error when executing LogisticRegressionWithLBFGS.train

2015-08-11 Thread ai he
Hi Francis, From my observation when using Spark SQL, dataframe.limit(n) does not necessarily return the same result each time an app runs. To be more precise, within one app the result should be the same for the same n; however, changing n might not result in the same prefix (the result for n =

Re: ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
After changing to '--deploy_mode client' the program seems to work; however, it looks like there is a bug in Spark when using --deploy_mode as 'yarn'. Should I open a bug? On Tue, Aug 11, 2015 at 3:02 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I see the following line in the log 15/08/11

Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
How does partitioning in Spark work when it comes to streaming? What's the best way to partition time series data grouped by a certain tag, like categories of product (video, music, etc.)?

RE: Refresh table

2015-08-11 Thread Cheng, Hao
Refreshing a table only works for Spark SQL data sources in my understanding; apparently here you're running a Hive table. Can you try to create a table like: CREATE TEMPORARY TABLE parquetTable (a int, b string) USING org.apache.spark.sql.parquet.DefaultSource
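A sketch of that suggestion with a hypothetical Parquet path, assuming a HiveContext (which exposes refreshTable) so the table can be re-scanned after new files land:

    // Register a Parquet-backed data-source table (path is hypothetical)
    hiveContext.sql(
      """CREATE TEMPORARY TABLE parquetTable (a int, b string)
        |USING org.apache.spark.sql.parquet.DefaultSource
        |OPTIONS (path '/data/parquet/table')""".stripMargin)

    // Re-scan the underlying files after new data arrives
    hiveContext.refreshTable("parquetTable")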

spark vs flink low memory available

2015-08-11 Thread parö
Hi community, I have built a Spark and a Flink k-means application. My test case is clustering 1 million points on a 3-node cluster. When memory becomes a bottleneck, Flink begins to spill to disk and works slowly, but it works. Spark, however, loses executors if the memory is full and starts again (infinitely

Re: SparkR -Graphx Connected components

2015-08-11 Thread Robineast
To be part of a strongly connected component every vertex must be reachable from every other vertex. Vertex 6 is not reachable from the other vertices of SCC 0. Same goes for 7. So both 6 and 7 form their own strongly connected components. 6 and 7 are part of the connected components of 0 and 3
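For reference, a minimal GraphX sketch (the edge list path is hypothetical) that computes strongly connected components; each vertex ends up labeled with the smallest vertex id in its component:

    import org.apache.spark.graphx._

    // Load a directed graph and run SCC for a bounded number of iterations
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")  // hypothetical path
    val scc = graph.stronglyConnectedComponents(numIter = 10).vertices
    scc.collect().foreach { case (vertexId, componentId) => println(s"$vertexId -> $componentId") }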

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-08-11 Thread Haripriya Ayyalasomayajula
Hello all, As a quick follow up for this, I have been using Spark on YARN till now and am currently exploring Mesos and Marathon. Using YARN, we could tell the Spark job the number of executors and the number of cores as well; is there a way to do it on Mesos? I'm using Spark 1.4.1 on Mesos

Re: spark vs flink low memory available

2015-08-11 Thread Ted Yu
Pa: Can you try 1.5.0 SNAPSHOT ? See SPARK-7075 Project Tungsten (Spark 1.5 Phase 1) Cheers On Tue, Aug 11, 2015 at 12:49 AM, jun kit...@126.com wrote: your detail of log file? At 2015-08-10 22:02:16, Pa Rö paul.roewer1...@googlemail.com wrote: hi community, i have build a spark and

Re: Spark Streaming dealing with broken files without dying

2015-08-11 Thread Akhil Das
You can do something like this: val fStream = ssc.textFileStream("/sigmoid/data/").map(x => { try { // Move all the transformations within a try..catch } catch { case e: Exception => { logError("Whoops!!"); null } } }) Thanks Best Regards On Mon, Aug 10, 2015 at 7:44 PM, Mario Pastorelli

AW: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-08-11 Thread rene.pfitzner
Hi – I'd like to follow up on this, as I am running into very similar issues (with a much bigger data set, though – 10^5 nodes, 10^7 edges). So let me repost the question: Any ideas on how to estimate GraphX memory requirements? Cheers! From: Roman Sokolov [mailto:ole...@gmail.com] Sent:

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Haripriya Ayyalasomayajula
Hi Tim, Spark on YARN allows us to do it using the --num-executors and --executor-cores command-line arguments. I just got a chance to look at a similar Spark user list mail, but no answer yet. So does Mesos allow setting the number of executors and cores? Is there a default number it assumes? On

Python3 Spark execution problems

2015-08-11 Thread Javier Domingo Cansino
Hi, I have been trying to use Spark for the processing I need to do on some logs, and I have found several difficulties during the process. Most of them I could overcome, but I am really stuck on the last one. I would really like to know how Spark is supposed to be deployed. For now, I have

Error while output JavaDStream to disk and mongodb

2015-08-11 Thread Deepesh Maheshwari
Hi, I have successfully reduced my data and stored it in a JavaDStream<BSONObject>. Now I want to save this data in MongoDB, for which I have used the BSONObject type. But when I try to save it, it throws an exception. I also tried to save it simply with saveAsTextFile, but got the same exception. Error

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-08-11 Thread Rick Moritz
Consider the spark.cores.max configuration option -- it should do what you require. On Tue, Aug 11, 2015 at 8:26 AM, Haripriya Ayyalasomayajula aharipriy...@gmail.com wrote: Hello all, As a quick follow up for this, I have been using Spark on Yarn till now and am currently exploring Mesos
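A sketch of capping total cores, assuming a coarse-grained Mesos deployment and a hypothetical master URL:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("capped-cores-demo")
      .setMaster("mesos://mesos-master:5050")   // hypothetical master URL
      .set("spark.cores.max", "16")             // upper bound on cores acquired across the cluster
    val sc = new SparkContext(conf)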

Re: Differents of loading data

2015-08-11 Thread Akhil Das
Load data to where? To Spark? If you are referring to Spark, then there are some differences in the way the connector is implemented. When you use Spark, the most important thing that you get is the parallelism (depending on the number of partitions). If you compare it with a native Java driver then

Re: spark vs flink low memory available

2015-08-11 Thread Pa Rö
My first post is here, and a log too: http://mail-archives.us.apache.org/mod_mbox/spark-user/201508.mbox/%3ccah2_pykqhfr4tbvpbt2tdhgm+zrkcbzfnk7uedkjpdhe472...@mail.gmail.com%3E I use Cloudera Live; I think I cannot use Spark 1.5. I will try to run it again and post the current logfile here.

Re: ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver

2015-08-11 Thread Sadaf Khan
okay. Then do you have any idea how to avoid this error? Thanks On Tue, Aug 11, 2015 at 12:08 AM, Tathagata Das t...@databricks.com wrote: I think this may be expected. When the streaming context is stopped without the SparkContext, then the receivers are stopped by the driver. The receiver

Re: Java Streaming Context - File Stream use

2015-08-11 Thread Akhil Das
Like this: (Including the filter function) JavaPairInputDStream<LongWritable, Text> inputStream = ssc.fileStream( testDir.toString(), LongWritable.class, Text.class, TextInputFormat.class, new Function<Path, Boolean>() { @Override public Boolean call(Path

Re: Inquiry about contributing code

2015-08-11 Thread Akhil Das
You can create a new issue and send a pull request for the same, I think. + dev list Thanks Best Regards On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote: Dear Sir / Madam, I have a plan to contribute some code for passing filters to a datasource as physical

Re: mllib kmeans produce 1 large and many extremely small clusters

2015-08-11 Thread sooraj
Hi, The issue is very likely to be in the data or the transformations you apply, rather than anything to do with the Spark Kmeans API as such. I'd start debugging by doing a bit of exploratory analysis of the TFIDF vectors. That is, for instance, plot the distribution (histogram) of the TFIDF

Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Yu
HBase will not have a query engine. It will provide better support to query engines. Cheers On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote: Ted, I'm in China now, and seem to have difficulty accessing Apache JIRA. Anyways, it appears to me that

Fwd: How to minimize shuffling on Spark dataframe Join?

2015-08-11 Thread Abdullah Anwar
I have two dataframes like this: student_rdf = (studentid, name, ...) student_result_rdf = (studentid, gpa, ...) We need to join these two dataframes. We are now doing it like this: student_rdf.join(student_result_rdf, student_result_rdf[studentid] == student_rdf[studentid]) So it is simple.

Re: Spark with GCS Connector - Rate limit error

2015-08-11 Thread Akhil Das
There's a daily quota and a minutely quota, and you could be hitting those. You can ask Google to increase the quota for the particular service. Now, to stay within the limit from the Spark side, you can do a re-partition to a smaller number before doing the save. Another way is to use the local file

How to specify column type when saving DataFrame as parquet file?

2015-08-11 Thread Jyun-Fan Tsai
Hi all, I'm using Spark 1.4.1. I create a DataFrame from a JSON file. There is a column C whose values are all null in the JSON file. I found that the datatype of column C in the created DataFrame is string. However, I would like to specify the column as Long when saving it as a parquet file. What
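One way to do this (a sketch; it assumes the column is literally named "C" and uses a hypothetical output path) is to cast the column before writing, so the Parquet schema records it as int64:

    import org.apache.spark.sql.types.LongType

    // Replace the all-null string column with a Long-typed version, then write
    val fixed = df.withColumn("C", df("C").cast(LongType))
    fixed.write.parquet("/data/out/parquet")   // hypothetical path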

dse spark-submit multiple jars issue

2015-08-11 Thread satish chandra j
Hi, Please let me know if I am missing anything in the command below. Command: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/dse.jar

Re: dse spark-submit multiple jars issue

2015-08-11 Thread Javier Domingo Cansino
Use --verbose, it might give you some insights on what's happening. Javier Domingo Cansino, Research & Development Engineer, Fon. On Tue, Aug

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Jerry Lam
My experience with Mesos + Spark is not great. I saw one executor with 30 CPU and the other executor with 6. So I don't think you can easily configure it without some tweaking at the source code. Sent from my iPad On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula aharipriy...@gmail.com

Re: dse spark-submit multiple jars issue

2015-08-11 Thread Javier Domingo Cansino
I have no real idea (not java user), but have you tried with the --jars option? http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management AFAIK, you are currently submitting the jar names as arguments to the called Class instead of the jars themselves

Do you have any other method to get cpu elapsed time of a spark application

2015-08-11 Thread JoneZhang
Is there more information about the Spark event log? For example, why does the SparkListenerExecutorRemoved event not appear in the event log while I use dynamic executors? I want to calculate the CPU elapsed time of an application based on the event log. By the way, do you have any other method to get CPU elapsed time

Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
Hi all, I don't have any Hadoop FS installed in my environment, but I would like to store dataframes in parquet files. I am failing to do so; if possible, could anyone give me any pointers? Thank you, Saif

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
I am launching my spark-shell: spark-1.4.1-bin-hadoop2.6/bin/spark-shell 15/08/11 09:43:32 INFO SparkILoop: Created sql context (with Hive support).. SQL context available as sqlContext. scala> val data = sc.parallelize(Array(2,3,5,7,2,3,6,1)).toDF scala> data.write.parquet("/var/data/Saif/pq")

Unsupported major.minor version 51.0

2015-08-11 Thread Yakubovich, Alexey
I found some discussions online, but they all come down to advice to use JDK 1.7 (or 1.8). Well, I use JDK 1.7 on OS X Yosemite. Both java -version // java version 1.7.0_80 Java(TM) SE Runtime Environment (build 1.7.0_80-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode) and echo

Re: Unsupported major.minor version 51.0

2015-08-11 Thread Ted Yu
What does the following command say ? mvn -version Maybe you are using an old maven ? Cheers On Tue, Aug 11, 2015 at 7:55 AM, Yakubovich, Alexey alexey.yakubov...@searshc.com wrote: I found some discussions online, but it all cpome to advice to use JDF 1.7 (or 1.8). Well, I use JDK 1.7 on

Re: dse spark-submit multiple jars issue

2015-08-11 Thread satish chandra j
Hi, Please find the log details below: dse spark-submit --verbose --master local --class HelloWorld etl-0.0.1-SNAPSHOT.jar --jars file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar file:/home/missingmerch/dse.jar file:/home/missingmerch/postgresql-9.4-1201.jdbc41.jar Using properties file:

RE: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-11 Thread Benjamin Ross
Jerry, I was able to use window functions without the Hive thrift server. HiveContext does not imply that you need the Hive thrift server running. Here's what I used to test this out: var conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1") val sc = new
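A condensed sketch of the same idea using the DataFrame window API on a HiveContext (the DataFrame df and the column names are assumptions, not from the original mail):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lag, lead, col}

    val hiveContext = new HiveContext(sc)            // no Thrift server required
    val w = Window.partitionBy("id").orderBy("ts")   // column names assumed

    val withNeighbors = df
      .withColumn("prev", lag(col("value"), 1).over(w))
      .withColumn("next", lead(col("value"), 1).over(w))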

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
I confirm that it works, I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 Saif From: Ellafi, Saif A. Sent: Tuesday, August 11, 2015 12:01 PM To: Ellafi, Saif A.; deanwamp...@gmail.com Cc: user@spark.apache.org Subject: RE: Parquet without hadoop: Possible? Sorry,

PySpark order-only window function issue

2015-08-11 Thread Maciej Szymkiewicz
Hello everyone, I am trying to use PySpark API with window functions without specifying partition clause. I mean something equivalent to this SELECT v, row_number() OVER (ORDER BY v) AS rn FROM df in SQL. I am not sure if I am doing something wrong or it is a bug but results are far from what I

RE: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-11 Thread Benjamin Ross
I forgot to mention, my setup was: - Spark 1.4.1 running in standalone mode - Datastax spark cassandra connector 1.4.0-M1 - Cassandra DB - Scala version 2.10.4 From: Benjamin Ross Sent: Tuesday, August 11, 2015 10:16 AM To: Jerry; Michael Armbrust Cc: user

Re: Questions about SparkSQL join on not equality conditions

2015-08-11 Thread gen tang
Hi, After taking a look at the code, I found the problem: Spark will use broadcastNestedLoopJoin to handle a non-equality condition. One of my dataframes (df1) is created from an existing RDD (LogicalRDD), so it uses defaultSizeInBytes * length to estimate the size. The other dataframe (df2)

Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-11 Thread Akhil Das
Hi, My Spark job (running in local[*] with Spark 1.4.1) reads data from a thrift server (I created an RDD; it computes the partitions in the getPartitions() call, and in compute, hasNext returns records from these partitions). count() and foreach() are working fine and return the correct number of

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Dean Wampler
It should work fine. I have an example script here: https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala (Spark 1.4.X) What does "I am failing to do so" mean? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Haripriya Ayyalasomayajula
Spark evolved as an example framework for Mesos - that's how I know it. It is surprising to see that the options provided by Mesos in this case are fewer. Tweaking the source code - I haven't done it yet, but I would love to see what options could be there! On Tue, Aug 11, 2015 at 5:42 AM, Jerry Lam

RE: Parquet without hadoop: Possible?

2015-08-11 Thread Saif.A.Ellafi
Sorry, I provided bad information. This example worked fine with reduced parallelism. It seems my problem has to do with something specific to the real data frame at reading point. Saif From: saif.a.ell...@wellsfargo.com [mailto:saif.a.ell...@wellsfargo.com] Sent: Tuesday, August 11, 2015

Re: avoid duplicate due to executor failure in spark stream

2015-08-11 Thread Shushant Arora
What if processing is neither idempotent nor in a transaction, say I am posting events to some external server after processing? Is it possible to get the accumulator of a failed task in the retried task? Is there any way to detect whether a task is a retried task or the original task? I was trying to
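One possible way to tell a retried attempt apart from the original (a sketch; it assumes a Spark version where TaskContext.attemptNumber is available, roughly 1.3+, and a hypothetical postToExternalServer call):

    import org.apache.spark.TaskContext

    rdd.foreachPartition { records =>
      val ctx = TaskContext.get()
      if (ctx.attemptNumber() == 0) {
        // first attempt of this task
        records.foreach(postToExternalServer)   // hypothetical side-effecting call
      } else {
        // a retried attempt; decide whether to skip or deduplicate downstream
      }
    }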

unsubscribe

2015-08-11 Thread Michel Robert
Michel Robert Almaden Research Center EDA - IBM Systems and Technology Group Phone: (408) 927-2117 T/L 8-457-2117 E-mail: m...@us.ibm.com

Unsupported major.minor version 51.0

2015-08-11 Thread alexeyy3
I found some discussions online, but they all come down to advice to use JDK 1.7 (or 1.8). Well, I use JDK 1.7 on OS X Yosemite. Both java -version // java version 1.7.0_80 Java(TM) SE Runtime Environment (build 1.7.0_80-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode) and echo

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Jerry Lam
Just out of curiosity, what is the advantage of using parquet without hadoop? Sent from my iPhone On 11 Aug, 2015, at 11:12 am, saif.a.ell...@wellsfargo.com wrote: I confirm that it works, I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 Saif From:

Re: Spark DataFrames uses too many partition

2015-08-11 Thread Silvio Fiorito
You need to configure the spark.sql.shuffle.partitions parameter to a different value. It defaults to 200. On 8/11/15, 11:31 AM, Al M alasdair.mcbr...@gmail.com wrote: I am using DataFrames with Spark 1.4.1. I really like DataFrames but the partitioning makes no sense to me. I am loading
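A minimal sketch of changing it, either on the SQLContext or at submit time (the value 16 is arbitrary):

    // Programmatically, before running the DataFrame job
    sqlContext.setConf("spark.sql.shuffle.partitions", "16")

    // Or at submit time:
    //   spark-submit --conf spark.sql.shuffle.partitions=16 ...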

Re: unsubscribe

2015-08-11 Thread Ted Yu
See first section of http://spark.apache.org/community.html On Tue, Aug 11, 2015 at 9:47 AM, Michel Robert m...@us.ibm.com wrote: Michel Robert Almaden Research Center EDA - IBM Systems and Technology Group Phone: (408) 927-2117 T/L 8-457-2117 E-mail: m...@us.ibm.com

Application failed error

2015-08-11 Thread Anubhav Agarwal
I am running Spark 1.3 on CDH 5.4 stack. I am getting the following error when I spark-submit my application:- 15/08/11 16:03:49 INFO Remoting: Starting remoting 15/08/11 16:03:49 INFO Remoting: Remoting started; listening on addresses

Re: Spark job workflow engine recommendations

2015-08-11 Thread Ruslan Dautkhanov
We use Talend, but not for Spark workflows, although it does have Spark components. https://www.talend.com/download/talend-open-studio It is free (commercial support available), easy to design and deploy workflows. Talend for BigData 6.0 was released a month ago. Is anybody using Talend for

Re: Accessing S3 files with s3n://

2015-08-11 Thread Steve Loughran
On 10 Aug 2015, at 20:17, Akshat Aranya aara...@gmail.commailto:aara...@gmail.com wrote: Hi Jerry, Akhil, Thanks your your help. With s3n, the entire file is downloaded even while just creating the RDD with sqlContext.read.parquet(). It seems like even just opening and closing the

Re: Spark job workflow engine recommendations

2015-08-11 Thread Hien Luu
We are in the middle of figuring that out. At the high level, we want to combine the best parts of existing workflow solutions. On Fri, Aug 7, 2015 at 3:55 PM, Vikram Kone vikramk...@gmail.com wrote: Hien, Is Azkaban being phased out at linkedin as rumored? If so, what's linkedin going to

Does print/event logging affect performance?

2015-08-11 Thread Saif.A.Ellafi
Hi all, silly question. Does logging info messages, either printed or to a file, or event logging, have any impact on the general performance of Spark? Saif

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-11 Thread Philip Weaver
Do you think it might be faster to put all the files in one directory but still partitioned the same way? I don't actually need to filter on the values of the partition keys, but I need to rely on there being no overlap in the values of the keys between any two parquet files. On Fri, Aug 7, 2015 at

Re: Unsupported major.minor version 51.0

2015-08-11 Thread Ritesh Kumar Singh
Can you please mention the output for the following : java -version javac -version

Spark Job Hangs on our production cluster

2015-08-11 Thread java8964
Currently we have an IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42 data/task nodes, which runs BigInsight V3.0.0.2, corresponding to Hadoop 2.2.0 with MR1. Since IBM BigInsight doesn't come with Spark, we built Spark 1.2.2 with Hadoop 2.2.0 + Hive 0.12 ourselves, and

Re: Does print/event logging affect performance?

2015-08-11 Thread Ted Yu
What level of logging are you looking at ? At INFO level, there shouldn't be noticeable difference. On Tue, Aug 11, 2015 at 12:24 PM, saif.a.ell...@wellsfargo.com wrote: Hi all, silly question. Does logging info messages, both print or to file, or event logging, cause any impact to general

Re: ClassNotFound spark streaming

2015-08-11 Thread Mohit Anchlia
I see the following line in the log 15/08/11 17:59:12 ERROR spark.SparkContext: Jar not found at file:/home/ec2-user/./spark-streaming-test-0.0.1-SNAPSHOT-jar-with-dependencies.jar, however I do see that this file exists on all the node in that path. Not sure what's happening here. Please note I

grouping by a partitioned key

2015-08-11 Thread Philip Weaver
If I have an RDD that happens to already be partitioned by a key, how efficient can I expect a groupBy operation to be? I would expect that Spark shouldn't have to move data around between nodes, and simply will have a small amount of work just checking the partitions to discover that it doesn't

adding a custom Scala RDD for use in PySpark

2015-08-11 Thread Eric Walker
Hi, I'm new to Scala, Spark and PySpark and have a question about what approach to take in the problem I'm trying to solve. I have noticed that working with HBase tables read in using `newAPIHadoopRDD` can be quite slow with large data sets when one is interested in only a small subset of the

Re: grouping by a partitioned key

2015-08-11 Thread Eugene Morozov
Philip, If all data per key are inside just one partition, then Spark will figure that out. Can you guarantee that's the case? What is it you are trying to achieve? There might be another way to do it, where you can be 100% sure of what's happening. You can print debugString or explain (for DataFrame) to

Re: What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread Kevin Jung
You should create the key as a tuple type. In your case, RDD[((id, timeStamp), value)] is the proper way to do it. Kevin --- Original Message --- Sender : swetha swethakasire...@gmail.com Date : 2015-08-12 09:37 (GMT+09:00) Title : What is the optimal approach to do Secondary Sort in Spark? Hi,
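A sketch of that composite-key approach combined with repartitionAndSortWithinPartitions, so records land in a partition by id but are ordered by timestamp within it (the type names and partition count are assumptions):

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Partition only on the id part of the composite key
    class IdPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key match {
        case (id: String, _) => math.abs(id.hashCode) % numPartitions
      }
    }

    // events: RDD[((String, Long), V)] keyed by (id, timestamp)
    def secondarySort[V](events: RDD[((String, Long), V)]): RDD[((String, Long), V)] =
      events.repartitionAndSortWithinPartitions(new IdPartitioner(8))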

Re: Exception in spark

2015-08-11 Thread Josh Rosen
Can you share a query or stack trace? More information would make this question easier to answer. On Tue, Aug 11, 2015 at 8:50 PM, Ravisankar Mani rrav...@gmail.com wrote: Hi all, We got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to

Re: Spark job workflow engine recommendations

2015-08-11 Thread Nick Pentreath
I also tend to agree that Azkaban is somewhat easier to get set up. Though I haven't used the new UI for Oozie that is part of CDH, so perhaps that is another good option. It's a pity Azkaban is a little rough in terms of documenting its API, and the scalability is an issue. However it

Re: Exception in spark

2015-08-11 Thread Ravisankar Mani
Hi Josh Please ignore the last mail stack trace. Kindly refer the exception details. {org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'Sheet1.Teams} Regards, Ravi On Wed, Aug 12, 2015 at 1:34 AM, Ravisankar Mani

Re: Boosting spark.yarn.executor.memoryOverhead

2015-08-11 Thread Sandy Ryza
Hi Eric, This is likely because you are putting the parameter after the primary resource (latest_msmtdt_by_gridid_and_source.py), which makes it a parameter to your application instead of a parameter to Spark. -Sandy On Wed, Aug 12, 2015 at 4:40 AM, Eric Bless eric.bl...@yahoo.com.invalid

What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread swetha
Hi, What is the optimal approach to do Secondary sort in Spark? I have to first Sort by an Id in the key and further sort it by timeStamp which is present in the value. Thanks, Swetha -- View this message in context:

Re: Exception in spark

2015-08-11 Thread Ravisankar Mani
Hi Josh, Thanks for your response. Kindly refer to the following query and stack trace. I have checked the same query in Hive; it works perfectly. If I remove IN from the WHERE clause, it works in Spark. SELECT If(ISNOTNULL(SUM(`Sheet1`.`Runs`)),SUM(`Sheet1`.`Runs`),0) AS `Sheet1_Runs`

Re: Not seeing Log messages

2015-08-11 Thread Spark Enthusiast
Forgot to mention. Here is how I run the program :  ./bin/spark-submit --conf spark.app.master=local[1] ~/workspace/spark-python/ApacheLogWebServerAnalysis.py On Wednesday, 12 August 2015 10:28 AM, Spark Enthusiast sparkenthusi...@yahoo.in wrote: I wrote a small python program : def

Re: Partitioning in spark streaming

2015-08-11 Thread ayan guha
Partitioning - by itself - is a property of the RDD, so essentially it is no different in the case of streaming, where each batch is one RDD. You can use partitionBy on the RDD and pass your custom partitioner to it. One thing you should consider is how balanced your partitions are, i.e. your
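A minimal sketch of repartitioning each micro-batch by a tag (the categoryOf extractor, the Event type, and the partition count are all hypothetical):

    import org.apache.spark.HashPartitioner

    // stream: DStream[Event]; key each record by its category and co-locate keys within each batch
    val byCategory = stream.transform { rdd =>
      rdd.map(event => (categoryOf(event), event))   // categoryOf is a hypothetical extractor
         .partitionBy(new HashPartitioner(8))
    }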

RE: Spark DataFrames uses too many partition

2015-08-11 Thread Cheng, Hao
That's a good question; we don't support reading small files in a single partition yet, but it's definitely an issue we need to optimize. Do you mind creating a JIRA issue for this? Hopefully we can merge it in the 1.6 release. 200 is the default partition number for parallel tasks after the

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-11 Thread Hemant Bhanawat
Is the source of your dataframe partitioned on key? As per your mail, it looks like it is not. If that is the case, for partitioning the data, you will have to shuffle the data anyway. Another part of your question is - how to co-group data from two dataframes based on a key? I think for RDD's
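If the sources are not already partitioned on the key, one workaround in the RDD API (a sketch; the column name and partition count are assumptions) is to hash-partition both sides on studentid first, so the join step itself runs on co-partitioned inputs:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    val p = new HashPartitioner(200)

    // Key both sides by studentid and co-partition them before joining
    val left  = student_rdf.rdd.map(r => (r.getAs[String]("studentid"), r)).partitionBy(p)
    val right = student_result_rdf.rdd.map(r => (r.getAs[String]("studentid"), r)).partitionBy(p)

    val joined: RDD[(String, (Row, Row))] = left.join(right)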

Re: grouping by a partitioned key

2015-08-11 Thread Philip Weaver
Thanks. In my particular case, I am calculating a distinct count on a key that is unique to each partition, so I want to calculate the distinct count within each partition, and then sum those. This approach will avoid moving the sets of that key around between nodes, which would be very
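A sketch of that per-partition distinct count, assuming a pair RDD whose keys never span partitions:

    // rdd: RDD[(K, V)] where each key lives entirely in one partition
    val totalDistinct = rdd
      .mapPartitions(iter => Iterator(iter.map(_._1).toSet.size.toLong))
      .reduce(_ + _)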

Error when running SparkPi in Intellij

2015-08-11 Thread canan chen
I import the spark project into intellij, and try to run SparkPi in intellij, but failed with compilation error: Error:scalac: while compiling: /Users/werere/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala during phase: jvm library version:

Re: Partitioning in spark streaming

2015-08-11 Thread Hemant Bhanawat
Posting a comment from my previous mail post: When data is received from a stream source, the receiver creates blocks of data. A new block of data is generated every blockInterval milliseconds. N blocks of data are created during the batchInterval, where N = batchInterval/blockInterval. An RDD is
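A small illustration of that arithmetic with assumed values: a 10-second batch and a 200 ms block interval give roughly 10000 / 200 = 50 blocks, hence about 50 partitions per batch RDD:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("block-interval-demo")
      .setMaster("local[2]")
      .set("spark.streaming.blockInterval", "200ms")   // N = batchInterval / blockInterval blocks per batch

    val ssc = new StreamingContext(conf, Seconds(10))  // ~50 blocks -> ~50 partitions per batch RDD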

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
Thanks for the info. When data is written to HDFS, how does Spark keep the filenames written by multiple executors unique? On Tue, Aug 11, 2015 at 9:35 PM, Hemant Bhanawat hemant9...@gmail.com wrote: Posting a comment from my previous mail post: When data is received from a stream source,

Not seeing Log messages

2015-08-11 Thread Spark Enthusiast
I wrote a small python program : def parseLogs(self): """Read and parse log file""" self._logger.debug("Parselogs() start") self.parsed_logs = (self._sc .textFile(self._logFile) .map(self._parseApacheLogLine) .cache())

Spark 1.4.0 Docker Slave GPU Access

2015-08-11 Thread Nastooh Avessta (navesta)
Hi Trying to access GPU from a Spark 1.4.0 Docker slave, without much luck. In my Spark program, I make a system call to a script, which performs various calculations using GPU. I am able to run this script as standalone, or via Mesos Marathon; however, calling the script through Spark fails

Re: Job is Failing automatically

2015-08-11 Thread Jeff Zhang
15/08/11 12:59:34 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, sdldalplhdw02.suddenlink.cequel3.com): java.lang.NullPointerException at com.suddenlink.pnm.process.HBaseStoreHelper.flush(HBaseStoreHelper.java:313) It's your app error. NPE from HBaseStoreHelper

Re: Spark Job Hangs on our production cluster

2015-08-11 Thread Jeff Zhang
Logs would be helpful to diagnose. Could you attach the logs ? On Wed, Aug 12, 2015 at 5:19 AM, java8964 java8...@hotmail.com wrote: The executor's memory is reset by --executor-memory 24G for spark-shell. The one from the spark-env.sh is just for default setting. I can confirm from the

Exception in spark

2015-08-11 Thread Ravisankar Mani
Hi all, We got an exception like “org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object” when using some where condition queries. I am using 1.4.0 version spark. Is this exception resolved in latest spark? Regards, Ravi

RE: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-11 Thread Cheng, Hao
Definitely worth a try. And you can sort the records before writing them out; then you will get parquet files without overlapping keys. Let us know if that helps. Hao From: Philip Weaver [mailto:philip.wea...@gmail.com] Sent: Wednesday, August 12, 2015 4:05 AM To: Cheng Lian Cc: user
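A sketch of that suggestion (the column name and output path are assumptions): sort on the partition key before writing so each output file covers a disjoint key range:

    // Sort by the partition column, then write partitioned Parquet
    df.sort("key")
      .write
      .partitionBy("key")
      .parquet("/data/out/partitioned")   // hypothetical path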

pregel graphx job not finishing

2015-08-11 Thread dizzy5112
Hi, I'm currently using a Pregel message passing function for my graph in Spark and GraphX. The problem I have is that the code runs perfectly on Spark 1.0 and finishes in a couple of minutes, but as we have upgraded, I'm now trying to run the same code on 1.3 and it doesn't finish (left it overnight