Re: [Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread Ted Yu
trackStateByKey API is in branch-1.6 FYI On Wed, Nov 25, 2015 at 6:03 AM, Todd Nist wrote: > Perhaps the new trackStateByKey targeted for version 1.6 may help you here. > I'm not sure whether it is part of 1.6, as the JIRA does not > specify a fix version. The
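
For reference, the state-timeout mechanism this thread is after shipped in Spark 1.6, where the API was ultimately released as mapWithState rather than trackStateByKey. A minimal Scala sketch of the timeout behavior (the input stream, checkpoint path, and 30-minute timeout are illustrative assumptions, not taken from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming._

    val ssc = new StreamingContext(new SparkConf().setAppName("state-timeout"), Seconds(10))
    ssc.checkpoint("/tmp/checkpoint")   // stateful operations require checkpointing

    // hypothetical keyed stream of word counts
    val keyedStream = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

    val trackingFunc = (word: String, count: Option[Int], state: State[Int]) => {
      if (state.isTimingOut()) {
        // no data arrived for this key within the timeout: the state is dropped after this call
        (word, state.getOption.getOrElse(0))
      } else {
        val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
        state.update(sum)
        (word, sum)
      }
    }

    // keys idle for 30 minutes expire without waiting for a new record with the same key
    val stateStream = keyedStream.mapWithState(
      StateSpec.function(trackingFunc).timeout(Minutes(30)))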

data local read counter

2015-11-25 Thread Patcharee Thongtra
Hi, Is there a counter for data-local reads? I understood it to be the locality level counter, but it seems it is not. Thanks, Patcharee

sc.textFile() does not count lines properly?

2015-11-25 Thread George Sigletos
Hello, I have a text file consisting of 483150 lines (wc -l "my_file.txt"). However when I read it using textFile: %pyspark rdd = sc.textFile("my_file.txt") print rdd.count() it returns 554420 lines. Any idea why this is happening? Is it using a different new line delimiter and how this can be

[ANNOUNCE] CFP open for ApacheCon North America 2016

2015-11-25 Thread Rich Bowen
Community growth starts by talking with those interested in your project. ApacheCon North America is coming, are you? We are delighted to announce that the Call For Presentations (CFP) is now open for ApacheCon North America. You can submit your proposed sessions at

Partial data transfer from one partition to other

2015-11-25 Thread Samarth Rastogi
I have a use case where I partition my data on key 1. But after sometime this key 1 can change to key 2. So now I want all the new data to go to key 2 partition and also the old key 1 data to key 2 partition. Is there a way to copy partial data from one partition to another partition. Samarth

JNI native library problem java.lang.UnsatisfiedLinkError

2015-11-25 Thread Oriol López Massaguer
Hi; I try to use a native library inside Spark. I Oriol.

Re: Spark 1.5.2 JNI native library java.lang.UnsatisfiedLinkError

2015-11-25 Thread Ted Yu
In your spark-env, did you set LD_LIBRARY_PATH ? Cheers On Wed, Nov 25, 2015 at 7:32 AM, Oriol López Massaguer < oriol.lo...@gmail.com> wrote: > Hello; > > I'm trying to use a native library in Spark. > > I was using a simple standalone cluster with one master and worker. > > According to the

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Alex Gittens
Thanks, the issue was indeed the dfs replication factor. To fix it without entirely clearing out HDFS and rebooting, I first ran hdfs dfs -setrep -R -w 1 / to reduce all the current files' replication factor to 1 recursively from the root, then I changed the dfs.replication factor in

Spark 1.5.2 JNI native library java.lang.UnsatisfiedLinkError

2015-11-25 Thread Oriol López Massaguer
Hello; I'm trying to use a native library in Spark. I was using a simple standalone cluster with one master and worker. According to the documentation I edited the spark-defaults.conf by setting: spark.driver.extraClassPath=/opt/eTOX_spark/lib/org.RDKit.jar
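
For reference, a minimal sketch of wiring a native library into both the driver and the executors in Spark 1.5. The paths are the ones quoted in this thread; the configuration keys are standard Spark settings, but the library name "GraphMolWrap" and the forced-load check are assumptions for illustration, not a confirmed fix:

    // spark-defaults.conf (driver-side values must be set before the driver JVM starts,
    // so a programmatic SparkConf.set is not enough for them):
    //   spark.driver.extraClassPath      /opt/eTOX_spark/lib/org.RDKit.jar
    //   spark.executor.extraClassPath    /opt/eTOX_spark/lib/org.RDKit.jar
    //   spark.driver.extraLibraryPath    /opt/eTOX_spark/lib
    //   spark.executor.extraLibraryPath  /opt/eTOX_spark/lib

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("native-lib-check"))

    // Force the native load on each executor so any UnsatisfiedLinkError shows up in the
    // executor logs together with the java.library.path that executor actually used.
    sc.parallelize(1 to sc.defaultParallelism).foreachPartition { _ =>
      System.loadLibrary("GraphMolWrap")                    // assumed RDKit JNI library name
      println(System.getProperty("java.library.path"))
    }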

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Ilya Ganelin
Turning off replication sacrifices durability of your data, so if a node goes down the data is lost - in case that's not obvious. On Wed, Nov 25, 2015 at 8:43 AM Alex Gittens wrote: > Thanks, the issue was indeed the dfs replication factor. To fix it without > entirely

Re: Spark Streaming idempotent writes to HDFS

2015-11-25 Thread Steve Loughran
On 25 Nov 2015, at 07:01, Michael wrote: so basically writing them into a temporary directory named with the batch time and then moving the files to their destination on success? I wish there were a way to skip moving files around and be able to
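
For reference, a minimal Scala sketch of the "write to a temp directory named by batch time, then commit by rename" pattern being discussed (the stream, output paths, and text output format are illustrative assumptions):

    import org.apache.hadoop.fs.{FileSystem, Path}

    stream.foreachRDD { (rdd, batchTime) =>
      val fs   = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
      val tmp  = new Path(s"/data/out/_tmp-${batchTime.milliseconds}")
      val dest = new Path(s"/data/out/batch-${batchTime.milliseconds}")
      if (!fs.exists(dest)) {            // a re-run of an already committed batch becomes a no-op
        rdd.saveAsTextFile(tmp.toString)
        fs.rename(tmp, dest)             // rename is atomic on HDFS, so readers never see partial output
      }
    }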

Re: How to run two operations on the same RDD simultaneously

2015-11-25 Thread Jay Luan
Ah, thank you so much, this is perfect On Fri, Nov 20, 2015 at 3:48 PM, Ali Tajeldin EDU wrote: > You can try to use an Accumulator ( > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulator) > to keep count in map1. Note that the final
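
For reference, a minimal sketch of the accumulator approach suggested here (Spark 1.x API; `rdd` is assumed to be an RDD[String] and the map body is a stand-in). The truncated caveat matters: the value is only complete after an action has run, and task retries can double-count updates made inside transformations:

    val counter = sc.accumulator(0L, "records seen in map1")

    val mapped = rdd.map { line =>
      counter += 1L
      line.toUpperCase          // stand-in for the real map1 logic
    }

    mapped.count()              // run an action first...
    println(s"map1 saw ${counter.value} records")   // ...then read the total on the driver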

Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Hi, I am trying to add the row number to a spark dataframe. This is my dataframe: scala> df.printSchema root |-- line: string (nullable = true) I tried to use df.withColumn but I am getting below exception. scala> df.withColumn("row",rowNumber) org.apache.spark.sql.AnalysisException:

RE: Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Thanks Ted. I have the jar file scala-compiler-2.10.4.jar as well pwd / find ./ -name scala-compiler-2.10.4.jar ./usr/lib/spark/build/zinc-0.3.5.3/lib/scala-compiler-2.10.4.jar ./usr/lib/spark/build/apache-maven-3.3.3/lib/scala-compiler-2.10.4.jar

Re: Building Spark without hive libraries

2015-11-25 Thread Ted Yu
bq. I have to run this as root otherwise build does not progress I build Spark as a non-root user and don't have this problem. I suggest you dig a little to see what was stalling the build when running as a non-root user. On Wed, Nov 25, 2015 at 2:48 PM, Mich Talebzadeh wrote: > Thanks Ted. > > >

RE: Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Looks better now Ted :) [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 39.937 s] [INFO] Spark Project Launcher . SUCCESS [

Re: Adding new column to Dataframe

2015-11-25 Thread Jeff Zhang
>>> I tried to use df.withColumn but I am getting below exception. What is rowNumber here ? UDF ? You can use monotonicallyIncreasingId for generating id >>> Also, is it possible to add a column from one dataframe to another? You can't, because how can you add one dataframe to another if they

Re: Building Spark without hive libraries

2015-11-25 Thread Ted Yu
bq. [error] Required file not found: scala-compiler-2.10.4.jar Can you search for the above jar ? I found two locally: /home/hbase/.ivy2/cache/org.scala-lang/scala-compiler/jars/scala-compiler-2.10.4.jar

Error in block pushing thread puts the KinesisReceiver in a stuck state

2015-11-25 Thread Spark Newbie
Hi Spark users, I have been seeing this issue where receivers enter a "stuck" state after encountering the following exception: "Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out". I am running the application on spark-1.4.1 and using kinesis-asl-1.4.

RE: Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Yep. The user hduser was using the wrong version of maven hduser@rhes564::/usr/lib/spark> build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package > log Using `mvn` from path: /usr/local/apache-maven/apache-maven-3.3.1/bin/mvn [WARNING] Rule 0:

SparkException: Failed to get broadcast_10_piece0

2015-11-25 Thread Spark Newbie
Hi Spark users, I'm seeing the below exceptions once in a while which causes tasks to fail (even after retries, so it is a non recoverable exception I think), hence stage fails and then the job gets aborted. Exception --- java.io.IOException: org.apache.spark.SparkException: Failed to get

Re: SparkException: Failed to get broadcast_10_piece0

2015-11-25 Thread Ted Yu
Which Spark release are you using ? Please take a look at: https://issues.apache.org/jira/browse/SPARK-5594 Cheers On Wed, Nov 25, 2015 at 3:59 PM, Spark Newbie wrote: > Hi Spark users, > > I'm seeing the below exceptions once in a while which causes tasks to fail >

Re: SparkException: Failed to get broadcast_10_piece0

2015-11-25 Thread Spark Newbie
Using Spark-1.4.1 On Wed, Nov 25, 2015 at 4:19 PM, Ted Yu wrote: > Which Spark release are you using ? > > Please take a look at: > https://issues.apache.org/jira/browse/SPARK-5594 > > Cheers > > On Wed, Nov 25, 2015 at 3:59 PM, Spark Newbie >

Re: Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Thanks Jeff, rowNumber is a function in org.apache.spark.sql.functions (link). I will try to use monotonicallyIncreasingId and see if it works. You'd better use join to correlate 2 data frames: Yes,

Re: Adding more slaves to a running cluster

2015-11-25 Thread Andy Davidson
Hi Dillian and Nicholas, If you figure out how to do this please post your recipe. It would be very useful. Andy From: Nicholas Chammas Date: Wednesday, November 25, 2015 at 11:36 AM To: Dillian Murphey , "user @spark"

Obtaining Job Id for query submitted via Spark Thrift Server

2015-11-25 Thread Jagrut Sharma
Is there a way to get the Job Id for a query submitted via the Spark Thrift Server? This would allow checking the status of that specific job via the History Server. Currently, I'm getting status of all jobs, and then filtering the results. Looking for a more efficient approach. Test environment

Re: Issue with spark on hive

2015-11-25 Thread rugalcrimson
Not yet, but I found that it works well in Spark 1.4.1.

Re: Adding new column to Dataframe

2015-11-25 Thread Vishnu Viswanath
Thanks Ted, It looks like I cannot use row_number then. I tried to run a sample window function and got the below error: org.apache.spark.sql.AnalysisException: Could not resolve window function 'avg'. Note that, using window functions currently requires a HiveContext; On Wed, Nov 25, 2015 at 8:28

send this email to unsubscribe

2015-11-25 Thread ngocan211 .

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread Tathagata Das
What do you mean by killing the streaming job using the UI? Do you mean that you are clicking the "kill" link on the Jobs page in the Spark UI? Also, in the application, is the main thread waiting on streamingContext.awaitTermination()? That is designed to catch exceptions in the running job and throw it

Re: Adding new column to Dataframe

2015-11-25 Thread Ted Yu
Vishnu: rowNumber (deprecated, replaced with row_number) is a window function. * Window function: returns a sequential number starting at 1 within a window partition. * * @group window_funcs * @since 1.6.0 */ def row_number(): Column = withExpr {
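
For reference, a short Scala sketch pulling the thread's two suggestions together for Spark 1.5, where this discussion takes place (the ordering column "line" comes from the schema shown earlier; the rest is illustrative):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{monotonicallyIncreasingId, rowNumber}

    // Option 1: unique but not consecutive ids; works with a plain SQLContext.
    val withId = df.withColumn("id", monotonicallyIncreasingId())

    // Option 2: a true 1,2,3,... row number. In Spark 1.5 window functions need a
    // HiveContext, and rowNumber() was renamed row_number() in 1.6. Note that a window
    // with no partitionBy pulls all rows into a single partition.
    val w = Window.orderBy("line")
    val withRowNum = df.withColumn("row", rowNumber().over(w))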

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread Kay-Uwe Moosheimer
Tested with Spark 1.5.2 … Works perfectly when the exit code is non-zero, and does not restart when the exit code equals zero. From: Prem Sure Date: Wednesday, 25 November 2015 19:57 To: SRK Cc: Subject: Re: Automatic driver

Queue in Spark standalone mode

2015-11-25 Thread sunil m
Hi! I am using Spark 1.5.1 and am pretty new to Spark... Like YARN, is there a way to configure queues in Spark standalone mode? If yes, can someone point me to good documentation / a reference? Sometimes I get strange behavior while running multiple jobs simultaneously. Thanks in advance.

Re: sc.textFile() does not count lines properly?

2015-11-25 Thread George Sigletos
Found the problem. Control-M characters. Please ignore the post On Wed, Nov 25, 2015 at 6:06 PM, George Sigletos wrote: > Hello, > > I have a text file consisting of 483150 lines (wc -l "my_file.txt"). > > However when I read it using textFile: > > %pyspark > rdd =

Re: Adding more slaves to a running cluster

2015-11-25 Thread Dillian Murphey
It appears start-slave.sh works on a running cluster. I'm surprised I can't find more info on this. Maybe I'm not looking hard enough? Using AWS and spot instances is far more efficient, which calls for dynamically adding more nodes while the cluster is up, yet everything I've

Re: Adding more slaves to a running cluster

2015-11-25 Thread Nicholas Chammas
spark-ec2 does not directly support adding instances to an existing cluster, apart from the special case of adding slaves to a cluster with a master but no slaves. There is an open issue to track adding this support, SPARK-2008, but it doesn't

Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread SRK
Hi, I am submitting my Spark job with the supervise option as shown below. When I kill the driver and the app from the UI, the driver does not restart automatically. This is in cluster mode. Any suggestion on how to make automatic driver restart work would be of great help. --supervise Thanks,

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread Prem Sure
I think automatic driver restart will happen if the driver fails with a non-zero exit code. --deploy-mode cluster --supervise On Wed, Nov 25, 2015 at 1:46 PM, SRK wrote: > Hi, > > I am submitting my Spark job with the supervise option as shown below. When I > kill the

Re: Building Spark without hive libraries

2015-11-25 Thread Ted Yu
Take a look at install_zinc() in build/mvn Cheers On Wed, Nov 25, 2015 at 1:30 PM, Mich Talebzadeh wrote: > Hi, > > > > I am trying to build Spark from the source and not using Hive. I am > getting > > > > [error] Required file not found: scala-compiler-2.10.4.jar > >

Re: Queue in Spark standalone mode

2015-11-25 Thread Prem Sure
In Spark standalone mode, submitted applications run in FIFO (first-in-first-out) order. Please elaborate on the "strange behavior while running multiple jobs simultaneously." On Wed, Nov 25, 2015 at 2:29 PM, sunil m <260885smanik...@gmail.com> wrote: > Hi! > > I am using Spark 1.5.1 and pretty new
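
For reference, standalone mode has no YARN-style queues; a common workaround is to cap each application's share of the cluster so several can run at once instead of queuing FIFO behind one app that grabs every core. A minimal sketch (the values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("capped-app")
      .set("spark.cores.max", "4")          // don't claim every core on the standalone cluster
      .set("spark.executor.memory", "2g")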

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread swetha kasireddy
I am killing my Streaming job using the UI. What error code does the UI provide if the job is killed from there? On Wed, Nov 25, 2015 at 11:01 AM, Kay-Uwe Moosheimer wrote: > Tested with Spark 1.5.2 … Works perfectly when the exit code is non-zero. > And does not restart with exit code

Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Hi, I am trying to build Spark from the source and not using Hive. I am getting [error] Required file not found: scala-compiler-2.10.4.jar [error] See zinc -help for information about locating necessary files I have to run this as root otherwise the build does not progress. Any help is

Spark- Cassandra Connector Error

2015-11-25 Thread ahlusar
Hello, I receive the following error when I attempt to connect to a Cassandra keyspace and table: WARN NettyUtil: "Found Netty's native epoll transport, but not running on linux-based operating system. Using NIO instead" The full details and log can be viewed here:

Re: How to gracefully shutdown a Spark Streaming Job?

2015-11-25 Thread Ted Yu
You can utilize this method from StreamingContext : def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit = { Cheers On Wed, Nov 25, 2015 at 1:16 PM, SRK wrote: > Hi, > > How to gracefully shutdown a Spark Streaming Job? Currently I just kill it > from
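
For reference, a minimal sketch of using the method quoted above from application code, so the job drains in-flight batches instead of being killed from the UI (`ssc` and the stop trigger are assumptions):

    ssc.start()
    // ... block until some external signal says it is time to stop, e.g. a marker file appears ...
    ssc.stop(stopSparkContext = true, stopGracefully = true)   // finish received batches, then exit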

Re: UDF with 2 arguments

2015-11-25 Thread Davies Liu
It works in master (1.6); which version of Spark do you have? >>> from pyspark.sql.functions import udf >>> def f(a, b): pass ... >>> my_udf = udf(f) >>> from pyspark.sql.types import * >>> my_udf = udf(f, IntegerType()) On Wed, Nov 25, 2015 at 12:01 PM, Daniel Lopes

Re: Spark : merging object with approximation

2015-11-25 Thread Karl Higley
Hi, What merge behavior do you want when A~=B, B~=C but A!=C? Should the merge emit ABC? AB and BC? Something else? Best, Karl On Sat, Nov 21, 2015 at 5:24 AM OcterA wrote: > Hello, > > I have a set of X data (around 30M entry), I have to do a batch to merge > data which

Re: Spark REST Job server feedback?

2015-11-25 Thread Ruslan Dautkhanov
Very good question. From http://gethue.com/new-notebook-application-for-spark-sql/ "Use Livy Spark Job Server from the Hue master repository instead of CDH (it is currently much more advanced): see build & start the latest Livy

Re: Data in one partition after reduceByKey

2015-11-25 Thread Ruslan Dautkhanov
public long getTime() Returns the number of milliseconds since January 1, 1970, 00:00:00 GMT represented by this Date object. http://docs.oracle.com/javase/7/docs/api/java/util/Date.html#getTime%28%29 Based on what you did, it might be easier to derive a date partitioner from that. Also, to get even
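
For reference, a minimal sketch of a date-based partitioner built on Date.getTime(), along the lines suggested here (the class name, day-sized buckets, and partition count are illustrative):

    import java.util.Date
    import org.apache.spark.Partitioner

    class DayPartitioner(override val numPartitions: Int) extends Partitioner {
      private val msPerDay = 24L * 60 * 60 * 1000
      override def getPartition(key: Any): Int = {
        val day = key.asInstanceOf[Date].getTime / msPerDay
        (day % numPartitions).toInt.abs       // same calendar day -> same partition
      }
    }

    // usage, assuming an RDD[(Date, Long)]:
    // pairRdd.reduceByKey(new DayPartitioner(48), _ + _)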

Re: Spark 1.5.2 JNI native library java.lang.UnsatisfiedLinkError

2015-11-25 Thread Oriol López Massaguer
LD_LIBRARY_PATH points to /opt/eTOX_spark/lib/, the location of the native libraries. Oriol. 2015-11-25 17:39 GMT+01:00 Ted Yu : > In your spark-env, did you set LD_LIBRARY_PATH ? > > Cheers > > On Wed, Nov 25, 2015 at 7:32 AM, Oriol López Massaguer < >

Fwd: pyspark: Error when training a GMM with an initial GaussianMixtureModel

2015-11-25 Thread Guillaume Maze
Hi all, We're trying to train a Gaussian Mixture Model (GMM) with a specified initial model. Doc 1.5.1 says we should use a GaussianMixtureModel object as input for the "initialModel" parameter to the GaussianMixture.train method. Before creating our own initial model (the plan is to use a Kmean

Re: Is it relevant to use BinaryClassificationMetrics.aucROC / aucPR with LogisticRegressionModel ?

2015-11-25 Thread filthysocks
jmvllt wrote > Here, because the predicted class will always be 0 or 1, there is no way > to vary the threshold to get the aucROC, right? Or am I totally wrong? No, you are right. If you pass a (Score, Label) tuple to BinaryClassificationMetrics, then Score has to be the class probability.

[Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread diplomatic Guru
Hello, I know how I could clear old state depending on the input value: if some condition determines that the state is old, then returning null will invalidate the record. But this is only feasible if a new record arrives that matches the old key. What if no new data arrives

Re: Is it relevant to use BinaryClassificationMetrics.aucROC / aucPR with LogisticRegressionModel ?

2015-11-25 Thread jmvllt
Hi filthysocks, Thanks for the answer. Indeed, using the clearThreshold() function solved my problem :). Regards, Jean.

Re: [Spark Streaming] How to clear old data from Stream State?

2015-11-25 Thread Todd Nist
Perhaps the new trackStateByKey targeted for version 1.6 may help you here. I'm not sure whether it is part of 1.6, as the JIRA does not specify a fix version. The JIRA describing it is here: https://issues.apache.org/jira/browse/SPARK-2629, and the design doc that discusses the API

Spark, Windows 7 python shell non-reachable ip address

2015-11-25 Thread Shuo Wang
After running these two lines in the Quick Start example in spark's python shell on windows 7. >>> textFile = sc.textFile("README.md") >>> textFile.count() I am getting the following error: >>> textFile.count() 15/11/25 19:57:01 WARN : Your hostname, oh_t-PC resolves to a

Spark Driver Port Details

2015-11-25 Thread aman solanki
Hi, Can anyone tell me how I can find out which port a particular Spark application is running on? For example: I have two applications, A and B. A is running on 4040, B is running on 4041. How can I get this application-to-port mapping? Is there a REST call or environment

Re: Spark Driver Port Details

2015-11-25 Thread Todd Nist
The default is to start applications with port 4040 and then increment by 1 for each additional application, as you are seeing; see the docs here: http://spark.apache.org/docs/latest/monitoring.html#web-interfaces You can override this behavior by passing --conf spark.ui.port=4080, or in your code; something like
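
For reference, the in-code equivalent of the --conf flag mentioned here is a single SparkConf setting (the app name and port value are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("application-A")
      .set("spark.ui.port", "4080")   // pin this app's UI instead of letting it probe 4040, 4041, ...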

log4j custom appender ClassNotFoundException with spark 1.5.2

2015-11-25 Thread lev
Hi, I'm using spark 1.5.2, running on a YARN cluster, and trying to use a custom log4j appender. In my setup there are 3 jars: the uber jar (spark.yarn.jar=uber-jar.jar), the jar that contains the main class (main.jar), and an additional jar with dependencies (dep.jar, passed with the --jars flag to

JavaPairRDD.treeAggregate

2015-11-25 Thread amit tewari
Hi, does someone have experience/knowledge of using JavaPairRDD.treeAggregate? Even sample code would be helpful. Not many articles etc. are available on the web. Thanks Amit
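
For reference, a minimal sketch of treeAggregate on the Scala API; the JavaPairRDD version mirrors it, taking Function2 arguments for the seqOp and combOp. Here it computes a (sum, count) pair, merging partition results in a two-level tree instead of sending them all straight to the driver (the input data is illustrative):

    val data = sc.parallelize(1 to 1000000).map(_.toDouble)

    val (sum, count) = data.treeAggregate((0.0, 0L))(
      (acc, x) => (acc._1 + x, acc._2 + 1),      // seqOp: fold one value into a partition-local accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2),    // combOp: merge two accumulators, tree-wise
      depth = 2                                  // the default; raise it for very many partitions
    )
    println(s"mean = ${sum / count}")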