Re: Filter on Grouped Data
Why don't you apply the filter first, and then group the data and run the aggregations? On Jul 3, 2015 1:29 PM, Megha Sridhar - Cynepia megha.sridh...@cynepia.com wrote: Hi, I have a Spark DataFrame object, which when trimmed, looks like:

From                 To                                Subject            Message-ID
karen@xyz.com        ['vance.me...@enron.com',         SEC Inquiry        19952575.1075858
                      'jeannie.mandel...@enron.com',
                      'mary.cl...@enron.com',
                      'sarah.pal...@enron.com']
elyn.hug...@xyz.com  ['dennis.ve...@enron.com',        Revised documents  33499184.1075858
                      'gina.tay...@enron.com',
                      'kelly.kimbe...@enron.com']
...

I have run a groupBy(From) on the above DataFrame and obtained a GroupedData object as a result. I need to apply a filter on the grouped data (for instance, getting the sender who sent the maximum number of mails that were addressed to a particular receiver in the To list). Is there a way to accomplish this by applying a filter on grouped data? Thanks, Megha
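A minimal sketch of the filter-then-aggregate approach (the DataFrame name df and the receiver address are placeholders; assumes To is an array-of-strings column):

import org.apache.spark.sql.functions._

val receiver = "vance.me...@enron.com"                        // hypothetical receiver
val inTo = udf((to: Seq[String]) => to != null && to.contains(receiver))
// filter first, then group and count, then take the top sender
val topSender = df.filter(inTo(col("To")))
  .groupBy("From")
  .count()
  .orderBy(desc("count"))
  .limit(1)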
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
I have this problem with a job: a random executor gets this ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM, almost always at the same point in the processing of the data. I am processing 1 million files with sc.wholeTextFiles. At around the 600,000th file, a container receives this signal. On the driver I get:

15/07/03 14:20:11 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. cruncher03.stratified:44617
15/07/03 14:20:11 ERROR cluster.YarnClusterScheduler: Lost executor 3 on cruncher03.stratified: remote Rpc client disassociated
15/07/03 14:20:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@cruncher03.stratified:44617] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/07/03 14:20:11 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. cruncher03.stratified:44617
15/07/03 14:20:11 INFO scheduler.TaskSetManager: Re-queueing tasks for 3 from TaskSet 5.0

There is plenty of memory on the machine and in the container JVM, so I don't think it is an OOM kill (after all, that would be a SIGKILL) or an OutOfMemoryError (there is no out-of-memory exception). What can be causing this?
Re: Streaming: updating broadcast variables
You cannot update a broadcasted variable; the change won't get reflected on the workers. On Jul 3, 2015 12:18 PM, James Cole ja...@binarism.net wrote: Hi all, I'm filtering a DStream using a function. I need to be able to change this function while the application is running (I'm polling a service to see if a user has changed their filtering). The filter function is a transformation and runs on the workers, so that's where the updates need to go. I'm not sure of the best way to do this. Initially broadcasting seemed like the way to go: the filter is actually quite large. But I don't think I can update something I've broadcasted. I've tried unpersisting and re-creating the broadcast variable, but it became obvious this wasn't updating the reference on the worker. So am I correct in thinking I can't use broadcasted variables for this purpose? The next option seems to be: stopping the JavaStreamingContext, creating a new one from the SparkContext, updating the filter function, and re-creating the DStreams (I'm using direct streams from Kafka). If I re-created the JavaStreamingContext, would the accumulators (which are created from the SparkContext) keep working? (Obviously I'm going to try this soon.) In summary: 1) Can broadcasted variables be updated? 2) Is there a better way than re-creating the JavaStreamingContext and DStreams? Thanks, James
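A commonly used workaround (not from a reply in this thread) is to re-create the broadcast on the driver inside transform(), which runs once per batch on the driver. A hedged sketch, with loadFilter() and filterChanged() as assumed helpers:

var filterBc = ssc.sparkContext.broadcast(loadFilter())     // initial filter predicate

val filtered = stream.transform { rdd =>
  if (filterChanged()) {                                    // assumed poll of the service
    filterBc.unpersist()                                    // drop stale copies on executors
    filterBc = rdd.sparkContext.broadcast(loadFilter())     // re-broadcast the new filter
  }
  val f = filterBc                                          // stable reference for the task closure
  rdd.filter(event => f.value(event))                       // loadFilter() returns a predicate
}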
Re: Optimizations
It is the basic design of Spark that it runs actions in different stages... not sure you can achieve what you're looking for. On Jul 3, 2015 12:43 PM, Marius Danciu marius.dan...@gmail.com wrote: Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) it looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage? ... such that each result partition after the join is passed to the mapPartitionToPair function, all running in the same stage without any other costs. Best, Marius
Re: Kryo fails to serialise output
Kryo serialization is used internally by Spark for spilling or shuffling intermediate results, not for writing out an RDD as an action. Look at Sandy Ryza's examples for some hints on how to do this: https://github.com/sryza/simplesparkavroapp Regards, Will On July 3, 2015, at 2:45 AM, Dominik Hübner cont...@dhuebner.com wrote: I have a rather simple Avro schema to serialize Tweets (message, username, timestamp). Kryo and twitter chill are used to do so. For my dev environment the Spark context is configured as below:

val conf: SparkConf = new SparkConf()
conf.setAppName("kryo_test")
conf.setMaster("local[4]")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "co.feeb.TweetRegistrator")

Serialization is set up with:

override def registerClasses(kryo: Kryo): Unit = {
  kryo.register(classOf[Tweet], AvroSerializer.SpecificRecordBinarySerializer[Tweet])
}

(This method gets called.) Using this configuration to persist some objects fails with java.io.NotSerializableException: co.feeb.avro.Tweet (which seems to be OK as this class is not Serializable). I used the following code:

val ctx: SparkContext = new SparkContext(conf)
val tweets: RDD[Tweet] = ctx.parallelize(List(
  new Tweet("a", "b", 1L),
  new Tweet("c", "d", 2L),
  new Tweet("e", "f", 3L)
))
tweets.saveAsObjectFile("file:///tmp/spark")

Using saveAsTextFile works, but the persisted files are not binary but JSON:

cat /tmp/spark/part-0
{"username": "a", "text": "b", "timestamp": 1}
{"username": "c", "text": "d", "timestamp": 2}
{"username": "e", "text": "f", "timestamp": 3}

Is this intended behaviour, a configuration issue, Avro serialisation not working in local mode, or something else?
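As Will says, saveAsObjectFile always goes through Java serialization. An untested sketch of one workaround (not from the thread): Kryo-serialize each record to bytes yourself with chill and save the byte arrays, which are Serializable. The Avro serializer may still need registering on this Kryo instance as well:

import com.esotericsoftware.kryo.io.Output
import com.twitter.chill.ScalaKryoInstantiator

val inst = new ScalaKryoInstantiator                  // serializable factory, safe to ship to tasks
val asBytes = tweets.mapPartitions { iter =>
  val kryo = inst.newKryo()                           // one Kryo instance per partition
  iter.map { t =>
    val out = new Output(4096, -1)                    // buffer grows as needed
    kryo.writeObject(out, t)
    out.toBytes
  }
}
asBytes.saveAsObjectFile("file:///tmp/spark-bytes")   // Array[Byte] is Serializable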
Spark-csv into labeled points with null values
Hello all, I am learning Scala Spark and going through some applications with data I have. Please allow me to ask a couple of questions: spark-csv: The data I have isn't malformed, but there are empty values in some rows, properly comma-separated and not caught by DROPMALFORMED mode. These values are taken into account as null values. My final mission is to create a LabeledPoint vector for MLlib, so my steps are: a. load the csv b. cast column types to have a proper DataFrame schema c. apply map() to create a LabeledPoint with a denseVector, using map(Row => Row.getDouble(col_index)). To this point:

res173: org.apache.spark.mllib.regression.LabeledPoint = (-1.530132691E9,[162.89431,13.55811,18.3346818,-1.6653182])

When running the following code:

val model = new LogisticRegressionWithLBFGS().
  setNumClasses(2).
  setValidateData(true).
  run(data_map)

java.lang.RuntimeException: Failed to check null bit for primitive double value.

Debugging this, I am pretty sure this is because of rows that look like: -2.593849123898,392.293891 Any suggestions to get around this? Saif
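A minimal sketch of one way around this (not from a reply in this digest; df and the column positions are placeholders): drop or fill the null rows before building the LabeledPoints, so getDouble never hits a null bit:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val clean = df.na.drop()                         // or df.na.fill(0.0) to impute instead
val data_map = clean.rdd.map { row =>
  LabeledPoint(row.getDouble(0),                 // label assumed in column 0
    Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3), row.getDouble(4)))
}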
Re: Spark performance issue
It'll help to see the code, or at least to understand what transformations you're using. Also, you have 15 nodes but are not using all of them, which means you may be losing data locality. You can see this in the Spark job UI if any tasks are not node- or process-local. From: diplomatic Guru Date: Friday, July 3, 2015 at 8:58 AM To: user@spark.apache.org Subject: Spark performance issue Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation, and writes the results into HDFS. I've implemented the same logic in a Spark job. When I tried both jobs on the same datasets, I got different execution times, which is expected. BUT... in my example, the MapReduce job performs much better than Spark. The difference is that I'm not changing much in the MR job configuration, e.g., memory, cores, etc., but this is not the case with Spark as it's very flexible. So I'm sure my configuration isn't correct, which is why MR is outperforming Spark, but I need your advice. For example: Test 1: 4.5GB data - the MR job took ~55 seconds to compute, but Spark took ~3 minutes and 20 seconds. Test 2: 25GB data - MR took 2 minutes and 15 seconds, whereas the Spark job is still running, and it's already been 15 minutes. I have a cluster of 15 nodes. The maximum memory that I could allocate to each executor is 6GB. Therefore, for Test 1, this is the config I used: --executor-memory 6G --num-executors 4 --driver-memory 6G --executor-cores 2 (I also set spark.storage.memoryFraction to 0.3). For Test 2: --executor-memory 6G --num-executors 10 --driver-memory 6G --executor-cores 2 (I also set spark.storage.memoryFraction to 0.3). I tried all possible combinations but couldn't get better performance. Any suggestions will be much appreciated.
Re: thrift-server does not load jars files (Azure HDInsight)
Alternatively, setting spark.driver.extraClassPath should work. Cheers On Fri, Jul 3, 2015 at 2:59 AM, Steve Loughran ste...@hortonworks.com wrote: On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'm trying to start the thrift-server, passing it Azure's blob storage jars, but I'm failing on:

Caused by: java.io.IOException: No FileSystem for scheme: wasb
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
... 16 more

If I start the spark-shell the same way, everything works fine. spark-shell command:

./bin/spark-shell --master yarn --jars /home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar --num-executors 4

thrift-server command:

./sbin/start-thriftserver.sh --master yarn --jars /home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar --num-executors 4

How can I pass dependency jars to the thrift server? Thanks, Daniel You should be able to add the JARs to the environment variable SPARK_SUBMIT_CLASSPATH or SPARK_CLASSPATH and have them picked up when bin/compute-classpath.{cmd,sh} builds up the classpath
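An untested sketch of Ted's suggestion, using the same jar paths as above (executors may need spark.executor.extraClassPath as well):

./sbin/start-thriftserver.sh --master yarn \
  --conf spark.driver.extraClassPath=/home/hdiuser/azureclass/azure-storage-1.2.0.jar:/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar \
  --num-executors 4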
Spark performance issue
Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation, and writes the results into HDFS. I've implemented the same logic in a Spark job. When I tried both jobs on the same datasets, I got different execution times, which is expected. BUT... in my example, the MapReduce job performs much better than Spark. The difference is that I'm not changing much in the MR job configuration, e.g., memory, cores, etc., but this is not the case with Spark as it's very flexible. So I'm sure my configuration isn't correct, which is why MR is outperforming Spark, but I need your advice. For example: Test 1: 4.5GB data - the MR job took ~55 seconds to compute, but Spark took ~3 minutes and 20 seconds. Test 2: 25GB data - MR took 2 minutes and 15 seconds, whereas the Spark job is still running, and it's already been 15 minutes. I have a cluster of 15 nodes. The maximum memory that I could allocate to each executor is 6GB. Therefore, for Test 1, this is the config I used: --executor-memory 6G --num-executors 4 --driver-memory 6G --executor-cores 2 (I also set spark.storage.memoryFraction to 0.3). For Test 2: --executor-memory 6G --num-executors 10 --driver-memory 6G --executor-cores 2 (I also set spark.storage.memoryFraction to 0.3). I tried all possible combinations but couldn't get better performance. Any suggestions will be much appreciated.
Re: Spark Streaming broadcast to all keys
updateStateByKey will run for all keys, whether they have new data in a batch or not, so you should still be able to use it. On 7/3/15, 7:34 AM, micvog mich...@micvog.com wrote: updateStateByKey is useful, but what if I want to perform an operation on all existing keys (not only the ones in this RDD)? Word count, for example - is there a way to decrease *all* words seen so far by 1? I was thinking of keeping a static class per node with the count information and issuing a broadcast command to take a certain action, but could not find a broadcast-to-all-nodes functionality or a better way. Thanks, Michael - Michael Vogiatzis @mvogiatzis
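A hedged sketch of that idea (words is an assumed DStream[String]; updateStateByKey needs checkpointing enabled): the update function is invoked for every tracked key each batch, so a global decrement can live there:

val updateFn = (newValues: Seq[Int], state: Option[Int]) => {
  val next = state.getOrElse(0) + newValues.sum - 1   // decrease every key by 1 per batch
  if (next <= 0) None else Some(next)                 // returning None drops the key
}
val counts = words.map(w => (w, 1)).updateStateByKey[Int](updateFn)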
Float type coercion on SparkR with hiveContext
Hello, I've run into a problem with float type coercion on SparkR with hiveContext:

result <- sql(hiveContext, "SELECT offset, percentage from data limit 100")
show(result)
DataFrame[offset:float, percentage:float]
head(result)
Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame

A problem like this already exists (SPARK-2863 - Emulate Hive type coercion in native reimplementations of Hive functions) with the same root cause - the native reimplementations of Hive are incomplete, and not only for functions. So, has anybody run into this issue before? And how can I test it more precisely, if it's not actually a bug?
SparkSQL cache table with multiple replicas
Hi all, do you know if there is an option to specify how many replicas we want when caching a table in memory in the SparkSQL Thrift server? I have not seen any option so far, but I assumed there would be one, as the Storage section of the UI shows that there is 1 x replica of your DataFrame/table... I believe there can be a good use case where you want to replicate a dimension table across your nodes to improve response times when running typical BI/DWH types of queries (just to avoid having to broadcast the data time and again). Do you think that would be a good addition to SparkSQL? Regards.
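Programmatically (outside the Thrift server's CACHE TABLE path), a replica count can be requested through the storage level - a sketch, with the table name as a placeholder:

import org.apache.spark.storage.StorageLevel

// the _2 storage levels keep two in-memory replicas of each cached block
sqlContext.table("dim_table").persist(StorageLevel.MEMORY_ONLY_2)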
Re: duplicate names in sql allowed?
https://issues.apache.org/jira/browse/SPARK-8817 On Fri, Jul 3, 2015 at 11:43 AM, Koert Kuipers ko...@tresata.com wrote: i see the relaxation to allow duplicate field names was done on purpose, since some data sources can have dupes due to case insensitive resolution. apparently the issue is now dealt with during query analysis. although this might work for sql it does not seem a good thing for DataFrame to me. it seems desirable that a DataFrame should have unique column names. not having this guarantee will complicate building other DSLs on top of DataFrame (this is how i ran into this issue). its also counterintuitive... do R dataframes and pandas allow dupes in column names? On Jul 3, 2015 3:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote: I think you can open up a jira, not sure if this PR https://github.com/apache/spark/pull/2209/files (SPARK-2890 https://issues.apache.org/jira/browse/SPARK-2890) broke the validation piece. Thanks Best Regards On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote: i am surprised this is allowed...

scala> sqlContext.sql("select name as boo, score as boo from candidates").schema
res7: org.apache.spark.sql.types.StructType = StructType(StructField(boo,StringType,true), StructField(boo,IntegerType,true))

should StructType check for duplicate field names?
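For DSLs built on top of DataFrame, a defensive check along the lines Koert describes might look like this (a sketch, not from the thread):

val names = df.schema.fieldNames
require(names.distinct.length == names.length,
  s"duplicate column names: ${names.diff(names.distinct).distinct.mkString(", ")}")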
Experience with centralised logging for Spark?
Hi all, I'm wondering if anybody has any experience with centralised logging for Spark - or has even felt that there was a need for this, given the WebUI. At my organization we use Log4j2 and Flume as the front end of our centralised logging system. I was looking into modifying Spark to use that system, and I'm reconsidering my approach, so I thought I'd ask the community to see what people have tried. Log4j2 is important because it works nicely with Flume. The problem I've got is that all of the Spark processes (master, worker, spark-submit) use the same conf directory and so would get the same log4j2.xml. This means that they would try to use the same directory for the file channel (which will fail because Flume locks its directory). Secondly, if I want to add an interceptor to stamp every event with the component name, then I cannot tell the difference between the components - everything would get 'apache-spark'. This could be fixed by modifying the start-up scripts to pass the component name around, but that's more modification than I really want to make. So, are people generally happy with the WebUI approach for getting access to stderr and stdout, or have other people rolled better solutions? Yes, I'm aware of https://issues.apache.org/jira/browse/SPARK-6305 and the associated pull request. Many thanks, in advance, for your thoughts. Cheers, Edward
Re: build spark 1.4 source code for sparkR with maven
You need to add -Psparkr to build the SparkR code. Shivaram On Fri, Jul 3, 2015 at 2:14 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you try: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package Thanks Best Regards On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com 1106944...@qq.com wrote: Hi all, has anyone built the Spark 1.4 source code for SparkR with Maven/sbt? What's the command? Using SparkR requires building from source for version 1.4. Thank you -- 1106944...@qq.com
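Combining the two replies, the full build command would presumably be:

build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Psparkr -DskipTests clean package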
Re: duplicate names in sql allowed?
i see the relaxation to allow duplicate field names was done on purpose, since some data sources can have dupes due to case insensitive resolution. apparently the issue is now dealt with during query analysis. although this might work for sql it does not seem a good thing for DataFrame to me. it seems desirable that a DataFrame should have unique column names. not having this guarantee will complicate building other DSLs on top of DataFrame (this is how i ran into this issue). its also counterintuitive... do R dataframes and pandas allow dupes in column names? On Jul 3, 2015 3:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote: I think you can open up a jira, not sure if this PR https://github.com/apache/spark/pull/2209/files (SPARK-2890 https://issues.apache.org/jira/browse/SPARK-2890) broke the validation piece. Thanks Best Regards On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote: i am surprised this is allowed...

scala> sqlContext.sql("select name as boo, score as boo from candidates").schema
res7: org.apache.spark.sql.types.StructType = StructType(StructField(boo,StringType,true), StructField(boo,IntegerType,true))

should StructType check for duplicate field names?
Re: Spark SQL groupby timestamp
@bastien, in those situations, I prefer to use Unix timestamps (millisecond or second granularity) because you can apply math operations to them easily. If you don't have a Unix timestamp, you can use unix_timestamp() from Hive SQL to get one with second granularity. Then grouping by hour becomes very simple:

select 3600*floor(timestamp/3600) as timestamp, count(error) as errors
from logs
group by 3600*floor(timestamp/3600)

Hope this helps. /Sim
Re: Optimizations
Thanks for your feedback. Yes, I am aware of the stage design, and Silvio, what you are describing is essentially a map-side join, which is not applicable when both RDDs are quite large. It appears that in rdd.join(...).mapToPair(f), f is piggybacked inside the join stage (right in the reducers, I believe), whereas in rdd.join(...).mapPartitionToPair(f), f is executed in a different stage. This is surprising because, at least intuitively, the difference between mapToPair and mapPartitionToPair is that the former is about the push model whereas the latter is about polling records out of the iterator (*I suspect there are other technical reasons*). If anyone knows the depths of the problem, it would be of great help. Best, Marius On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito silvio.fior...@granturing.com wrote: One thing you could do is a broadcast join. You take your smaller RDD and save it as a broadcast variable. Then run a map operation to perform the join and whatever else you need to do. This will remove a shuffle stage, but you will still have to collect the joined RDD and broadcast it. It all depends on the size of your data whether it's worth it or not. From: Marius Danciu Date: Friday, July 3, 2015 at 3:13 AM To: user Subject: Optimizations Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) it looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage? ... such that each result partition after the join is passed to the mapPartitionToPair function, all running in the same stage without any other costs. Best, Marius
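For reference, a minimal sketch of the broadcast join Silvio describes (only viable when one side fits in memory; the RDD names are placeholders):

val lookup = sc.broadcast(smallRdd.collectAsMap())   // collect the small side to the driver

val joined = bigRdd.mapPartitions { iter =>
  val m = lookup.value
  iter.flatMap { case (k, v) =>
    m.get(k).map(w => (k, (v, w)))                   // inner-join semantics, no shuffle
  }
}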
Continuous errors trying to start spark-shell
Hello, I am trying to just start spark-shell... it starts, the prompt appears, and then a (literally) never-ending stream of these log lines follows. What is it trying to do? Why is it failing? To start it I do:

$ docker run -it ncssm/spark-base /spark/bin/spark-shell --master spark://devzero.cs.georgetown.edu:7077

The log lines look like:

15/07/04 00:36:36 INFO Worker: Asked to launch executor app-20150704003631-0004/45 for Spark shell
15/07/04 00:36:36 INFO ExecutorRunner: Launch command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /spark/sbin/../conf/:/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/spark/lib/datanucleus-api-jdo-3.2.6.jar:/spark/lib/datanucleus-rdbms-3.2.9.jar:/spark/lib/datanucleus-core-3.2.10.jar -Xms512M -Xmx512M -Dspark.driver.port=50266 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@172.17.0.45:50266/user/CoarseGrainedScheduler --executor-id 45 --hostname 10.212.55.41 --cores 16 --app-id app-20150704003631-0004 --worker-url akka.tcp://sparkWorker@10.212.55.41:7078/user/Worker
15/07/04 00:36:39 INFO Worker: Executor app-20150704003631-0004/45 finished with state EXITED message Command exited with code 1 exitStatus 1

On an example worker node, I see a corresponding unending stream of errors:

15/07/04 00:36:31 INFO Worker: Asked to launch executor app-20150704003631-0004/7 for Spark shell
15/07/04 00:36:31 INFO ExecutorRunner: Launch command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /spark/sbin/../conf/:/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar:/spark/lib/datanucleus-api-jdo-3.2.6.jar:/spark/lib/datanucleus-rdbms-3.2.9.jar:/spark/lib/datanucleus-core-3.2.10.jar -Xms512M -Xmx512M -Dspark.driver.port=50266 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@172.17.0.45:50266/user/CoarseGrainedScheduler --executor-id 7 --hostname 10.212.55.41 --cores 16 --app-id app-20150704003631-0004 --worker-url akka.tcp://sparkWorker@10.212.55.41:7078/user/Worker
15/07/04 00:36:36 INFO Worker: Executor app-20150704003631-0004/7 finished with state EXITED message Command exited with code 1 exitStatus 1

Thanks, Mohamed.
Are Spark Streaming RDDs always processed in order?
I'm writing a Spark Streaming application that uses RabbitMQ to consume events. One feature of RabbitMQ that I intend to make use of is bulk ack of messages, i.e. no need to ack one-by-one, but only ack the last event in a batch and that would ack the entire batch. Before I commit to doing so, I'd like to know if Spark Streaming always processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is finished? This is crucial to the ack logic, since if RDD2 can potentially be processed while RDD1 is still being processed, then if I ack the last event in RDD2, that would also ack all events in RDD1, even though they may not have been completely processed yet.
Re: SparkR and Spark Mlib
No. SparkR is a language binding for Spark; MLlib is the machine learning project built on top of Spark core. On 4 Jul 2015 12:23, praveen S mylogi...@gmail.com wrote: Hi, are SparkR and Spark MLlib the same?
SparkR and Spark Mlib
Hi, are SparkR and Spark MLlib the same?
Re: How to timeout a task?
Ted, thanks very much for your reply. It took me almost a week, but I have finally had a chance to implement what you noted, and it appears to work locally. However, when I launch this onto a cluster on EC2, it doesn't work reliably. To expand: I think the issue is that some of the code we have holds the Python GIL, and hence no internal timeout will work. That is why I was hoping to learn of a task-level timeout - something at the Spark (management) level - such that Spark can decide a task has taken too long and just kill it and move on. Does this make sense? Are you familiar with any such options? Best, - Bill On Sat, Jun 27, 2015 at 9:26 AM, Ted Yu yuzhih...@gmail.com wrote: Have you looked at: http://stackoverflow.com/questions/2281850/timeout-function-if-it-takes-too-long-to-finish FYI On Sat, Jun 27, 2015 at 8:33 AM, wasauce wferr...@gmail.com wrote: Hello! We use pyspark to run a set of data extractors (think regex). The extractors (regexes) generally run quite quickly and find a few matches, which are returned and stored into a database. My question is: is it possible to make the function that runs the extractors have a timeout? I.e., if for a given file the extractor runs for more than X seconds, it terminates and returns a default value? Here is a code snippet of what we are doing, with some comments as to which function I am looking to timeout. code: https://gist.github.com/wasauce/42a956a1371a2b564918 Thank you - Bill
Re: Are Spark Streaming RDDs always processed in order?
I don't think you can expect any ordering guarantee except among the records within one partition. On Jul 4, 2015 7:43 AM, khaledh khal...@gmail.com wrote: I'm writing a Spark Streaming application that uses RabbitMQ to consume events. One feature of RabbitMQ that I intend to make use of is bulk ack of messages, i.e. no need to ack one-by-one, but only ack the last event in a batch and that would ack the entire batch. Before I commit to doing so, I'd like to know if Spark Streaming always processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is finished? This is crucial to the ack logic, since if RDD2 can potentially be processed while RDD1 is still being processed, then if I ack the last event in RDD2, that would also ack all events in RDD1, even though they may not have been completely processed yet.
Re: Spark 1.4 MLLib Bug?: Multiclass Classification requirement failed: sizeInBytes was negative
How many partitions do you have? It might be that one partition is too large and there is integer overflow. Could you double your number of partitions? Burak On Fri, Jul 3, 2015 at 4:41 AM, Danny kont...@dannylinden.de wrote: hi, i want to run a multiclass classification with 390 classes on 120k label points (tf-idf vectors), but i get the following exception. If i reduce the number of classes to ~20, everything works fine. How can i fix this? i use the LogisticRegressionWithLBFGS class for my classification on an 8-node cluster with total-executor-cores = 30, executor-memory = 20g. My exception:

15/07/02 15:55:00 INFO DAGScheduler: Job 11 finished: count at LBFGS.scala:170, took 0,521823 s
15/07/02 15:55:02 INFO MemoryStore: ensureFreeSpace(-1069858488) called with curMem=308280107, maxMem=3699737
15/07/02 15:55:02 INFO MemoryStore: Block broadcast_22 stored as values in memory (estimated size -1069858488.0 B, free 11.1 GB)
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -1069858488
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:812)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:993)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
at org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:215)
at org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:204)
at breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
at breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:108)
at breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:101)
at breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:146)
at org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:178)
at org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:117)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:282)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:205)
at com.test.spark.SVMSimpleAppEC2$.createNaiveBayesModel(SVMSimpleAppEC2.scala:150)
at com.test.spark.SVMSimpleAppEC2$.main(SVMSimpleAppEC2.scala:48)
at com.test.spark.SVMSimpleAppEC2.main(SVMSimpleAppEC2.scala)
... 6 more
15/07/02 15:55:02 INFO SparkContext: Invoking stop() from shutdown hook
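A sketch of Burak's suggestion (the RDD name is a placeholder):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// double the partition count so no single serialized block crosses the 2 GB Int limit
val repartitioned = trainingData.repartition(trainingData.partitions.length * 2)
val model = new LogisticRegressionWithLBFGS().setNumClasses(390).run(repartitioned)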
Re: Starting Spark without automatically starting HiveContext
With the binary I think it might not be possible, although if you download the sources and build them yourself, you can remove this function https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023 which initializes the SQLContext. Thanks Best Regards On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I've downloaded the pre-built binaries for Hadoop 2.6, and whenever I start the spark-shell it always starts with a HiveContext. How can I disable the HiveContext from being initialized automatically? Thanks, Daniel
Re: Starting Spark without automatically starting HiveContext
The main reasons are Spark's startup time and the need to configure a component I don't really need (without configs the HiveContext takes more time to load). Thanks, Daniel On 3 July 2015, at 11:13, Robin East robin.e...@xense.co.uk wrote: As Akhil mentioned, there isn't AFAIK any kind of initialisation option to stop the SQLContext being created. If you could articulate why you would need to do this (it's not obvious to me what the benefit would be), then maybe this is something that could be included as a feature in a future release. It may also suggest a way to a workaround. On 3 Jul 2015, at 08:33, Daniel Haviv daniel.ha...@veracity-group.com wrote: Thanks, I was looking for a less hack-ish way :) Daniel On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com wrote: With the binary I think it might not be possible, although if you download the sources and build them yourself, you can remove the function which initializes the SQLContext. Thanks Best Regards On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I've downloaded the pre-built binaries for Hadoop 2.6, and whenever I start the spark-shell it always starts with a HiveContext. How can I disable the HiveContext from being initialized automatically? Thanks, Daniel
Optimizations
Hi all, If I have something like: rdd.join(...).mapPartitionToPair(...) it looks like mapPartitionToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage? ... such that each result partition after the join is passed to the mapPartitionToPair function, all running in the same stage without any other costs. Best, Marius
Re: duplicate names in sql allowed?
I think you can open up a jira, not sure if this PR https://github.com/apache/spark/pull/2209/files (SPARK-2890 https://issues.apache.org/jira/browse/SPARK-2890) broke the validation piece. Thanks Best Regards On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote: i am surprised this is allowed...

scala> sqlContext.sql("select name as boo, score as boo from candidates").schema
res7: org.apache.spark.sql.types.StructType = StructType(StructField(boo,StringType,true), StructField(boo,IntegerType,true))

should StructType check for duplicate field names?
Filter on Grouped Data
Hi, I have a Spark DataFrame object, which when trimmed, looks like:

From                 To                                Subject            Message-ID
karen@xyz.com        ['vance.me...@enron.com',         SEC Inquiry        19952575.1075858
                      'jeannie.mandel...@enron.com',
                      'mary.cl...@enron.com',
                      'sarah.pal...@enron.com']
elyn.hug...@xyz.com  ['dennis.ve...@enron.com',        Revised documents  33499184.1075858
                      'gina.tay...@enron.com',
                      'kelly.kimbe...@enron.com']
...

I have run a groupBy(From) on the above DataFrame and obtained a GroupedData object as a result. I need to apply a filter on the grouped data (for instance, getting the sender who sent the maximum number of mails that were addressed to a particular receiver in the To list). Is there a way to accomplish this by applying a filter on grouped data? Thanks, Megha
Re: Starting Spark without automatically starting HiveContext
HiveContext should be a superset of SQLContext, so you should be able to perform all your tasks. Are you facing any problem with HiveContext? On 3 Jul 2015 17:33, Daniel Haviv daniel.ha...@veracity-group.com wrote: Thanks, I was looking for a less hack-ish way :) Daniel On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com wrote: With the binary I think it might not be possible, although if you download the sources and build them yourself, you can remove this function https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023 which initializes the SQLContext. Thanks Best Regards On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I've downloaded the pre-built binaries for Hadoop 2.6, and whenever I start the spark-shell it always starts with a HiveContext. How can I disable the HiveContext from being initialized automatically? Thanks, Daniel
build spark 1.4 source code for sparkR with maven
Hi all, has anyone built the Spark 1.4 source code for SparkR with Maven/sbt? What's the command? Using SparkR requires building from source for version 1.4. Thank you 1106944...@qq.com
[spark1.4] sparkContext.stop causes exception on Mesos
Hello Spark developers, after upgrading to Spark 1.4 on Mesos 0.22.1, existing code started to throw this exception when calling sparkContext.stop:

(SparkListenerBus) [ERROR - org.apache.spark.Logging$class.logError(Logging.scala:96)] Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:146)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:146)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:146)
at org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:190)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1215)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:730)
at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1855)
at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1816)
at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
... 16 more
I0701 15:03:46.101809 1612 sched.cpp:1589] Asked to stop the driver
I0701 15:03:46.101971 1355 sched.cpp:831] Stopping framework '20150629-132734-1224736778-5050-6126-0028'

This problem happens only when the spark.eventLog.enabled flag is set to true. It also happens if sparkContext.stop is omitted in the code - I think because Spark shuts down the Spark context indirectly. Does anyone know what could cause this problem? Thanks, Ayoub.
Re: Starting Spark without automatically starting HiveContext
Thanks, I was looking for a less hack-ish way :) Daniel On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com wrote: With the binary I think it might not be possible, although if you download the sources and build them yourself, you can remove this function https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023 which initializes the SQLContext. Thanks Best Regards On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I've downloaded the pre-built binaries for Hadoop 2.6, and whenever I start the spark-shell it always starts with a HiveContext. How can I disable the HiveContext from being initialized automatically? Thanks, Daniel
Spark 1.4 MLLib Bug?: Multiclass Classification requirement failed: sizeInBytes was negative
hi, i want to run a multiclass classification with 390 classes on 120k label points (tf-idf vectors), but i get the following exception. If i reduce the number of classes to ~20, everything works fine. How can i fix this? i use the LogisticRegressionWithLBFGS class for my classification on an 8-node cluster with total-executor-cores = 30, executor-memory = 20g. My exception:

15/07/02 15:55:00 INFO DAGScheduler: Job 11 finished: count at LBFGS.scala:170, took 0,521823 s
15/07/02 15:55:02 INFO MemoryStore: ensureFreeSpace(-1069858488) called with curMem=308280107, maxMem=3699737
15/07/02 15:55:02 INFO MemoryStore: Block broadcast_22 stored as values in memory (estimated size -1069858488.0 B, free 11.1 GB)
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -1069858488
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:812)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:635)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:993)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
at org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:215)
at org.apache.spark.mllib.optimization.LBFGS$CostFun.calculate(LBFGS.scala:204)
at breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
at breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:108)
at breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:101)
at breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:146)
at org.apache.spark.mllib.optimization.LBFGS$.runLBFGS(LBFGS.scala:178)
at org.apache.spark.mllib.optimization.LBFGS.optimize(LBFGS.scala:117)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:282)
at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:205)
at com.test.spark.SVMSimpleAppEC2$.createNaiveBayesModel(SVMSimpleAppEC2.scala:150)
at com.test.spark.SVMSimpleAppEC2$.main(SVMSimpleAppEC2.scala:48)
at com.test.spark.SVMSimpleAppEC2.main(SVMSimpleAppEC2.scala)
... 6 more
15/07/02 15:55:02 INFO SparkContext: Invoking stop() from shutdown hook
Kryo fails to serialise output
I have a rather simple Avro schema to serialize Tweets (message, username, timestamp). Kryo and twitter chill are used to do so. For my dev environment the Spark context is configured as below:

val conf: SparkConf = new SparkConf()
conf.setAppName("kryo_test")
conf.setMaster("local[4]")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "co.feeb.TweetRegistrator")

Serialization is set up with:

override def registerClasses(kryo: Kryo): Unit = {
  kryo.register(classOf[Tweet], AvroSerializer.SpecificRecordBinarySerializer[Tweet])
}

(This method gets called.) Using this configuration to persist some objects fails with java.io.NotSerializableException: co.feeb.avro.Tweet (which seems to be OK as this class is not Serializable). I used the following code:

val ctx: SparkContext = new SparkContext(conf)
val tweets: RDD[Tweet] = ctx.parallelize(List(
  new Tweet("a", "b", 1L),
  new Tweet("c", "d", 2L),
  new Tweet("e", "f", 3L)
))
tweets.saveAsObjectFile("file:///tmp/spark")

Using saveAsTextFile works, but the persisted files are not binary but JSON:

cat /tmp/spark/part-0
{"username": "a", "text": "b", "timestamp": 1}
{"username": "c", "text": "d", "timestamp": 2}
{"username": "e", "text": "f", "timestamp": 3}

Is this intended behaviour, a configuration issue, Avro serialisation not working in local mode, or something else?
Re: Spark Dataframe 1.4 (GroupBy partial match)
Hi Salih, thanks for the links :) This seems very promising to me. When do you think this would be available in the Spark codeline? Thanks, Suraj On Fri, Jul 3, 2015 at 2:02 AM, Salih Oztop soz...@yahoo.com wrote: Hi Suraj, it seems your requirement is Record Linkage/Entity Resolution. https://en.wikipedia.org/wiki/Record_linkage http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf A presentation from Spark Summit using GraphX: https://spark-summit.org/east-2015/talk/distributed-graph-based-entity-resolution-using-spark Kind Regards Salih Oztop 07856128843 http://www.linkedin.com/in/salihoztop From: Suraj Shetiya surajshet...@gmail.com To: Michael Armbrust mich...@databricks.com Cc: Salih Oztop soz...@yahoo.com; user@spark.apache.org; megha.sridh...@cynepia.com Sent: Thursday, July 2, 2015 10:47 AM Subject: Re: Spark Dataframe 1.4 (GroupBy partial match) Hi Michael, thanks for the quick response. This sounds like something that would work. However, rethinking the problem statement and various other use cases, which are growing, there are more such scenarios where one could have columns with structured and unstructured data embedded (JSON, XML, or other kinds of collections); it may make sense to allow probabilistic groupBy operations where the user can get the same functionality in one step instead of two. Your thoughts on whether that makes sense? -Suraj Forwarded message: From: Michael Armbrust mich...@databricks.com Date: Jul 2, 2015 12:49 AM Subject: Re: Spark Dataframe 1.4 (GroupBy partial match) To: Suraj Shetiya surajshet...@gmail.com Cc: Salih Oztop soz...@yahoo.com, user@spark.apache.org You should probably write a UDF that uses regular expressions or other string munging to canonicalize the subject and then group on that derived column. On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya surajshet...@gmail.com wrote: Thanks Salih. :) The output of the groupBy is as below:

2015-01-14  SEC Inquiry
2015-01-16  Re: SEC Inquiry
2015-01-18  Fwd: Re: SEC Inquiry

And subsequently, we would like to aggregate all messages with a particular reference subject. For instance, the question we are trying to answer could be: get the count of messages with a particular subject. Looking forward to any suggestion from you. On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop soz...@yahoo.com wrote: Hi Suraj, what will be your output after the group by? Since groupBy is for aggregations like sum, count, etc. If you want to count the 2015 records, then it is possible. Kind Regards Salih Oztop From: Suraj Shetiya surajshet...@gmail.com To: user@spark.apache.org Sent: Tuesday, June 30, 2015 3:05 PM Subject: Spark Dataframe 1.4 (GroupBy partial match) I have a dataset (trimmed and simplified) with 2 columns as below:

Date        Subject
2015-01-14  SEC Inquiry
2014-02-12  Happy birthday
2014-02-13  Re: Happy birthday
2015-01-16  Re: SEC Inquiry
2015-01-18  Fwd: Re: SEC Inquiry

I have imported the same into a Spark DataFrame. What I am looking at is a groupBy on the Subject field (however, I need a partial match to identify the discussion topic). For example, in the above case I would like to group all messages whose subject contains "SEC Inquiry", which returns the following grouped frame:

2015-01-14  SEC Inquiry
2015-01-16  Re: SEC Inquiry
2015-01-18  Fwd: Re: SEC Inquiry

Another use case for a similar problem could be grouping by year (in the above example); it would mean a partial match on the Date field, grouping by matching the year as 2014 or 2015. Keenly looking forward to a reply/solution to the above. - Suraj -- Regards, Suraj
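A rough sketch of the UDF approach Michael suggests (the DataFrame name and column names are assumed; the regex only handles Re:/Fwd: chains, not arbitrary subject edits):

import org.apache.spark.sql.functions._

val canonical = udf((s: String) =>
  if (s == null) null
  else s.replaceAll("(?i)^((re|fwd?):\\s*)+", "").trim)   // strip leading Re:/Fwd: prefixes
val grouped = df.withColumn("topic", canonical(col("Subject")))
  .groupBy("topic")
  .count()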
Streaming: updating broadcast variables
Hi all, I'm filtering a DStream using a function. I need to be able to change this function while the application is running (I'm polling a service to see if a user has changed their filtering). The filter function is a transformation and runs on the workers, so that's where the updates need to go. I'm not sure of the best way to do this. Initially broadcasting seemed like the way to go: the filter is actually quite large. But I don't think I can update something I've broadcasted. I've tried unpersisting and re-creating the broadcast variable but it became obvious this wasn't updating the reference on the worker. So am I correct in thinking I can't use broadcasted variables for this purpose? The next option seems to be: stopping the JavaStreamingContext, creating a new one from the SparkContext, updating the filter function, and re-creating the DStreams (I'm using direct streams from Kafka). If I re-created the JavaStreamingContext would the accumulators (which are created from the SparkContext) keep working? (Obviously I'm going to try this soon) In summary: 1) Can broadcasted variables be updated? 2) Is there a better way than re-creating the JavaStreamingContext and DStreams? Thanks, James
Re: Accessing the console from spark
In the driver, when running spark-submit with --master yarn-client. On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com wrote: Where does it return null? Within the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, we have an application that requires a username/password to be entered from the command line. To mask a password in Java you need to use System.console.readPassword; however, when running with Spark, System.console returns null?? Any ideas on how to get the console from Spark? Thanks, Jem
Re: thrift-server does not load jars files (Azure HDInsight)
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'm trying to start the thrift-server, passing it Azure's blob storage jars, but I'm failing on:

Caused by: java.io.IOException: No FileSystem for scheme: wasb
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:342)
... 16 more

If I start the spark-shell the same way, everything works fine. spark-shell command:

./bin/spark-shell --master yarn --jars /home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar --num-executors 4

thrift-server command:

./sbin/start-thriftserver.sh --master yarn --jars /home/hdiuser/azureclass/azure-storage-1.2.0.jar,/home/hdiuser/azureclass/hadoop-azure-2.7.0.jar --num-executors 4

How can I pass dependency jars to the thrift server? Thanks, Daniel You should be able to add the JARs to the environment variable SPARK_SUBMIT_CLASSPATH or SPARK_CLASSPATH and have them picked up when bin/compute-classpath.{cmd,sh} builds up the classpath
Re: Accessing the console from spark
Can you paste the code? Something is missing Thanks Best Regards On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote: In the driver when running spark-submit with --master yarn-client On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com wrote: Where does it returns null? Within the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, We have an application that requires a username/password to be entered from the command line. To screen a password in java you need to use System.console.readPassword however when running with spark System.console returns null?? Any ideas on how to get the console from spark? Thanks, Jem
Multiple Join Conditions in dataframe join
Hi, I need to join with multiple conditions. Can anyone tell me how to specify that? E.g., this is what I am trying to do:

val Lead_all = Leads.
  | join(Utm_Master, Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") == Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"), "left")

When I do this I get the error: too many arguments for method apply. Thanks Bipin
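No reply appears in this digest; for reference, one way to express a multi-column join in 1.4 is to combine Column conditions with && (a sketch, assuming both frames share these column names):

val Lead_all = Leads.join(Utm_Master,
  Leads("LeadSource") === Utm_Master("LeadSource") &&
  Leads("Utm_Source") === Utm_Master("Utm_Source") &&
  Leads("Utm_Medium") === Utm_Master("Utm_Medium") &&
  Leads("Utm_Campaign") === Utm_Master("Utm_Campaign"),
  "left_outer")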
Accessing the console from spark
Hi, we have an application that requires a username/password to be entered from the command line. To mask a password in Java you need to use System.console.readPassword; however, when running with Spark, System.console returns null?? Any ideas on how to get the console from Spark? Thanks, Jem
Re: build spark 1.4 source code for sparkR with maven
Did you try: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package Thanks Best Regards On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com 1106944...@qq.com wrote: Hi all, has anyone built the Spark 1.4 source code for SparkR with Maven/sbt? What's the command? Using SparkR requires building from source for version 1.4. Thank you -- 1106944...@qq.com
Re: Accessing the console from spark
Where does it return null? Within the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, we have an application that requires a username/password to be entered from the command line. To mask a password in Java you need to use System.console.readPassword; however, when running with Spark, System.console returns null?? Any ideas on how to get the console from Spark? Thanks, Jem
Re: Accessing the console from spark
I have shown two scenarios below:

// setup spark context
val user = readLine("username: ")
val pass = System.console.readPassword("password: ")   // <- null pointer exception here

and

// setup spark context
val user = readLine("username: ")
val console = System.console                           // <- null pointer exception here
val pass = console.readPassword("password: ")

thanks, Jem On Fri, Jul 3, 2015 at 11:04 AM Akhil Das ak...@sigmoidanalytics.com wrote: Can you paste the code? Something is missing. Thanks Best Regards On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote: In the driver, when running spark-submit with --master yarn-client. On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com wrote: Where does it return null? Within the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, we have an application that requires a username/password to be entered from the command line. To mask a password in Java you need to use System.console.readPassword; however, when running with Spark, System.console returns null?? Any ideas on how to get the console from Spark? Thanks, Jem
Spark Streaming broadcast to all keys
updateStateByKey is useful, but what if I want to perform an operation on all existing keys (not only the ones in this RDD)? Word count, for example - is there a way to decrease *all* words seen so far by 1? I was thinking of keeping a static class per node with the count information and issuing a broadcast command to take a certain action, but could not find a broadcast-to-all-nodes functionality or a better way. Thanks, Michael - Michael Vogiatzis @mvogiatzis