Re: Writing to multiple outputs in Spark
@Reynold Xin: not really: it only works for Parquet (see partitionBy: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter), it requires you to have a DataFrame in the first place (for my use case the Spark SQL interface to Avro records is more of a hindrance than a help, since I want to use generated Java classes rather than treat Avro records as generic tables via Rows), and even if I do have a DataFrame, if I need to map or mapPartitions I lose that interface and have to create a new DataFrame from an RDD[Row], which isn't very convenient or efficient.

Has anyone been able to take a look at my gist: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462? The first 100 lines provide a base class for multiple-outputs formats; then see line 269 for an example of how to use such an OutputFormat: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462#file-multipleoutputs-scala-L269

@Alex Angelini, the code there would support your use case without modifying Spark (it uses saveAsNewAPIHadoopFile and a multiple-outputs wrapper format). @Nicholas Chammas, I'll post a link to my gist on that ticket.

On Fri, 14 Aug 2015 at 21:10 Reynold Xin r...@databricks.com wrote:
This is already supported with the new partitioned data sources in DataFrame/SQL right?

On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini alex.angel...@shopify.com wrote:
Speaking about Shopify's deployment, this would be a really nice feature to have. We would like to write data to folders with the structure `year/month/day` but have had to hold off on that because of the lack of support for MultipleOutputs.

On Fri, Aug 14, 2015 at 10:56 AM, Silas Davis si...@silasdavis.net wrote:
Would it be right to assume that the silence on this topic implies others don't really have this issue/desire?

On Sat, 18 Jul 2015 at 17:24 Silas Davis si...@silasdavis.net wrote:
*tl;dr: hadoop and cascading provide ways of writing tuples to multiple output files based on key, but the plain RDD interface doesn't seem to, and it should.*

I have been looking into ways to write to multiple outputs in Spark. It seems like a feature that is somewhat missing from Spark. The idea is to partition output and write the elements of an RDD to different locations based on the key. For example, in a pair RDD your key may be (language, date, userId) and you would like to write separate files to $someBasePath/$language/$date. Then there would be a version of saveAsHadoopDataset that would be able to write to multiple locations based on the key using the underlying OutputFormat. Perhaps it would take a pair RDD with keys ($partitionKey, $realKey), so for example ((language, date), userId).

The prior art I have found on this is the following.

Using SparkSQL: the 'partitionBy' method of DataFrameWriter: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
This only works for Parquet at the moment.

Using Spark/Hadoop: this pull request (using the hadoop1 API): https://github.com/apache/spark/pull/4895/files. This uses MultipleTextOutputFormat (which in turn uses MultipleOutputFormat), which is part of the old hadoop1 API.
It only works for text, but could be generalised for any underlying OutputFormat by using MultipleOutputFormat (though only for hadoop1, which doesn't support ParquetAvroOutputFormat, for example).

This gist (with the hadoop2 API): https://gist.github.com/mlehman/df9546f6be2e362bbad2
This uses MultipleOutputs (available for both the old and new hadoop APIs) and extends saveAsNewHadoopDataset to support multiple outputs. It should work for any underlying OutputFormat. It would probably be better implemented by extending saveAs[NewAPI]HadoopDataset.

In Cascading: Cascading provides PartitionTap: http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html to do this.

So my questions are: is there a reason why Spark doesn't provide this? Does Spark provide similar functionality through some other mechanism? How would it be best implemented?

Since I started composing this message I've had a go at writing a wrapper OutputFormat that writes multiple outputs using hadoop MultipleOutputs but doesn't require modification of PairRDDFunctions. The principle is similar, however. Again it feels slightly hacky to use dummy fields for the ReduceContextImpl, but some of this may be part of the impedance mismatch between Spark and plain Hadoop... Here is my attempt: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462

I'd like to see this functionality in Spark somehow, but invite suggestions on how best to achieve it.

Thanks,
Silas
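[For reference, a minimal sketch of the hadoop1-era approach discussed above, i.e. routing records to per-key directories via MultipleTextOutputFormat and saveAsHadoopFile. This is not the gist's implementation; the class, key layout, and paths are illustrative only.]

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.rdd.RDD

// Writes each value under a subdirectory derived from its key, e.g. a key of
// "en/2015-08-14" ends up in $basePath/en/2015-08-14/part-NNNNN.
class KeyBasedTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Use the key as the relative output path; 'name' is the usual part-NNNNN leaf name.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
  // Suppress the key so only the value is written to each output line.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

// Usage sketch: keys encode the partition path (e.g. s"$language/$date"), values are the payload.
def writeByKey(rdd: RDD[(String, String)], basePath: String): Unit =
  rdd.saveAsHadoopFile(basePath, classOf[String], classOf[String], classOf[KeyBasedTextOutputFormat])

[Returning NullWritable from generateActualKey is the usual trick to keep the key out of the written lines, and it shares the text-only, hadoop1-only limitations noted above.]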
[survey] [spark-ec2] What do you like/dislike about spark-ec2?
Howdy folks! I’m interested in hearing about what people think of spark-ec2 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the formal JIRA process. Your answers will all be anonymous and public. If the embedded form below doesn’t work for you, you can use this link to get the same survey: http://goo.gl/forms/erct2s6KRR Cheers! Nick
Spark Job Hangs on our production cluster
I am comparing the Spark logs line by line between the hanging case (big dataset) and the non-hanging case (small dataset). In the hanging case, the log looks identical to the non-hanging case for reading the first block of data from HDFS. But after that, starting from line 438 in spark-hang.log, I only see log messages generated by the Worker for the next 10 minutes, like the following:

15/08/14 14:24:19 DEBUG Worker: [actor] received message SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
15/08/14 14:24:19 DEBUG Worker: [actor] handled message (0.121965 ms) SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
...
15/08/14 14:33:04 DEBUG Worker: [actor] received message SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
15/08/14 14:33:04 DEBUG Worker: [actor] handled message (0.136146 ms) SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]

This continues until, after almost 10 minutes, I have to kill the job; I know it will hang forever. But in the good log (spark-finished.log), starting from line 361 where Spark starts to read the 2nd split of data, I can see all the debug messages from BlockReaderLocal and BlockManager.

Comparing the logs of these 2 cases: in the good log, from line 478, I can see this message:

15/08/14 14:37:09 DEBUG BlockReaderLocal: putting FileInputStream for ..

But in the hanging case, while reading the 2nd split of data, I don't see this message any more (it existed for the 1st split). I believe this log message should show up in this case, as the block of the 2nd split also exists on this Spark node; just before it, I can see the following debug messages:

15/08/14 14:24:11 DEBUG BlockReaderLocal: Created BlockReaderLocal for file /services/contact2/data/contacts/20150814004805-part-r-2.avro block BP-834217708-10.20.95.130-1438701195738:blk_1074484553_1099531839081 in datanode 10.20.95.146:50010
15/08/14 14:24:11 DEBUG Project: Creating MutableProj: WrappedArray(), inputSchema: ArrayBuffer(account_id#0L, contact_id#1, sequence_id#2, state#3, name#4, kind#5, prefix_name#6, first_name#7, middle_name#8, company_name#9, job_title#10, source_name#11, source_details#12, provider_name#13, provider_details#14, created_at#15L, create_source#16, updated_at#17L, update_source#18, accessed_at#19L, deleted_at#20L, delta#21, birthday_day#22, birthday_month#23, anniversary#24L, contact_fields#25, related_contacts#26, contact_channels#27, contact_notes#28, contact_service_addresses#29, contact_street_addresses#30), codegen:false

These logs are generated on node 10.20.95.146, and Spark created a BlockReaderLocal to read the data from the local node. Now my question is: can someone give me any idea why the "DEBUG BlockReaderLocal: putting FileInputStream for" message doesn't show up any more in this case? I have attached the log files again in this email, and I really hope I can get some help from this list.

Thanks
Yong

From: java8...@hotmail.com
To: u...@spark.apache.org
Subject: RE: Spark Job Hangs on our production cluster
Date: Fri, 14 Aug 2015 15:14:10 -0400

I still want to check whether anyone can provide any help related to Spark 1.2.2 hanging on our production cluster when reading big HDFS data (7800 avro blocks), while it looks fine for small data (769 avro blocks). I enabled the debug level in the Spark log4j configuration, and attached the log file in case it helps with troubleshooting.
Summary of our cluster:
IBM BigInsight V3.0.0.2 (running with Hadoop 2.2.0 + Hive 0.12)
42 data nodes, each running an HDFS data node process + task tracker + Spark worker
One master, running the HDFS Name Node + Spark master
Another master node, running the secondary Name Node + JobTracker

I ran 2 test cases, using a very simple spark-shell script to read 2 folders: one is big data with 1T of avro files; the other is small data with 160G of avro files. The avro schemas of the 2 folders are different, but I don't think that makes any difference here. The test script is like the following:

import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import com.databricks.spark.avro._
val testdata = sqlContext.avroFile("hdfs://namenode:9000/bigdata_folder")   // vs sqlContext.avroFile("hdfs://namenode:9000/smalldata_folder")
testdata.registerTempTable("testdata")
testdata.count()

Both cases are kicked off with the same command:

/opt/spark/bin/spark-shell --jars /opt/ibm/cclib/spark-avro.jar --conf spark.ui.port=4042 --executor-memory 24G --total-executor-cores 42 --conf spark.storage.memoryFraction=0.1 --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000

When the script points to the small data folder, Spark finishes very fast. Each task scanning an HDFS block finishes within 30 seconds or less. When the script points to the big data folder, most of the nodes can finish scanning the first block of HDFS within 2 mins (longer than case 1), then the scanning will
Re: [ANNOUNCE] Nightly maven and package builds for Spark
This should be fixed now. I just triggered a manual build and the latest binaries are at http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-1.5.0-SNAPSHOT-2015_08_17_00_36-3ff81ad-bin/

Thanks
Shivaram

On Mon, Aug 17, 2015 at 12:26 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
thx for this, let me know if you need help

2015-08-16 23:38 GMT+02:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu:
I just investigated this and it is happening because of a Maven version requirement not being met. I'll look at modifying the build scripts to use Maven 3.3.3 (with build/mvn --force?)
Shivaram

On Sun, Aug 16, 2015 at 10:16 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
Hi Patrick, is there any way for the nightly build to include common distributions, like with/without hive/yarn support, hadoop 2.4, 2.6, etc.? For now it seems that the nightly binary package builds actually ship only the source? I can help on that too if you want. Regards, Olivier.

2015-08-02 5:19 GMT+02:00 Bharath Ravi Kumar reachb...@gmail.com:
Thanks for fixing it.

On Sun, Aug 2, 2015 at 3:17 AM, Patrick Wendell pwend...@gmail.com wrote:
Hey All, I got it up and running - it was a newly surfaced bug in the build scripts. - Patrick

On Wed, Jul 29, 2015 at 6:05 AM, Bharath Ravi Kumar reachb...@gmail.com wrote:
Hey Patrick, any update on this front please? Thanks, Bharath

On Fri, Jul 24, 2015 at 8:38 PM, Patrick Wendell pwend...@gmail.com wrote:
Hey Bharath, there was actually an incompatible change to the build process that broke several of the Jenkins builds. This should be patched up in the next day or two and nightly builds will resume. - Patrick

On Fri, Jul 24, 2015 at 12:51 AM, Bharath Ravi Kumar reachb...@gmail.com wrote:
I noticed the last (1.5) build has a timestamp of 16th July. Have nightly builds been discontinued since then? Thanks, Bharath

On Sun, May 24, 2015 at 1:11 PM, Patrick Wendell pwend...@gmail.com wrote:
Hi All, this week I got around to setting up nightly builds for Spark on Jenkins. I'd like feedback on these, and if it's going well I can merge the relevant automation scripts into Spark mainline and document it on the website. Right now I'm doing:

1. SNAPSHOTs of Spark master and release branches published to the ASF Maven snapshot repo: https://repository.apache.org/content/repositories/snapshots/org/apache/spark/ These are usable by adding this repository to your build and using a snapshot version (e.g. 1.3.2-SNAPSHOT).

2. Nightly binary package builds and doc builds of master and release versions: http://people.apache.org/~pwendell/spark-nightly/ These build 4 times per day and are tagged based on commits.

If anyone has feedback on these please let me know. Thanks! - Patrick

-- Olivier Girardot | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94
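[A minimal sketch of consuming the snapshot repository from an sbt build, per Patrick's point 1. The artifact and version shown are illustrative; use whichever snapshot version is currently published.]

// build.sbt
resolvers += "ASF Snapshots" at "https://repository.apache.org/content/repositories/snapshots/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.2-SNAPSHOT" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.3.2-SNAPSHOT" % "provided"
)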
Subscribe
Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?
Hi Nick, I forgot to mention in the survey that ganglia is never installed properly, for some reason. I get this exception every time I launch the cluster:

Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such file or directory [FAILED]

Best Regards,
Jerry

On Mon, Aug 17, 2015 at 11:09 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Howdy folks! I’m interested in hearing about what people think of spark-ec2 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the formal JIRA process. Your answers will all be anonymous and public. If the embedded form below doesn’t work for you, you can use this link to get the same survey: http://goo.gl/forms/erct2s6KRR Cheers! Nick