Re: Writing to multiple outputs in Spark

2015-08-17 Thread Silas Davis
@Reynold Xin: not really: it only works for Parquet (see partitionBy:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter),
it requires you to have a DataFrame in the first place (for my use case the
Spark SQL interface to Avro records is more of a hindrance than a help,
since I want to use generated Java classes rather than treat Avro records
as generic tables (via Rows)), and even if I do have a DataFrame, if I need
to map or mapPartitions I lose that interface and have to create a new
DataFrame from an RDD[Row], which isn't very convenient or efficient.

Has anyone been able to take a look at my gist:
https://gist.github.com/silasdavis/d1d1f1f7ab78249af462? The first 100
lines provide a base class for MultipleOutputsFormat; see line 269 for an
example of how to use such an OutputFormat:
https://gist.github.com/silasdavis/d1d1f1f7ab78249af462#file-multipleoutputs-scala-L269

@Alex Angelini, the code there would support your use case without
modifying Spark (it uses saveAsNewAPIHadoopFile and a multiple-outputs
wrapper format).

@Nicholas Chammas, I'll post a link to my gist on that ticket.

On Fri, 14 Aug 2015 at 21:10 Reynold Xin r...@databricks.com wrote:

 This is already supported with the new partitioned data sources in
 DataFrame/SQL right?


 On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini alex.angel...@shopify.com
 wrote:

 Speaking about Shopify's deployment, this would be a really nice-to-have
 feature.

 We would like to write data to folders with the structure
 `year/month/day` but have had to hold off on that because of the lack
 of support for MultipleOutputs.

 On Fri, Aug 14, 2015 at 10:56 AM, Silas Davis si...@silasdavis.net
 wrote:

 Would it be right to assume that the silence on this topic implies
 others don't really have this issue/desire?

 On Sat, 18 Jul 2015 at 17:24 Silas Davis si...@silasdavis.net wrote:

 *tl;dr Hadoop and Cascading provide ways of writing tuples to multiple
 output files based on key, but the plain RDD interface doesn't seem to,
 and it should.*

 I have been looking into ways to write to multiple outputs in Spark. It
 seems like a feature that is somewhat missing from Spark.

 The idea is to partition the output and write the elements of an RDD to
 different locations depending on the key. For example, in a pair RDD
 your key may be (language, date, userId) and you would like to write
 separate files to $someBasePath/$language/$date. Then there would be a
 version of saveAsHadoopDataset that would be able to write to multiple
 locations based on the key using the underlying OutputFormat. Perhaps it
 would take a pair RDD with keys ($partitionKey, $realKey), so for example
 ((language, date), userId).
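
 To make the shape of this concrete, here is a minimal sketch (assuming a
 spark-shell SparkContext sc; the final save call is hypothetical and does
 not exist in Spark today):

 import org.apache.spark.rdd.RDD

 // toy input: (language, date, userId) triples
 val events: RDD[(String, String, String)] =
   sc.parallelize(Seq(("en", "2015-08-17", "u1"), ("fr", "2015-08-17", "u2")))

 // key each element by (partitionKey, realKey), i.e. ((language, date), userId)
 val keyed: RDD[((String, String), String)] =
   events.map { case (language, date, userId) => ((language, date), userId) }

 // a saveAsHadoopDataset-style variant would then route each record to
 // $someBasePath/$language/$date via the underlying OutputFormat, e.g.:
 // keyed.saveAsHadoopDatasetByKey(jobConf)   // hypothetical API, illustration only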

 The prior art I have found on this is the following.

 Using SparkSQL:
 The 'partitionBy' method of DataFrameWriter:
 https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

 This only works for parquet at the moment.
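
 For reference, a minimal sketch of that route (Spark 1.4+, assuming a
 spark-shell sqlContext; partitionBy lays out $path/language=<v>/date=<v>
 directories, but only for the Parquet source at present):

 case class Event(language: String, date: String, userId: String)
 val df = sqlContext.createDataFrame(Seq(Event("en", "2015-08-17", "u1")))
 df.write.partitionBy("language", "date").parquet("/tmp/events_by_language_date")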

 Using Spark/Hadoop:
 This pull request (using the hadoop1 API):
 https://github.com/apache/spark/pull/4895/files.

 This uses MultipleTextOutputFormat (which in turn uses
 MultipleOutputFormat), which is part of the old hadoop1 API. It only works
 for text, but could be generalised for any underlying OutputFormat by using
 MultipleOutputFormat (hadoop1 only, though, which doesn't support
 ParquetAvroOutputFormat, for example).
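
 A rough sketch of that hadoop1 route, assuming a pair RDD keyedByPath whose
 key is the relative output path (e.g. "en/2015-08-17") and whose value is
 the line to write (a sketch only, following the usual
 MultipleTextOutputFormat pattern):

 import org.apache.hadoop.io.NullWritable
 import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

 // route each record to a sub-directory derived from its key
 class KeyPartitionedTextOutput extends MultipleTextOutputFormat[Any, Any] {
   // key "en/2015-08-17" -> $basePath/en/2015-08-17/part-00000 etc.
   override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
     s"${key.toString}/$name"
   // suppress the key so only the value is written to the output file
   override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
 }

 keyedByPath.saveAsHadoopFile("/some/base/path", classOf[String], classOf[String],
   classOf[KeyPartitionedTextOutput])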

 This gist (With the hadoop2 API):
 https://gist.github.com/mlehman/df9546f6be2e362bbad2

 This uses MultipleOutputs (available for both the old and new hadoop
 APIs) and extends saveAsNewAPIHadoopDataset to support multiple outputs.
 It should work for any underlying OutputFormat. It is probably better
 implemented by extending saveAs[NewAPI]HadoopDataset.

 In Cascading:
 Cascading provides PartitionTap:
 http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
 to do this.

 So my questions are: is there a reason why Spark doesn't provide this?
 Does Spark provide similar functionality through some other mechanism? How
 would it be best implemented?

 Since I started composing this message I've had a go at writing a
 wrapper OutputFormat that writes multiple outputs using Hadoop's
 MultipleOutputs but doesn't require modification of PairRDDFunctions.
 The principle is similar, however. Again, it feels slightly hacky to use
 dummy fields for the ReduceContextImpl, but some of this may be part of
 the impedance mismatch between Spark and plain Hadoop... Here is my
 attempt: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462

 I'd like to see this functionality in Spark somehow, and I invite
 suggestions on how best to achieve it.

 Thanks,
 Silas






[survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Nicholas Chammas
Howdy folks!

I’m interested in hearing about what people think of spark-ec2
http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
formal JIRA process. Your answers will all be anonymous and public.

If the embedded form below doesn’t work for you, you can use this link to
get the same survey:

http://goo.gl/forms/erct2s6KRR

Cheers!
Nick


Spark Job Hangs on our production cluster

2015-08-17 Thread java8964
I am comparing the Spark logs line by line between the hanging case (big
dataset) and the non-hanging case (small dataset).
In the hanging case, Spark's log looks identical to the non-hanging case for
reading the first block of data from HDFS.
But after that, starting from line 438 in spark-hang.log, I only see the
log generated by the Worker, like the following, for the next 10 minutes:
15/08/14 14:24:19 DEBUG Worker: [actor] received message SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
15/08/14 14:24:19 DEBUG Worker: [actor] handled message (0.121965 ms) SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
...
15/08/14 14:33:04 DEBUG Worker: [actor] received message SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
15/08/14 14:33:04 DEBUG Worker: [actor] handled message (0.136146 ms) SendHeartbeat from Actor[akka://sparkWorker/user/Worker#90699948]
After almost 10 minutes I had to kill the job; I know it will hang forever.
But in the good log (spark-finished.log), starting from line 361, Spark
starts to read the 2nd split, and I can see all the debug messages from
BlockReaderLocal and BlockManager.
Comparing the logs of these 2 cases:
In the good case, from line 478, I can see this message: 15/08/14 14:37:09
DEBUG BlockReaderLocal: putting FileInputStream for ..
But in the hanging case, for reading the 2nd split, I don't see this
message any more (it existed for the 1st split). I believe this log message
should show up here too, as the 2nd split block also exists on this Spark
node; just before it, I can see the following debug message:
15/08/14 14:24:11 DEBUG BlockReaderLocal: Created BlockReaderLocal for file /services/contact2/data/contacts/20150814004805-part-r-2.avro block BP-834217708-10.20.95.130-1438701195738:blk_1074484553_1099531839081 in datanode 10.20.95.146:50010
15/08/14 14:24:11 DEBUG Project: Creating MutableProj: WrappedArray(), inputSchema: ArrayBuffer(account_id#0L, contact_id#1, sequence_id#2, state#3, name#4, kind#5, prefix_name#6, first_name#7, middle_name#8, company_name#9, job_title#10, source_name#11, source_details#12, provider_name#13, provider_details#14, created_at#15L, create_source#16, updated_at#17L, update_source#18, accessed_at#19L, deleted_at#20L, delta#21, birthday_day#22, birthday_month#23, anniversary#24L, contact_fields#25, related_contacts#26, contact_channels#27, contact_notes#28, contact_service_addresses#29, contact_street_addresses#30), codegen:false
This log is generated on node 10.20.95.146, and Spark created a
BlockReaderLocal to read the data from the local node.
Now my question is: can someone give me any idea why the "DEBUG
BlockReaderLocal: putting FileInputStream for ..." message no longer shows
up in this case?
I attached the log files again in this email, and really hope I can get some 
help from this list.
Thanks
Yong
From: java8...@hotmail.com
To: u...@spark.apache.org
Subject: RE: Spark Job Hangs on our production cluster
Date: Fri, 14 Aug 2015 15:14:10 -0400




I still want to check whether anyone can provide any help with why Spark 1.2.2
hangs on our production cluster when reading big HDFS data (7800 avro
blocks), while it looks fine for small data (769 avro blocks).
I enabled the debug level in the Spark log4j configuration, and attached the
log file in case it helps with troubleshooting.
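
For reference, the DEBUG output was enabled through the standard log4j
configuration; a minimal sketch of conf/log4j.properties for that (based on
the stock template; the exact appender settings here are an assumption):

# log everything at DEBUG to the console appender
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
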
Summary of our cluster:
IBM BigInsight V3.0.0.2 (running Hadoop 2.2.0 + Hive 0.12)
42 data nodes, each running an HDFS DataNode process + task tracker + Spark worker
One master, running the HDFS NameNode + Spark master
Another master node, running the 2nd NameNode + JobTracker
I ran 2 test cases, using a very simple spark-shell script to read 2 folders:
one is big data with 1T of avro files; the other is small data with 160G of
avro files.
The avro schemas of the 2 folders are different, but I don't think that will
make any difference here.
The test script is like following:
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import com.databricks.spark.avro._
val testdata = sqlContext.avroFile("hdfs://namenode:9000/bigdata_folder")   // vs sqlContext.avroFile("hdfs://namenode:9000/smalldata_folder")
testdata.registerTempTable("testdata")
testdata.count()
Both cases are kicked off the same way:
/opt/spark/bin/spark-shell --jars /opt/ibm/cclib/spark-avro.jar --conf spark.ui.port=4042 --executor-memory 24G --total-executor-cores 42 --conf spark.storage.memoryFraction=0.1 --conf spark.sql.shuffle.partitions=2000 --conf spark.default.parallelism=2000
When the script points to the small data folder, Spark finishes very fast;
each task scanning an HDFS block finishes within 30 seconds or less.
When the script points to the big data folder, most of the nodes finish
scanning the first HDFS block within 2 mins (longer than case 1), then the
scanning will

Re: [ANNOUNCE] Nightly maven and package builds for Spark

2015-08-17 Thread Shivaram Venkataraman
This should be fixed now. I just triggered a manual build and the
latest binaries are at
http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/spark-1.5.0-SNAPSHOT-2015_08_17_00_36-3ff81ad-bin/

Thanks
Shivaram

On Mon, Aug 17, 2015 at 12:26 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
 thx for this, let me know if you need help

 2015-08-16 23:38 GMT+02:00 Shivaram Venkataraman
 shiva...@eecs.berkeley.edu:

 I just investigated this and this is happening because of a Maven
 version requirement not being met. I'll look at modifying the build
 scripts to use Maven 3.3.3 (with build/mvn --force ?)

 Shivaram

 On Sun, Aug 16, 2015 at 10:16 AM, Olivier Girardot
 o.girar...@lateral-thoughts.com wrote:
  Hi Patrick,
   is there any way for the nightly build to include common distributions,
   like with/without Hive/YARN support, Hadoop 2.4, 2.6, etc.?
   For now it seems that the nightly binary package builds actually ship
   only the source?
   I can help on that too if you want,
  I can help on that too if you want,
 
  Regards,
 
  Olivier.
 
  2015-08-02 5:19 GMT+02:00 Bharath Ravi Kumar reachb...@gmail.com:
 
  Thanks for fixing it.
 
  On Sun, Aug 2, 2015 at 3:17 AM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  Hey All,
 
  I got it up and running - it was a newly surfaced bug in the build
  scripts.
 
  - Patrick
 
  On Wed, Jul 29, 2015 at 6:05 AM, Bharath Ravi Kumar
  reachb...@gmail.com
  wrote:
   Hey Patrick,
  
   Any update on this front please?
  
   Thanks,
   Bharath
  
   On Fri, Jul 24, 2015 at 8:38 PM, Patrick Wendell
   pwend...@gmail.com
   wrote:
  
   Hey Bharath,
  
   There was actually an incompatible change to the build process that
   broke several of the Jenkins builds. This should be patched up in
   the
   next day or two and nightly builds will resume.
  
   - Patrick
  
   On Fri, Jul 24, 2015 at 12:51 AM, Bharath Ravi Kumar
   reachb...@gmail.com wrote:
I noticed the last (1.5) build has a timestamp of 16th July. Have
nightly
builds been discontinued since then?
   
Thanks,
Bharath
   
On Sun, May 24, 2015 at 1:11 PM, Patrick Wendell
pwend...@gmail.com
wrote:
   
Hi All,
   
This week I got around to setting up nightly builds for Spark on
Jenkins. I'd like feedback on these and if it's going well I can
merge
the relevant automation scripts into Spark mainline and document
it
on
the website. Right now I'm doing:
   
1. SNAPSHOT's of Spark master and release branches published to
ASF
Maven snapshot repo:
   
   
   
   
   
https://repository.apache.org/content/repositories/snapshots/org/apache/spark/
   
 These are usable by adding this repository in your build and using a
 snapshot version (e.g. 1.3.2-SNAPSHOT); see the sbt sketch after
 point 2 below.
   
2. Nightly binary package builds and doc builds of master and
release
versions.
   
http://people.apache.org/~pwendell/spark-nightly/
   
These build 4 times per day and are tagged based on commits.
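
 For point 1, a minimal sbt sketch (the module and snapshot version here are
 examples only; substitute whatever you need):

 // build.sbt
 resolvers += "ASF snapshots" at
   "https://repository.apache.org/content/repositories/snapshots/"
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.2-SNAPSHOT"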
   
If anyone has feedback on these please let me know.
   
Thanks!
- Patrick
   
   
   
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
   
   
  
  
 
 
 
 
 
  --
  Olivier Girardot | Associé
  o.girar...@lateral-thoughts.com
  +33 6 24 09 17 94




 --
 Olivier Girardot | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Subscribe

2015-08-17 Thread Rishitesh Mishra



Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Jerry Lam
Hi Nick,

I forgot to mention in the survey that ganglia never installs properly
for some reason.

I have this exception every time I launched the cluster:

Starting httpd: httpd: Syntax error on line 154 of
/etc/httpd/conf/httpd.conf: Cannot load
/etc/httpd/modules/mod_authz_core.so into server:
/etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No
such file or directory

[FAILED]

Best Regards,

Jerry

On Mon, Aug 17, 2015 at 11:09 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Howdy folks!

 I’m interested in hearing about what people think of spark-ec2
 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
 formal JIRA process. Your answers will all be anonymous and public.

 If the embedded form below doesn’t work for you, you can use this link to
 get the same survey:

 http://goo.gl/forms/erct2s6KRR

 Cheers!
 Nick