Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Darin McBeath
} {$pii} {$content-type} {$srctitle} {$document-type} {$document-subtype} {$publication-date} {$article-title} {$issn} {$isbn} {$lang} {$tables} return xml-to-json($retval) Darin. On Tuesday, March 13, 2018, 8:52:42 AM

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-20 Thread darin
This issue on Stack Overflow may help: https://stackoverflow.com/questions/42641573/why-does-memory-usage-of-spark-worker-increases-with-time/42642233#42642233

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-17 Thread darin
I added this code in the foreachRDD block: ``` rdd.persist(StorageLevel.MEMORY_AND_DISK) ``` The exception no longer occurs, but many dead executors show in the Spark Streaming UI: ``` User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 1194.0
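A minimal sketch of the pattern described here, assuming a DStream named `stream` (the name and the per-batch actions are illustrative):

```
import org.apache.spark.storage.StorageLevel

stream.foreachRDD { rdd =>
  rdd.persist(StorageLevel.MEMORY_AND_DISK)   // cache the batch before reusing it
  val n = rdd.count()                         // first action materializes the cache
  // ... any further actions reuse the persisted blocks ...
  rdd.unpersist()                             // release the blocks once the batch is done
}
```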

spark streaming executors memory increasing and executor killed by yarn

2017-03-16 Thread darin
Hi, I get this exception after the streaming program has run for some hours. ``` User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 1194.0 failed 4 times, most recent failure: Lost task 21.3 in stage 1194.0 (TID 2475, 2.dev3, executor 66):

Re: [SparkSQL] too many open files although ulimit set to 1048576

2017-03-13 Thread darin
I don't think your settings are taking effect. Try adding `ulimit -n 10240` in spark-env.sh.

Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-20 Thread McBeath, Darin W (ELS-STL)
ner I get Option[org.apache.spark.Partitioner] = None I would have thought it would be HashPartitioner. Does anyone know why this would be None and not HashPartitioner? Thanks. Darin.
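A minimal sketch reproducing the observation, assuming a Spark 2.x session named `spark` (column name illustrative). The hash partitioning lives in the Catalyst physical plan rather than as an RDD `Partitioner` object, which appears to be why the RDD conversion reports None:

```
import spark.implicits._

val ds = spark.range(100).withColumnRenamed("id", "key")
val byKey = ds.repartition($"key")   // HashPartitioning in the physical plan
byKey.rdd.partitioner                // Option[org.apache.spark.Partitioner] = None
```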

How to find the partitioner for a Dataset

2016-09-07 Thread Darin McBeath
be appreciated. Thanks. Darin.

Datasets and Partitioners

2016-09-06 Thread Darin McBeath
. Thanks. Darin.

Dataset Filter performance - trying to understand

2016-09-01 Thread Darin McBeath
he size of my cluster) but I also want to clarify my understanding of whole-stage code generation. Any thought/suggestions would be appreciated. Also, if anyone has found good resources that further explain the details of the DAG and whole-stage code generation,
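One way to check whether a filter falls inside whole-stage code generation is the '*' prefix on operators in explain() output (Spark 2.x). A sketch; the input path and plan text below are illustrative:

```
import spark.implicits._

val ds = spark.read.parquet("s3n://bucket/data")   // illustrative input
ds.filter($"id" > 100).explain()
// == Physical Plan ==
// *Filter (isnotnull(id#0L) && (id#0L > 100))
// +- *FileScan parquet ...
```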

Re: Best way to read XML data from RDD

2016-08-22 Thread Darin McBeath
is that it returns a string.  So, you have to be a little creative when returning multiple values (such as delimiting the values with a special character and then splitting on this delimiter).   Darin. From: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> To: Darin McBeath &

Re: Best way to read XML data from RDD

2016-08-21 Thread Darin McBeath
, or xslt to transform, extract, or filter. Like mentioned below, you want to initialize the parser in a mapPartitions call (one of the examples shows this). Hope this is helpful. Darin. From: Hyukjin Kwon <gurwls...@gmail.com> To: Jörn Franke <jornfra...@
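A sketch of the mapPartitions initialization being described, using the class and method names that appear in these threads; the xpath expression and `xmlKeyPair` (a pair RDD of key/XML strings) are illustrative:

```
import com.elsevier.spark_xml_utils.xpath.XPathProcessor

val counts = xmlKeyPair.mapPartitions { recsIter =>
  val proc = XPathProcessor.getInstance("count(/doc/ref)")  // one parser per partition
  recsIter.map { case (key, xml) => proc.evaluateString(xml).toInt }
}.sum
```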

RDD vs Dataset performance

2016-07-28 Thread Darin McBeath
use cases (such as the one I'm using above). Am I doing something wrong or is this to be expected? Thanks. Darin.

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
repartition should work or if this is a bug. Thanks Jacek for starting to dig into this. Darin. - Original Message - From: Darin McBeath <ddmcbe...@yahoo.com.INVALID> To: Jacek Laskowski <ja...@japila.pl> Cc: user <user@spark.apache.org> Sent: Friday, March 11, 2016 1

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
rviceInit call"); log.info("SimpleStorageServiceInit call arg1: " + arg1); log.info("SimpleStorageServiceInit call arg2: " + arg2); log.info("SimpleStorageServiceInit call arg3: " + arg3); SimpleStorageService.init(this.arg1, this.arg2, this.arg3); } }

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
dies. Darin. From: Jacek Laskowski <ja...@japila.pl> To: Darin McBeath <ddmcbe...@yahoo.com> Cc: user <user@spark.apache.org> Sent: Friday, March 11, 2016 1:24 PM Subject: Re: spark 1.6 foreachPartition only appears to be running on one executor

spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
-submit. Thanks. Darin.
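A sketch of the workaround this thread converges on: with too few partitions, foreachPartition can land entirely on one executor, so widen the RDD first (sample data and sizing are illustrative):

```
val rdd = sc.parallelize(1 to 1000)    // stand-in for the real input
val widened = rdd.repartition(16)      // aim for at least (executors x cores) partitions
widened.foreachPartition { iter =>
  // per-partition setup would go here (e.g., the SimpleStorageService.init above)
  iter.foreach(println)                // stand-in for the real per-record work
}
```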

Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Darin McBeath
of my files can have embedded newlines). But, I wonder how either of these would behave if I passed literally a million (or more) 'filenames'. Before I spend time exploring, I wanted to seek some input. Any thoughts would be appreciated. Darin

Spark 1.6 and Application History not working correctly

2016-01-13 Thread Darin McBeath
to view the application history for both jobs. Has anyone else noticed this issue? Any suggestions? Thanks. Darin.

Re: Spark 1.6 and Application History not working correctly

2016-01-13 Thread Darin McBeath
Thanks. I already set the following in spark-defaults.conf so I don't think that is going to fix my problem. spark.eventLog.dir file:///root/spark/applicationHistory spark.eventLog.enabled true I suspect my problem must be something else. Darin. From: Don

Re: Turning off DTD Validation using XML Utils package - Spark

2015-12-04 Thread Darin McBeath
e new getInstance function and some more information on the various features. Darin. From: Darin McBeath <ddmcbe...@yahoo.com.INVALID> To: "user@spark.apache.org" <user@spark.apache.org> Sent: Tuesday, December 1, 2015 11:51 AM Subject: Re: Turning off D

Re: Turning off DTD Validation using XML Utils package - Spark

2015-12-01 Thread Darin McBeath
) or you could have it find 'local' versions (on the workers or in S3 and then cache them locally for performance). I will post an update when the code has been adjusted. Darin. - Original Message - From: Shivalik <shivalik.malho...@outlook.com> To: user@spark.apache.org Sent: T

Re: Reading xml in java using spark

2015-09-01 Thread Darin McBeath
or.getInstance(xpath, namespaces) recsIter.map(rec => proc.evaluateString(rec._2).toInt) }).sum There is more documentation on the spark-xml-utils github site. Let me know if the documentation is not clear or if you have any questions. Darin.

Please add the Cincinnati spark meetup to the list of meet ups

2015-07-07 Thread Darin McBeath
 http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/ Thanks. Darin.

Running into several problems with Data Frames

2015-04-17 Thread Darin McBeath
, EntityType: String, CustomerId: String, EntityURI: String, NumDocs: Long) val entities = sc.textFile("s3n://darin/Entities.csv") val entitiesArr = entities.map(v => v.split('|')) val dfEntity = entitiesArr.map(arr => Entity(arr(0).toLong, arr(1).toLong, arr(2), arr(3), arr(4), arr(5).toLong)).toDF

repartitionAndSortWithinPartitions and mapPartitions and sort order

2015-03-12 Thread Darin McBeath
. Thanks. Darin.
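A sketch of the contract in question: after repartitionAndSortWithinPartitions, each partition's iterator arrives key-sorted, and mapPartitions preserves that order as long as it simply consumes the iterator (sample data illustrative):

```
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("b", 1), ("a", 2)))
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
// each partition is visited in key order; no re-sorting needed downstream
sorted.mapPartitions(iter => iter.map { case (k, v) => s"$k=$v" }).collect()
```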

Question about the spark assembly deployed to the cluster with the ec2 scripts

2015-03-05 Thread Darin McBeath
is for hadoop2.0.0 (and I think Cloudera). Is there a way that I can force the install of the same assembly to the cluster that comes with the 1.2 download of spark? Thanks. Darin.

Re: Question about Spark best practice when counting records.

2015-02-27 Thread Darin McBeath
Thanks for your quick reply. Yes, that would be fine. I would rather wait/use the optimal approach as opposed to hacking some one-off solution. Darin. From: Kostas Sakellis kos...@cloudera.com To: Darin McBeath ddmcbe...@yahoo.com Cc: User user

Question about Spark best practice when counting records.

2015-02-27 Thread Darin McBeath
a partition that I might over count, but perhaps that is an acceptable trade-off. I'm guessing that others have run into this before, so I would like to learn from the experience of others and how they have addressed this. Thanks. Darin

job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-25 Thread Darin McBeath
); If anyone has any tips for what I should look into, it would be appreciated. Thanks. Darin.

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
Just to close the loop in case anyone runs into the same problem I had. By setting --hadoop-major-version=2 when using the ec2 scripts, everything worked fine. Darin. - Original Message - From: Darin McBeath ddmcbe...@yahoo.com.INVALID To: Mingyu Kim m...@palantir.com; Aaron Davidson

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
is an interface of type org.apache.hadoop.mapred.JobContext. Is there something obvious that I might be doing wrong (or messed up in the translation from Scala to Java) or something I should look into? I'm using Spark 1.2 with hadoop 2.4. Thanks. Darin. From

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
it and post a response. - Original Message - From: Mingyu Kim m...@palantir.com To: Darin McBeath ddmcbe...@yahoo.com; Aaron Davidson ilike...@gmail.com Cc: user@spark.apache.org user@spark.apache.org Sent: Monday, February 23, 2015 3:06 PM Subject: Re: Which OutputCommitter to use for S3? Cool

Incorrect number of records after left outer join (I think)

2015-02-19 Thread Darin McBeath
1.2 on a stand-alone cluster on ec2. To get the counts for the records, I'm using the .count() for the RDD. Thanks. Darin.
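For anyone hitting the same puzzle: a common cause of an unexpected count after a left outer join is duplicate keys on the right side multiplying the matching left rows. A minimal illustration (data made up):

```
val left  = sc.parallelize(Seq((1, "a"), (2, "b")))   // 2 records
val right = sc.parallelize(Seq((1, "x"), (1, "y")))   // key 1 appears twice
left.leftOuterJoin(right).count()                     // 3, not 2
```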

Re: How do you get the partitioner for an RDD in Java?

2015-02-17 Thread Darin McBeath
Thanks Imran. That's exactly what I needed to know. Darin. From: Imran Rashid iras...@cloudera.com To: Darin McBeath ddmcbe...@yahoo.com Cc: User user@spark.apache.org Sent: Tuesday, February 17, 2015 8:35 PM Subject: Re: How do you get the partitioner

MapValues and Shuffle Reads

2015-02-17 Thread Darin McBeath
()); log.info("Number of baseline records: " + baselinePairRDD.count()); Both hsfBaselinePairRDD and baselinePairRDD have 1024 partitions. Any insights would be appreciated. I'm using Spark 1.2.0 in a stand-alone cluster. Darin
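Relevant background for this thread: mapValues keeps the parent's partitioner while a plain map does not, which is often the difference between a shuffle-free and a shuffling downstream operation. A sketch (data illustrative):

```
import org.apache.spark.HashPartitioner

val base = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))
base.mapValues(_.toUpperCase).partitioner                    // Some(HashPartitioner) -- kept
base.map { case (k, v) => (k, v.toUpperCase) }.partitioner   // None -- lost
```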

Re: MapValues and Shuffle Reads

2015-02-17 Thread Darin McBeath
really want to do in the first place. Thanks again for your insights. Darin. From: Imran Rashid iras...@cloudera.com To: Darin McBeath ddmcbe...@yahoo.com Cc: User user@spark.apache.org Sent: Tuesday, February 17, 2015 3:29 PM Subject: Re: MapValues and Shuffle

How do you get the partitioner for an RDD in Java?

2015-02-17 Thread Darin McBeath
of getting this information. I'm curious if anyone has any suggestions for how I might go about finding how an RDD is partitioned in a Java program. Thanks. Darin.
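In Scala the partitioner is exposed directly on the RDD; from Java, one route is dropping to the underlying Scala RDD via javaPairRdd.rdd().partitioner(). A Scala sketch (data illustrative):

```
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq((1, "a"))).partitionBy(new HashPartitioner(8))
pairs.partitioner   // Option[org.apache.spark.Partitioner] = Some(...)
```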

Problems saving a large RDD (1 TB) to S3 as a sequence file

2015-01-23 Thread Darin McBeath
0.3 s 323.4 MB If anyone has some suggestions please let me know. I've tried playing around with various configuration options but I've found nothing yet that will fix the underlying issue. Thanks. Darin

Re: Problems saving a large RDD (1 TB) to S3 as a sequence file

2015-01-23 Thread Darin McBeath
,30); From: Sven Krasser kras...@gmail.com To: Darin McBeath ddmcbe...@yahoo.com Cc: User user@spark.apache.org Sent: Friday, January 23, 2015 5:12 PM Subject: Re: Problems saving a large RDD (1 TB) to S3 as a sequence file Hey Darin, Are you running

Confused about shuffle read and shuffle write

2015-01-21 Thread Darin McBeath
about is why there is even any 'shuffle read' when constructing the baselinePairRDD. If anyone could shed some light on this it would be appreciated. Thanks. Darin.

Confused about shuffle read and shuffle write

2015-01-20 Thread Darin McBeath
confused about is why there is even any 'shuffle read' when constructing the baselinePairRDD.  If anyone could shed some light on this it would be appreciated. Thanks. Darin.

Re: Please help me get started on Apache Spark

2014-11-20 Thread Darin McBeath
Take a look at the O'Reilly Learning Spark (Early Release) book.  I've found this very useful. Darin. From: Saurabh Agrawal saurabh.agra...@markit.com To: user@spark.apache.org user@spark.apache.org Sent: Thursday, November 20, 2014 9:04 AM Subject: Please help me get started on Apache

Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Darin McBeath
be appreciated. Thanks. Darin. Here is where I'm setting the 'timeout' in my spark job. SparkConf conf = new SparkConf().setAppName("SparkSync Application").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").set("spark.rdd.compress", "true").set

ec2 script and SPARK_LOCAL_DIRS not created

2014-11-12 Thread Darin McBeath
for /mnt/spark and /mnt2/spark do not exist. Am I missing something? Has anyone else noticed this? A colleague started a cluster (using the ec2 scripts) with m3.xlarge machines, and both /mnt/spark and /mnt2/spark directories were created. Thanks. Darin.

What should be the number of partitions after a union and a subtractByKey

2014-11-11 Thread Darin McBeath
Assume the following where both updatePairRDD and deletePairRDD are HashPartitioned. Before the union, each one of these has 512 partitions. The newly created updateDeletePairRDD has 1024 partitions. Is this the general/expected behavior for a union (for the number of partitions to double)?
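A sketch illustrating the observed behavior: a plain union concatenates its parents' partitions (512 + 512 = 1024) and reports no partitioner, so a following subtractByKey shuffles again. Note that later Spark versions can keep the partitioner and partition count when both parents share the same one:

```
import org.apache.spark.HashPartitioner

val updates = sc.parallelize(Seq((1, "u"))).partitionBy(new HashPartitioner(512))
val deletes = sc.parallelize(Seq((2, "d"))).partitionBy(new HashPartitioner(512))
val unioned = updates.union(deletes)
unioned.partitions.length   // 1024 on the Spark of this era; newer versions may report 512
unioned.partitioner         // None here, so subtractByKey must repartition
```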

Question about RDD Union and SubtractByKey

2014-11-10 Thread Darin McBeath
. But, based on other logging statements throughout my application, I don't believe this is the case. Thanks. Darin. 14/11/10 22:35:27 INFO scheduler.DAGScheduler: Failed to run count at SparkSync.java:785 14/11/10 22:35:27 WARN scheduler.TaskSetManager: Lost task 0.3 in stage 40.0 (TID 10674, ip-10

Cincinnati, OH Meetup for Apache Spark

2014-11-03 Thread Darin McBeath
Let me know if you are interested in participating in a meet up in Cincinnati, OH to discuss Apache Spark. We currently have 4-5 different companies expressing interest but would like a few more. Darin.

XML Utilities for Apache Spark

2014-10-29 Thread Darin McBeath
-shell as well as from a Java application.  Feel free to use, contribute, and/or let us know how this library can be improved.  Let me know if you have any questions. Darin.

Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
. Thanks. Darin.

Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
OK. After reading some documentation, it would appear the issue is the default number of partitions for a join (200). After doing something like the following, I was able to change the value. From: Darin McBeath ddmcbe...@yahoo.com.INVALID To: User user@spark.apache.org Sent: Wednesday

Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
Sorry, hit the send key a bit too early. Anyway, this is the code I set. sqlContext.sql("set spark.sql.shuffle.partitions=10"); From: Darin McBeath ddmcbe...@yahoo.com To: Darin McBeath ddmcbe...@yahoo.com; User user@spark.apache.org Sent: Wednesday, October 29, 2014 2:47 PM Subject: Re

what's the best way to initialize an executor?

2014-10-23 Thread Darin McBeath
with XPathProcessor.init.  I have code in place to make sure this is not an issue.  But, I was wondering if there is a better way to accomplish something like this. Thanks. Darin.
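A sketch of a common answer to this question: wrap the expensive setup in a JVM-level singleton so it initializes lazily, once per executor, on first use. Names and the xpath are illustrative; `xmlKeyPair` is the key/XML pair RDD used elsewhere in these threads:

```
import com.elsevier.spark_xml_utils.xpath.XPathProcessor

object ProcessorHolder {
  lazy val proc = XPathProcessor.getInstance("/doc/title/text()")  // runs once per executor JVM
}

val titles = xmlKeyPair.map { case (key, xml) =>
  (key, ProcessorHolder.proc.evaluateString(xml))
}
```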

confused about memory usage in spark

2014-10-22 Thread Darin McBeath
using 297MB for this RDD (when it was only 156MB in S3).  I get that there could be some differences between the serialized storage format and what is then used in memory, but I'm curious as to whether I'm missing something and/or should be doing things differently. Thanks. Darin.
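The gap described here (156MB on S3 vs ~297MB cached) is typical of caching deserialized Java objects; caching serialized bytes usually lands much closer to the on-disk size. A sketch for comparing the two in the Storage tab (path illustrative):

```
import org.apache.spark.storage.StorageLevel

val asObjects = sc.textFile("s3n://bucket/data").persist(StorageLevel.MEMORY_ONLY)
val asBytes   = sc.textFile("s3n://bucket/data").persist(StorageLevel.MEMORY_ONLY_SER)
asObjects.count(); asBytes.count()   // materialize both, then compare sizes in the UI
```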

Disabling log4j in Spark-Shell on ec2 stopped working on Wednesday (Oct 8)

2014-10-10 Thread Darin McBeath
ec2 startup script. But, that is purely a guess on my part. I'm wondering if anyone else has noticed this issue and if so has a workaround. Thanks. Darin.

How to pass env variables from master to executors within spark-shell

2014-08-20 Thread Darin McBeath
within spark-shell, this really isn't an option (unless I'm missing something).   Thanks. Darin.
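For reference, executor environment variables can be set on the SparkConf in a compiled job, and for spark-shell the property form on launch should be equivalent (the variable name is illustrative):

```
import org.apache.spark.SparkConf

val conf = new SparkConf().setExecutorEnv("MY_VAR", "some-value")
// spark-shell equivalent: ./bin/spark-shell --conf spark.executorEnv.MY_VAR=some-value
// Inside a task, each executor then sees the variable:
sc.parallelize(1 to 4).map(_ => sys.env.getOrElse("MY_VAR", "unset")).collect()
```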

Issues with S3 client library and Apache Spark

2014-08-15 Thread Darin McBeath
to this problem. Thanks. Darin. java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator

Should the memory of worker nodes be constrained to the size of the master node?

2014-08-14 Thread Darin McBeath
. Darin.

Is there any interest in handling XML within Spark ?

2014-08-13 Thread Darin McBeath
to executors) [root@ip-10-233-73-204 spark]$ ./bin/spark-shell --jars lib/uber-SparkUtils-0.1.jar ## Bring in the sequence file (2 million records) scala> val xmlKeyPair = sc.sequenceFile[String,String]("s3n://darin/xml/part*").cache() ## Test values against an xpath expression (need to import the class

Re: Number of partitions and Number of concurrent tasks

2014-07-31 Thread Darin McBeath
=.08 -z us-east-1e --worker-instances=2 my-cluster From: Daniel Siegmann daniel.siegm...@velos.io To: Darin McBeath ddmcbe...@yahoo.com Cc: Daniel Siegmann daniel.siegm...@velos.io; user@spark.apache.org user@spark.apache.org Sent: Thursday, July 31, 2014 10:04

Number of partitions and Number of concurrent tasks

2014-07-30 Thread Darin McBeath
for the running application, but this had no effect. Perhaps this is ignored for a 'filter' and the default is the total number of cores available. I'm fairly new to Spark, so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated. Thanks. Darin.
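Background that usually resolves this question: a stage runs at most min(number of partitions, total executor cores) tasks concurrently, so both sides matter. A sketch for checking and widening (path illustrative):

```
val rdd = sc.textFile("s3n://bucket/input", 64)   // request 64 minimum partitions
rdd.partitions.length                             // the task count for the next stage
val wider = rdd.repartition(128)                  // shuffle to raise parallelism
```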

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Darin McBeath
the documentation states).  What would I want that value to be based on my configuration below?  Or, would I leave that alone? From: Daniel Siegmann daniel.siegm...@velos.io To: user@spark.apache.org; Darin McBeath ddmcbe...@yahoo.com Sent: Wednesday, July 30, 2014