Re: json_tuple fails to parse string with emoji

2017-01-26 Thread Andrew Ehrlich
It looks like I'm hitting this bug in jackson-core 2.2.3 which is included in the version of CDH I'm on: https://github.com/FasterXML/jackson-core/issues/115 Jackson-core 2.3.0 has the fix. On Tue, Jan 24, 2017 at 5:14 PM, Andrew Ehrlich <and...@aehrlich.com> wrote: > On Spark 1.6.0

json_tuple fails to parse string with emoji

2017-01-24 Thread Andrew Ehrlich
On Spark 1.6.0, calling json_tuple() with an emoji character in one of the values returns nulls: Input: """ "myJsonBody": { "field1": "<emoji>" } """ Query: """ ... LATERAL VIEW JSON_TUPLE(e.myJsonBody,'field1') k AS field1, ... """ This looks like a platform-dependent issue; the parsing
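
A rough sketch of the failing pattern, assuming a sqlContext (as in spark-shell) and a registered table "events" whose string column myJsonBody holds JSON like {"field1": "<emoji>"}; the table and column names are illustrative:

    // field1 comes back null when the value contains certain emoji (jackson-core issue above)
    val result = sqlContext.sql(
      """SELECT k.field1
        |FROM events e
        |LATERAL VIEW JSON_TUPLE(e.myJsonBody, 'field1') k AS field1""".stripMargin)
    result.show()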

Re: Changing Spark configuration midway through application.

2016-08-10 Thread Andrew Ehrlich
If you're changing properties for the SparkContext, then I believe you will have to start a new SparkContext with the new properties. On Wed, Aug 10, 2016 at 8:47 AM, Jestin Ma wrote: > If I run an application, for example with 3 joins: > > [join 1] > [join 2] > [join
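
A minimal sketch of stopping the current context and starting a new one with different settings; the app name and the property being changed are just examples:

    import org.apache.spark.{SparkConf, SparkContext}

    sc.stop()                                    // stop the existing context first
    val conf = new SparkConf()
      .setAppName("my-app")                      // assumed app name
      .set("spark.shuffle.compress", "false")    // example property to change mid-application
    val sc2 = new SparkContext(conf)             // all later jobs use the new settings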

Re: Tuning level of Parallelism: Increase or decrease?

2016-07-31 Thread Andrew Ehrlich
15000 seems like a lot of tasks for that size. Test it out with a .coalesce(50) placed right after loading the data. It will probably either run faster or crash with out of memory errors. > On Jul 29, 2016, at 9:02 AM, Jestin Ma wrote: > > I am processing ~2 TB of
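
A sketch of where that coalesce would go, with an assumed input path:

    val data = sc.textFile("hdfs:///path/to/input")   // ~2 TB input, assumed location
      .coalesce(50)                                   // collapse the ~15000 read partitions into 50
    // run the rest of the job on `data` and compare runtime and memory behaviour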

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-31 Thread Andrew Ehrlich
You could write each image to a different directory instead of a different file. That can be done by filtering the RDD into one RDD for each image and then saving each. That might not be what you’re after though, in terms of space and speed efficiency. Another way would be to save them multiple
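
A rough sketch of the filter-and-save approach, assuming the RDD holds (imageId, imageData) string pairs; the base path and types are assumptions:

    val ids = images.keys.distinct().collect()               // images: RDD[(String, String)], assumed
    ids.foreach { id =>
      images.filter { case (k, _) => k == id }
            .values
            .saveAsTextFile(s"hdfs:///output/images/$id")    // one directory (not one file) per image
    }

Note that this re-scans the RDD once per image, which is part of the space/speed cost mentioned above.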

Re: Bzip2 to Parquet format

2016-07-24 Thread Andrew Ehrlich
You can load the text with sc.textFile() to an RDD[String], then use .map() to convert it into an RDD[Row]. At this point you are ready to apply a schema. Use sqlContext.createDataFrame(rddOfRow, structType) Here is an example on how to define the StructType (schema) that you will combine with
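
A sketch of that pipeline, assuming two comma-separated columns named name and age; the path and schema are illustrative:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age",  IntegerType, nullable = true)))

    val rows = sc.textFile("hdfs:///path/to/data.bz2")    // bzip2 text is decompressed transparently
      .map(_.split(","))
      .map(f => Row(f(0), f(1).trim.toInt))

    val df = sqlContext.createDataFrame(rows, schema)
    df.write.parquet("hdfs:///path/to/output.parquet")    // then write out as Parquet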

Re: Size exceeds Integer.MAX_VALUE

2016-07-24 Thread Andrew Ehrlich
is the training data set for random forest training, about > 36,500 data, any idea how to further partition it? > > On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich <and...@aehrlich.com > <mailto:and...@aehrlich.com>> wrote: > It may be this issue: https://issues.a

Re: Size exceeds Integer.MAX_VALUE

2016-07-23 Thread Andrew Ehrlich
It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235 which limits the size of the blocks in the file being written to disk to 2GB. If so, the solution is for you to try tuning for smaller tasks. Try increasing the number of
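
A sketch of the tuning direction; the partition count is an assumed example:

    // More partitions -> smaller per-task shuffle blocks, keeping each block under the 2 GB limit.
    val repartitioned = rdd.repartition(2000)
    // For DataFrame/SQL shuffles the analogous knob is:
    sqlContext.setConf("spark.sql.shuffle.partitions", "2000")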

Re: How to generate a sequential key in rdd across executors

2016-07-23 Thread Andrew Ehrlich
It’s hard to do in a distributed system. Maybe try generating a meaningful key using a timestamp + hashed unique key fields in the record? > On Jul 23, 2016, at 7:53 PM, yeshwanth kumar wrote: > > Hi, > > i am doing bulk load to hbase using spark, > in which i need to
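
A sketch of that kind of key, with a hypothetical record type (events: RDD[Event] is assumed):

    case class Event(id: String, source: String, timestampMs: Long)   // hypothetical record shape

    // The key is unique-ish and roughly time-ordered, but not strictly sequential.
    val keyed = events.map { e =>
      val key = f"${e.timestampMs}%013d-${math.abs((e.id, e.source).hashCode)}%010d"
      (key, e)
    }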

Re: spark and plot data

2016-07-23 Thread Andrew Ehrlich
@Gourav, did you find any good inline plotting tools when using the Scala kernel? I found one based on highcharts but it was not frictionless the way matplotlib is. > On Jul 23, 2016, at 2:26 AM, Gourav Sengupta > wrote: > > Hi Pedro, > > Toree is Scala kernel for

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Andrew Ehrlich
+1 for the misleading error. Messages about failing to connect often mean that an executor has died. If so, dig into the executor logs and find out why the executor died (out of memory, perhaps). Andrew > On Jul 23, 2016, at 11:39 AM, VG wrote: > > Hi Pedro, > > Based on

Re: How to give name to Spark jobs shown in Spark UI

2016-07-23 Thread Andrew Ehrlich
As far as I know, the best you can do is refer to the Actions by line number. > On Jul 23, 2016, at 8:47 AM, unk1102 wrote: > > Hi I have multiple child spark jobs run at a time. Is there any way to name > these child spark jobs so I can identify slow running ones. For e.

Re: Spark Job trigger in production

2016-07-19 Thread Andrew Ehrlich
Another option is Oozie with the spark action: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html > On Jul 18, 2016, at 12:15 AM, Jagat Singh wrote: > > You can use following options > > *

Re: the spark job is so slow - almost frozen

2016-07-19 Thread Andrew Ehrlich
Try: - filtering down the data as soon as possible in the job, dropping columns you don’t need. - processing fewer partitions of the hive tables at a time - caching frequently accessed data, for example dimension tables, lookup tables, or other datasets that are repeatedly accessed - using the
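
A sketch of the first few points, with assumed table and column names:

    // Project and filter as early as possible so less data flows through the rest of the job.
    val facts = sqlContext.table("fact_table")
      .select("id", "dim_id", "amount")          // drop columns you don't need
      .filter("amount > 0")

    // Cache a small, repeatedly joined dimension table.
    val dims = sqlContext.table("dim_table").cache()

    val joined = facts.join(dims, facts("dim_id") === dims("id"))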

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Andrew Ehrlich
There is a Spark<->HBase library that does this. I used it once in a prototype (never tried in production though): http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

Re: Heavy Stage Concentration - Ends With Failure

2016-07-19 Thread Andrew Ehrlich
Yea this is a good suggestion; also check 25th percentile, median, and 75th percentile to see how skewed the input data is. If you find that the RDD’s partitions are skewed you can solve it either by changing the partitioner when you read the files like already suggested, or call repartition()
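
A quick sketch for checking skew and then rebalancing; the partition count is an example:

    // Record counts per partition; a few huge values relative to the rest indicate skew.
    val counts = rdd.mapPartitions(it => Iterator(it.size)).collect().sorted
    println(counts.mkString(", "))

    // Rebalance by shuffling into evenly sized partitions.
    val balanced = rdd.repartition(400)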

Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-19 Thread Andrew Ehrlich
Troubleshooting steps: $ telnet localhost 7077 (on master, to confirm port is open) $ telnet <master hostname> 7077 (on slave, to check whether the port is reachable) If the port is available on the master from the master, but not on the master from the slave, check firewall settings on the master:

Re: Building standalone spark application via sbt

2016-07-19 Thread Andrew Ehrlich
Yes, spark-core will depend on Hadoop and several other jars. Here’s the list of dependencies: https://github.com/apache/spark/blob/master/core/pom.xml#L35 Whether you need spark-sql depends on whether you will use the DataFrame
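
A minimal build.sbt sketch; the versions are examples, and spark-sql is only needed if you use DataFrames or SQL:

    name := "my-spark-app"
    scalaVersion := "2.10.6"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",   // pulls in the Hadoop client jars
      "org.apache.spark" %% "spark-sql"  % "1.6.2" % "provided"
    )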

Re: Spark performance testing

2016-07-08 Thread Andrew Ehrlich
https://github.com/databricks/spark-perf > https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf > http://people.cs.vt.edu/~butta/docs/tpctc2015-sparkbench.pdf > > >> On Sat, Jul 9, 2016 at 11:40 AM, Andrew Ehrlich &

Spark performance testing

2016-07-08 Thread Andrew Ehrlich
Hi group, What solutions are people using to do performance testing and tuning of spark applications? I have been doing a pretty manual technique where I lay out an Excel sheet of various memory settings and caching parameters and then execute each one by hand. It’s pretty tedious though, so

Re: never understand

2016-05-25 Thread Andrew Ehrlich
- Try doing less in each transformation - Try using different data structures within the transformations - Try not caching anything to free up more memory On Wed, May 25, 2016 at 1:32 AM, pseudo oduesp wrote: > hi guys , > -i get this errors with pyspark 1.5.0 under

Re: subtractByKey increases RDD size in memory - any ideas?

2016-02-18 Thread Andrew Ehrlich
There could be clues in the different RDD subclasses; rdd1 is ParallelCollectionRDD but rdd3 is SubtractedRDD. On Thu, Feb 18, 2016 at 1:37 PM, DaPsul wrote: > (copy from > > http://stackoverflow.com/questions/35467128/spark-subtractbykey-increases-rdd-cached-memory-size > ) > >
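
One quick way to see which subclass and lineage each RDD ends up with (a generic sketch, not specific to the linked question):

    // Shows the lineage, the concrete RDD subclasses, and cached storage once materialized.
    println(rdd3.toDebugString)
    println(rdd3.getClass.getSimpleName)   // e.g. SubtractedRDD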

Re: Hive REGEXP_REPLACE use or equivalent in Spark

2016-02-18 Thread Andrew Ehrlich
Use the scala method .split(",") to split the string into a collection of strings, and try using .replaceAll() on the field with the "?" to remove it. On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh wrote: > Hi, > > What is the equivalent of this Hive statement in Spark >
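
A sketch of those two calls on a comma-separated line; the path and the field position are assumed examples:

    val cleaned = sc.textFile("hdfs:///path/to/input")
      .map(_.split(","))
      .map(f => f.updated(2, f(2).replaceAll("\\?", "")))   // strip "?" from the third field (example index)
      .map(_.mkString(","))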

Re: send transformed RDD to s3 from slaves

2015-11-14 Thread Andrew Ehrlich
Maybe you want to be using rdd.saveAsTextFile() ? > On Nov 13, 2015, at 4:56 PM, Walrus theCat wrote: > > Hi, > > I have an RDD which crashes the driver when being collected. I want to send > the data on its partitions out to S3 without bringing it back to the driver.
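
A minimal sketch; each executor writes its own partitions straight to S3 without collecting to the driver (the bucket, path, and s3n scheme are assumptions):

    rdd.map(_.toString)
       .saveAsTextFile("s3n://my-bucket/output/run-1")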