Any way for users to help "stuck" JIRAs with pull requests for Spark 2.3 / future releases?

2017-12-21 Thread Ewan Leith
Hi all, I was wondering with the approach of Spark 2.3 if there's any way us "regular" users can help advance any of the JIRAs that could have made it into Spark 2.3 but are likely to miss now, as the pull requests are awaiting detailed review. For example:

RE: SparkUI via proxy

2016-11-25 Thread Ewan Leith
This is more of a question for the spark user’s list, but if you look at FoxyProxy and SSH tunnels it’ll get you going. These instructions from AWS for accessing EMR are a good start http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html

[Bug 1627769] Re: limits.conf not applied

2016-11-18 Thread Ewan Leith
I've found that editing /etc/systemd/user.conf and inserting lines such as DefaultLimitNOFILE=4 changes the ulimit for the graphical user sessions.

[Desktop-packages] [Bug 1627769] Re: limits.conf not applied

2016-11-18 Thread Ewan Leith
I've found that editing /etc/systemd/user.conf and inserting lines such as DefaultLimitNOFILE=4 changes the ulimit for the graphical user sessions.

[Touch-packages] [Bug 1627769] Re: limits.conf not applied

2016-11-18 Thread Ewan Leith
I've found that editing /etc/systemd/user.conf and inserting lines such as DefaultLimitNOFILE=4 changes the ulimit for the graphical user sessions.

[Touch-packages] [Bug 1627769] Re: limits.conf not applied

2016-11-15 Thread Ewan Leith
This still seems to be an issue for me: logging into the terminal I have the modified ulimit values, but not in a lightdm session.

[Bug 1627769] Re: limits.conf not applied

2016-11-15 Thread Ewan Leith
This still seems to be an issue for me: logging into the terminal I have the modified ulimit values, but not in a lightdm session. https://bugs.launchpad.net/bugs/1627769

[Desktop-packages] [Bug 1627769] Re: limits.conf not applied

2016-11-15 Thread Ewan Leith
This still seems to be an issue for me: logging into the terminal I have the modified ulimit values, but not in a lightdm session. https://bugs.launchpad.net/bugs/1627769

Re: Spark 2.0.1 release?

2016-09-16 Thread Ewan Leith
early next week for rc. On Fri, Sep 16, 2016 at 11:16 AM, Ewan Leith <ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote: Hi all, Apologies if I've missed anything, but is there likely to be a 2.0.1 bug fix release, or does a jump to 2.1.0 with additional

Spark 2.0.1 release?

2016-09-16 Thread Ewan Leith
Hi all, Apologies if I've missed anything, but is there likely to be a 2.0.1 bug fix release, or does a jump to 2.1.0 with additional features seem more probable? The issues for 2.0.1 seem pretty much done here

[jira] [Commented] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()

2016-09-05 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464427#comment-15464427 ] Ewan Leith commented on SPARK-13721: Assuming Don's use case is the same as ours, we have to do odd

[jira] [Commented] (SPARK-17313) Support spark-shell on cluster mode

2016-08-30 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450194#comment-15450194 ] Ewan Leith commented on SPARK-17313: I think Apache Zeppelin and Spark Notebook both cover

[jira] [Commented] (SPARK-17099) Incorrect result when HAVING clause is added to group by query

2016-08-17 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15424277#comment-15424277 ] Ewan Leith commented on SPARK-17099: I've done a quick test in Spark 1.6.1 and this produces

Re: How to resolve the SparkExecption : Size exceeds Integer.MAX_VALUE

2016-08-15 Thread Ewan Leith
I think this is more suited to the user mailing list than the dev one, but this almost always means you need to repartition your data into smaller partitions as one of the partitions is over 2GB. When you create your dataset, put something like .repartition(1000) at the end of the command
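A minimal sketch of that suggestion, assuming `ds` is the Dataset built by the original job and the output path is illustrative:

```scala
// Repartition into smaller partitions so no single partition block exceeds the 2GB limit.
val repartitioned = ds.repartition(1000)   // pick a count that keeps each partition well under 2GB
repartitioned.write.parquet("/path/to/output")   // illustrative output path
```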

Re: zip for pyspark

2016-08-08 Thread Ewan Leith
If you build a normal python egg file with the dependencies, you can execute that like you are executing a .py file with --py-files Thanks, Ewan On 8 Aug 2016 3:44 p.m., pseudo oduesp wrote: hi, how i can export all project on pyspark like zip from local session to

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Ewan Leith
Looking at the Encoders API documentation at http://spark.apache.org/docs/latest/api/java/ == Java == Encoders are specified by calling static methods on Encoders. List<String> data = Arrays.asList("abc", "abc", "xyz");
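A rough Scala counterpart of the Java Encoders example referenced above; the SparkSession setup and the sample data are illustrative, not from the original thread:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().appName("encoders-example").master("local[*]").getOrCreate()
// Encoders are specified by calling static methods on Encoders, here for plain strings.
val ds = spark.createDataset(Seq("abc", "abc", "xyz"))(Encoders.STRING)
ds.show()
```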

RE: how to save spark files as parquets efficiently

2016-07-29 Thread Ewan Leith
If you replace the df.write … with df.count() in your code you’ll see how much time is taken to process the full execution plan without the write output. Yes, the code below looks perfectly normal for writing a parquet file; there shouldn’t be any tuning needed for “normal” performance.
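A sketch of the timing comparison being described, assuming `df` is the DataFrame from the original code and the output path is illustrative:

```scala
// Time the full execution plan without the write...
val countStart = System.nanoTime()
df.count()                                             // forces the whole plan to execute
println(s"count() took ${(System.nanoTime() - countStart) / 1e9} s")

// ...then compare against the plan plus the Parquet write.
val writeStart = System.nanoTime()
df.write.parquet("s3a://my-bucket/output.parquet")     // illustrative path
println(s"write took ${(System.nanoTime() - writeStart) / 1e9} s")
```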

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-23 Thread Ewan Leith
nclude you in that. I will document this as a known issue in the release notes. We have other bugs that we have fixed since RC5, and we can fix those together in 2.0.1. On July 22, 2016 at 10:24:32 PM, Ewan Leith (ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>) wrote: I th

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Ewan Leith
I think this new issue in JIRA blocks the release unfortunately? https://issues.apache.org/jira/browse/SPARK-16664 - Persist call on data frames with more than 200 columns is wiping out the data Otherwise there'll need to be 2.0.1 pretty much right after? Thanks, Ewan On 23 Jul 2016 03:46,

RE: Role-based S3 access outside of EMR

2016-07-21 Thread Ewan Leith
If you use S3A rather than S3N, it supports IAM roles. I think you can make s3a used for s3:// style URLs so it’s consistent with your EMR paths by adding this to your Hadoop config, probably in core-site.xml: fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
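The same properties can also be set programmatically on the SparkContext's Hadoop configuration rather than in core-site.xml; a sketch, with an illustrative bucket path:

```scala
// Route s3a:// (and optionally plain s3://) URLs through the S3A filesystem, which supports IAM roles.
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val df = sqlContext.read.parquet("s3a://my-bucket/some/path/")   // illustrative path
```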

RE: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Ewan Leith
Writing (or reading) small files from spark to s3 can be seriously slow. You'll get much higher throughput by doing a df.foreachPartition(partition => ...) and inside each partition, creating an aws s3 client then doing a partition.foreach and uploading the files using that s3 client with its
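A hypothetical sketch of that per-partition upload pattern; the bucket name, key layout, and use of the AWS SDK v1 client builder are all assumptions, not details from the original thread:

```scala
import java.io.ByteArrayInputStream
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ObjectMetadata
import org.apache.spark.TaskContext

df.toJSON.foreachPartition { records =>
  val s3 = AmazonS3ClientBuilder.defaultClient()        // one client per partition, not per record
  val partitionId = TaskContext.getPartitionId()
  records.zipWithIndex.foreach { case (json, i) =>
    val bytes = json.getBytes("UTF-8")
    val meta = new ObjectMetadata()
    meta.setContentLength(bytes.length)
    // Illustrative bucket and key naming; the partition id keeps keys unique across partitions.
    s3.putObject("my-bucket", s"output/part-$partitionId-$i.json",
      new ByteArrayInputStream(bytes), meta)
  }
}
```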

[jira] [Commented] (SPARK-16363) Spark-submit doesn't work with IAM Roles

2016-07-05 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363249#comment-15363249 ] Ewan Leith commented on SPARK-16363: I'm not sure this is a major issue, but try running

Re: Spark SQL Nested Array of JSON with empty field

2016-06-05 Thread Ewan Leith
The spark json read is unforgiving of things like missing elements from some json records, or mixed types. If you want to pass invalid json files through spark you're best doing an initial parse through the Jackson APIs using a defined schema first, then you can set types like Option[String]
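A sketch of that pre-parsing step using Jackson with the jackson-module-scala dependency; the record type and input path are hypothetical:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import scala.util.Try

// Hypothetical record type; the real fields depend on your data.
case class Record(id: Option[String], value: Option[Long])

val rawLines = sc.textFile("/path/to/json-lines")       // illustrative path
val parsed = rawLines.mapPartitions { lines =>
  // One ObjectMapper per partition, since mappers aren't serializable.
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  // Drop records that fail to parse instead of failing the whole job.
  lines.flatMap(line => Try(mapper.readValue(line, classOf[Record])).toOption)
}
```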

RE: Timed aggregation in Spark

2016-05-23 Thread Ewan Leith
Rather than open a connection per record, if you do a DStream foreachRDD at the end of a 5 minute batch window http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams then you can do a rdd.foreachPartition to get the RDD partitions. Open a connection
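A sketch of that pattern, assuming the batch interval is the 5-minute window and with hypothetical `createConnection`/`send` helpers standing in for the real sink:

```scala
// One connection per partition rather than per record.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = createConnection()                        // hypothetical helper
    try records.foreach(record => conn.send(record))     // hypothetical method
    finally conn.close()
  }
}
```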

Spark Streaming - Exception thrown while writing record: BlockAdditionEvent

2016-05-23 Thread Ewan Leith
As we increase the throughput on our Spark streaming application, we're finding we hit errors with the WriteAheadLog, with errors like this: 16/05/21 20:42:21 WARN scheduler.ReceivedBlockTracker: Exception thrown while writing record:

RE: Spark 1.6.0: substring on df.select

2016-05-12 Thread Ewan Leith
You could use a UDF pretty easily, something like this should work, the lastElement function could be changed to do pretty much any string manipulation you want. import org.apache.spark.sql.functions.udf def lastElement(input: String) = input.split("/").last val lastElementUdf =
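The snippet above is cut off by the archive; a completed version of the same pattern, assuming a DataFrame `df` with an illustrative string column named "path", might look like:

```scala
import org.apache.spark.sql.functions.udf

// Take the last element of a '/'-separated string.
def lastElement(input: String) = input.split("/").last
val lastElementUdf = udf(lastElement _)
val result = df.select(lastElementUdf(df("path")).as("last_element"))   // column name is illustrative
```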

RE: Parse Json in Spark

2016-05-09 Thread Ewan Leith
The simplest way is probably to use the sc.binaryFiles or sc.wholeTextFiles API to create an RDD containing the JSON files (maybe need a sc.wholeTextFiles(…).map(x => x._2) to drop off the filename column) then do a sqlContext.read.json(rddName) That way, you don’t need to worry about
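A minimal sketch of that approach, with an illustrative input path:

```scala
// Read each whole JSON file, keep only the content (dropping the filename),
// then let the JSON reader infer the schema.
val rawJson = sc.wholeTextFiles("/path/to/json/files").map(_._2)
val df = sqlContext.read.json(rawJson)
```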

RE: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ewan Leith
That’s what I thought, it’s a shame! Thanks Saisai, Ewan From: Saisai Shao [mailto:sai.sai.s...@gmail.com] Sent: 15 March 2016 09:22 To: Ewan Leith <ewan.le...@realitymine.com> Cc: user <user@spark.apache.org> Subject: Re: Spark streaming - update configuration while retaining writ

Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ewan Leith
Has anyone seen a way of updating the Spark streaming job configuration while retaining the existing data in the write ahead log? e.g. if you've launched a job without enough executors and a backlog has built up in the WAL, can you increase the number of executors without losing the WAL data?

[jira] [Created] (SPARK-13623) Relaxed mode for querying Dataframes, so columns that don't exist or have an incompatible schema return null rather than error

2016-03-02 Thread Ewan Leith (JIRA)
Ewan Leith created SPARK-13623: -- Summary: Relaxed mode for querying Dataframes, so columns that don't exist or have an incompatible schema return null rather than error Key: SPARK-13623 URL: https

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
a few times. Can you create a JIRA ticket so we can track it? Would be even better if you are interested in working on a patch! Thanks. On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith <ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote: Hi Reynold, yes that woul

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
for fields that doesn't exist or have incompatible schema? On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith <ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote: Thanks Michael, it's not a great example really, as the data I'm working with has some source files

Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Ewan Leith
The Apache Commons library will let you access files on an SFTP server via a Java library, no local file handling involved https://commons.apache.org/proper/commons-vfs/filesystems.html Hope this helps, Ewan I wonder if anyone has opened a SFTP connection to open a remote GZIP CSV file? I am
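A sketch of that commons-vfs suggestion; the host, credentials, and path are illustrative, and the SFTP provider needs the JSch dependency on the classpath:

```scala
import java.util.zip.GZIPInputStream
import org.apache.commons.vfs2.VFS
import scala.io.Source

val manager = VFS.getManager()
// Open the remote gzipped CSV over SFTP, no local copy needed.
val remote = manager.resolveFile("sftp://user:password@sftp.example.com/data/file.csv.gz")
val lines = Source.fromInputStream(
  new GZIPInputStream(remote.getContent.getInputStream)).getLines().toList
val rdd = sc.parallelize(lines)   // hand the decompressed lines to Spark
```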

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
2016 at 1:44 AM, Ewan Leith <ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote: When you create a dataframe using the sqlContext.read.schema() API, if you pass in a schema that's compatible with some of the records, but incompatible with others, it seems

Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
When you create a dataframe using the sqlContext.read.schema() API, if you pass in a schema that's compatible with some of the records, but incompatible with others, it seems you can't do a .select on the problematic columns, instead you get an AnalysisException error. I know loading the wrong

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-26 Thread Ewan Leith
I’ve just tried running this using a normal stdin redirect: ~/spark/bin/spark-shell < simple.scala Which worked: it started spark-shell, executed the script, then stopped the shell. Thanks, Ewan From: Iulian Dragoș [mailto:iulian.dra...@typesafe.com] Sent: 26 January 2016 15:00 To:

RE: Write to S3 with server side encryption in KMS mode

2016-01-26 Thread Ewan Leith
Hi Nisrina, I’m not aware of any support for KMS keys in s3n, s3a or the EMR specific EMRFS s3 driver. If you’re using EMRFS with Amazon’s EMR, you can use KMS keys with client-side encryption http://docs.aws.amazon.com/kms/latest/developerguide/services-emr.html#emrfs-encrypt If this has

RE: Spark 1.6.1

2016-01-25 Thread Ewan Leith
Hi Brandon, It's relatively straightforward to try out different type options for this in the spark-shell; try pasting the attached code into spark-shell before you make a normal postgres JDBC connection. You can then experiment with the mappings without recompiling Spark or anything like
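The attached code isn't preserved in the archive; as a guess at its general shape, registering a custom JdbcDialect that remaps one JDBC type before connecting might look like this (the type mapping itself is purely illustrative):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

val customPostgresDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")
  override def getCatalystType(sqlType: Int, typeName: String, size: Int,
                               md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.OTHER) Some(StringType) else None   // illustrative mapping
}
JdbcDialects.registerDialect(customPostgresDialect)
```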

[jira] [Comment Edited] (SPARK-12764) XML Column type is not supported

2016-01-12 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093836#comment-15093836 ] Ewan Leith edited comment on SPARK-12764 at 1/12/16 12:53 PM: -- What are you

[jira] [Commented] (SPARK-12764) XML Column type is not supported

2016-01-12 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093836#comment-15093836 ] Ewan Leith commented on SPARK-12764: What are you expecting it to do, output the XML as a string

[jira] [Comment Edited] (SPARK-12764) XML Column type is not supported

2016-01-12 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093836#comment-15093836 ] Ewan Leith edited comment on SPARK-12764 at 1/12/16 12:57 PM: -- What are you

RE: Out of memory issue

2016-01-06 Thread Ewan Leith
Hi Muthu, this could be related to a known issue in the release notes http://spark.apache.org/releases/spark-release-1-6-0.html Known issues SPARK-12546 - Save DataFrame/table as Parquet with dynamic partitions may cause OOM; this can be worked around by decreasing the memory used by both

RE: How to accelerate reading json file?

2016-01-06 Thread Ewan Leith
If you already know the schema, then you can run the read with the schema parameter like this: val path = "examples/src/main/resources/jsonfile" val jsonSchema = StructType( StructField("id",StringType,true) :: StructField("reference",LongType,true) ::
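A completed form of that pattern; the remaining fields of the original schema aren't shown in the archive, so only the two visible ones appear here:

```scala
import org.apache.spark.sql.types._

val path = "examples/src/main/resources/jsonfile"
val jsonSchema = StructType(
  StructField("id", StringType, true) ::
  StructField("reference", LongType, true) :: Nil)
// Supplying the schema skips the inference pass over the data, which is the speed-up.
val df = sqlContext.read.schema(jsonSchema).json(path)
```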

Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
Hi all, I'm sure this must have been solved already, but I can't see anything obvious. Using Spark Streaming, I'm trying to execute a transform function on a DStream at short batch intervals (e.g. 1 second), but only write the resulting data to disk using saveAsTextFiles in a larger batch
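One possible sketch (not necessarily what the thread settled on): force the transform to run on each short batch, but only write on a longer window. The `dstream` and `expensiveTransform` names are hypothetical:

```scala
import org.apache.spark.streaming.Seconds

val transformed = dstream.map(expensiveTransform)     // hypothetical transform function
transformed.cache()
transformed.foreachRDD(rdd => rdd.count())            // forces the work every short batch
// Write the accumulated results once per 60-second window (illustrative path).
transformed.window(Seconds(60), Seconds(60)).saveAsTextFiles("/output/prefix")
```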

RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
, thanks. Thanks, Ewan From: Ashic Mahtab [mailto:as...@live.com] Sent: 31 December 2015 13:50 To: Ewan Leith <ewan.le...@realitymine.com>; Apache Spark <user@spark.apache.org> Subject: RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functi

[jira] [Commented] (SPARK-11948) Permanent UDF not work

2015-12-14 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-11948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056155#comment-15056155 ] Ewan Leith commented on SPARK-11948: I think this is a duplicate of SPARK-11609 ? > Permanent

RE: Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Ewan Leith
How big do you expect the file to be? Spark has issues with single blocks over 2GB (see https://issues.apache.org/jira/browse/SPARK-1476 and https://issues.apache.org/jira/browse/SPARK-6235 for example) If you don’t know, try running df.repartition(100).write.format… to get an idea of how

RE: Dataframe nested schema inference from Json without type conflicts

2015-10-23 Thread Ewan Leith
llable = true) |-- long: string (nullable = true) |-- null: string (nullable = true) |-- string: string (nullable = true) Thanks, Ewan From: Yin Huai [mailto:yh...@databricks.com] Sent: 01 October 2015 23:54 To: Ewan Leith <ewan.le...@realitymine.com> Cc: r...@databricks.com; dev@spark.apac

RE: Spark Streaming - use the data in different jobs

2015-10-19 Thread Ewan Leith
Storing the data in HBase, Cassandra, or similar is possibly the right answer, the other option that can work well is re-publishing the data back into second queue on RabbitMQ, to be read again by the next job. Thanks, Ewan From: Oded Maimon [mailto:o...@scene53.com] Sent: 18 October 2015

RE: Should I convert json into parquet?

2015-10-19 Thread Ewan Leith
As Jörn says, Parquet and ORC will get you really good compression and can be much faster. There also some nice additions around predicate pushdown which can be great if you've got wide tables. Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is described here

[jira] [Created] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

2015-10-06 Thread Ewan Leith (JIRA)
Ewan Leith created SPARK-10947: -- Summary: With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings Key: SPARK-10947 URL: https://issues.apache.org/jira/browse/SPARK

Re: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
Thanks Yin, I'll put together a JIRA and a PR tomorrow. Ewan -- Original message-- From: Yin Huai Date: Mon, 5 Oct 2015 17:39 To: Ewan Leith; Cc: dev@spark.apache.org; Subject:Re: Dataframe nested schema inference from Json without type conflicts Hello Ewan, Adding a JSON

RE: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
tly works, does anyone think a pull request would plausibly get into the Spark main codebase? Thanks, Ewan From: Ewan Leith [mailto:ewan.le...@realitymine.com] Sent: 02 October 2015 01:57 To: yh...@databricks.com Cc: r...@databricks.com; dev@spark.apache.org Subject: Re: Dataframe nested sch

Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
Hi all, We really like the ability to infer a schema from JSON contained in an RDD, but when we're using Spark Streaming on small batches of data, we sometimes find that Spark infers a more specific type than it should use, for example if the json in that small batch only contains integer

Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
probably have to adopt if we can't come up with a way to keep the inference working. Thanks, Ewan -- Original message-- From: Reynold Xin Date: Thu, 1 Oct 2015 22:12 To: Ewan Leith; Cc: dev@spark.apache.org; Subject:Re: Dataframe nested schema inference from Json without type

Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
Exactly, that's a much better way to put it. Thanks, Ewan -- Original message-- From: Yin Huai Date: Thu, 1 Oct 2015 23:54 To: Ewan Leith; Cc: r...@databricks.com;dev@spark.apache.org; Subject:Re: Dataframe nested schema inference from Json without type conflicts Hi Ewan

RE: Need for advice - performance improvement and out of memory resolution

2015-09-30 Thread Ewan Leith
Try reducing the number of workers to 2, and increasing their memory up to 6GB. However, I've seen mention of a bug in the pyspark API when calling head() on a dataframe in Spark 1.5.0 and 1.4; it's got a big performance hit. https://issues.apache.org/jira/browse/SPARK-10731 It's fixed in

RE: Converting a DStream to schemaRDD

2015-09-29 Thread Ewan Leith
Something like: dstream.foreachRDD { rdd => val df = sqlContext.read.json(rdd) df.select(…) } https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams might be the place to start; it’ll convert each batch of dstream into an RDD then let you work

SQLContext.read().json() inferred schema - force type to strings?

2015-09-25 Thread Ewan Leith
Hi all, We're using SQLContext.read.json to read in a stream of JSON datasets, but sometimes the inferred schema contains a LongType for a given value, and sometimes a DoubleType. This obviously causes problems with merging the schema, so does anyone know a way of forcing the inferred
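In Spark 1.6 and later (the behaviour tracked in SPARK-10947, listed above), the JSON reader can be told to infer every primitive as a string; a sketch with an illustrative path:

```scala
// All inferred primitives (longs, doubles, booleans) come back as StringType,
// so small batches can no longer disagree about numeric types.
val df = sqlContext.read
  .option("primitivesAsString", "true")
  .json("/path/to/json")
```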

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-19 Thread Ewan Leith
yarn-client still runs the executor tasks on the cluster, the main difference is where the driver job runs. Thanks, Ewan -- Original message-- From: shahab Date: Fri, 18 Sep 2015 13:11 To: Aniket Bhatnagar; Cc: user@spark.apache.org; Subject:Re: Zeppelin on Yarn :

Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-10 Thread Ewan Leith
The last time I checked, if you launch EMR 4 with only Spark selected as an application, HDFS isn't correctly installed. Did you select another application like Hive at launch time as well as Spark? If not, try that. Thanks, Ewan -- Original message-- From: Dean Wampler Date:

RE: NOT IN in Spark SQL

2015-09-04 Thread Ewan Leith
Spark SQL doesn’t support “NOT IN”, but I think HiveQL does, so give using the HiveContext a try rather than SQLContext. Here’s the spark 1.2 docs on it, but it’s basically identical to running the SQLContext https://spark.apache.org/docs/1.2.0/sql-programming-guide.html#tab_scala_6
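A sketch of that suggestion, with illustrative table and column names:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val df = hiveContext.read.json("/path/to/data")   // illustrative source
df.registerTempTable("events")
// HiveQL accepts NOT IN, unlike the plain SQLContext parser of that era.
hiveContext.sql("SELECT * FROM events WHERE country NOT IN ('US', 'GB')").show()
```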

spark-csv package - output to filename.csv?

2015-09-03 Thread Ewan Leith
Using the spark-csv package or outputting to text files, you end up with files named: test.csv/part-00 rather than a more user-friendly "test.csv", even if there's only 1 part file. We can merge the files using the Hadoop merge command with something like this code from
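A sketch of that merge step using Hadoop's copyMerge (paths are illustrative):

```scala
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = sc.hadoopConfiguration
val fs = FileSystem.get(hadoopConf)
// Concatenate the part files under test.csv/ into a single merged file.
FileUtil.copyMerge(fs, new Path("test.csv"), fs, new Path("test-merged.csv"),
  false, hadoopConf, null)
```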

RE: How to Take the whole file as a partition

2015-09-03 Thread Ewan Leith
Have a look at the sparkContext.binaryFiles, it works like wholeTextFiles but returns a PortableDataStream per file. It might be a workable solution though you'll need to handle the binary to UTF-8 or equivalent conversion Thanks, Ewan From: Shuai Zheng [mailto:szheng.c...@gmail.com] Sent: 03
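A sketch of that conversion, with an illustrative path:

```scala
import java.nio.charset.StandardCharsets

// One whole file per record, decoded from the PortableDataStream as UTF-8.
val files = sc.binaryFiles("/path/to/files")
val contents = files.map { case (path, stream) =>
  (path, new String(stream.toArray, StandardCharsets.UTF_8))
}
```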

Re: Problem while loading saved data

2015-09-03 Thread Ewan Leith
From that, I'd guess that HDFS isn't set up between the nodes, or for some reason writes are defaulting to file:///path/ rather than hdfs:///path/ -- Original message-- From: Amila De Silva Date: Thu, 3 Sep 2015 17:12 To: Ewan Leith; Cc: user@spark.apache.org; Subj

RE: Problem while loading saved data

2015-09-03 Thread Ewan Leith
Your error log shows you attempting to read from 'people.parquet2' not 'people.parquet' as you’ve put below; is that just from a different attempt? Otherwise, it’s an odd one! There aren’t _SUCCESS, _common_metadata and _metadata files under people.parquet that you’ve listed below, which would

RE: spark 1.4.1 saveAsTextFile (and Parquet) is slow on emr-4.0.0

2015-09-03 Thread Ewan Leith
For those who have similar issues on EMR writing Parquet files, if you update mapred-site.xml with the following property settings: mapred.output.direct.EmrFileSystem=true, mapred.output.direct.NativeS3FileSystem=true, parquet.enable.summary-metadata=false

[jira] [Created] (SPARK-10419) Add SQLServer JdbcDialect support for datetimeoffset types

2015-09-02 Thread Ewan Leith (JIRA)
Ewan Leith created SPARK-10419: -- Summary: Add SQLServer JdbcDialect support for datetimeoffset types Key: SPARK-10419 URL: https://issues.apache.org/jira/browse/SPARK-10419 Project: Spark Issue

RE: How to increase the Json parsing speed

2015-08-28 Thread Ewan Leith
Can you post roughly what you’re running as your Spark code? One issue I’ve seen before is that passing a directory full of files as a path “/path/to/files/” can be slow, while “/path/to/files/*” runs fast. Also, if you’ve not seen it, have a look at the binaryFiles call

RE: correct use of DStream foreachRDD

2015-08-28 Thread Ewan Leith
I think what you’ll want is to carry out the .map functions before the foreachRDD, something like: val lines = ssc.textFileStream("/stream").map(Sensor.parseSensor).map(Sensor.convertToPut) lines.foreachRDD { rdd => // parse the line of data into sensor object

RE: Driver running out of memory - caused by many tasks?

2015-08-27 Thread Ewan Leith
Are you using the Kryo serializer? If not, have a look at it, it can save a lot of memory during shuffles https://spark.apache.org/docs/latest/tuning.html I did a similar task and had various issues with the volume of data being parsed in one go, but that helped a lot. It looks like the main
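Enabling Kryo as suggested is a one-line config change; a sketch, where MyRecord is a hypothetical class being shuffled:

```scala
import org.apache.spark.SparkConf

case class MyRecord(id: String, values: Array[Double])   // hypothetical record type

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))          // registration is optional but saves space
```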

Selecting different levels of nested data records during one select?

2015-08-27 Thread Ewan Leith
Hello, I'm trying to query a nested data record of the form: root |-- userid: string (nullable = true) |-- datarecords: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- name: string (nullable = true) | | |-- system: boolean (nullable = true) | |

RE: Create column in nested structure?

2015-08-13 Thread Ewan Leith
Never mind me, I've found an email to this list from Raghavendra Pandey which got me what I needed: val nestedCol = struct(df("nested2.column1"), df("nested2.column2"), df("flatcolumn")) val df2 = df.select(df("nested1"), nestedCol as "nested2") Thanks, Ewan From: Ewan Leith Sent: 13 August 2015 15:44

Create column in nested structure?

2015-08-13 Thread Ewan Leith
Has anyone used withColumn (or another method) to add a column to an existing nested dataframe? If I call: df.withColumn("nested.newcolumn", df("oldcolumn")) then it just creates the new column with a "." in its name, not under the nested structure. Thanks, Ewan

Parquet file organisation for 100GB+ dataframes

2015-08-12 Thread Ewan Leith
Hi all, Can anyone share their experiences working with storing and organising larger datasets with Spark? I've got a dataframe stored in Parquet on Amazon S3 (using EMRFS) which has a fairly complex nested schema (based on JSON files), which I can query in Spark, but the initial setup takes

RE: Specifying the role when launching an AWS spark cluster using spark_ec2

2015-08-07 Thread Ewan Leith
You'll have a lot less hassle using the AWS EMR instances with Spark 1.4.1 for now, until the spark_ec2.py scripts move to Hadoop 2.7.1; at the moment I'm pretty sure it's only using Hadoop 2.4. The EMR setup with Spark lets you use s3:// URIs with IAM roles Ewan -Original Message-

RE: Help accessing protected S3

2015-07-23 Thread Ewan Leith
I think the standard S3 driver used in Spark from the Hadoop project (S3n) doesn't support IAM role based authentication. However, S3a should support it. If you're running Hadoop 2.6 via the spark-ec2 scripts (I'm not sure what it launches with by default) try accessing your bucket via s3a://

RE: coalesce on dataFrame

2015-07-01 Thread Ewan Leith
It's in spark 1.4.0, or should be at least: https://issues.apache.org/jira/browse/SPARK-6972 Ewan -Original Message- From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com] Sent: 01 July 2015 08:23 To: user@spark.apache.org Subject: coalesce on dataFrame How can we use coalesce(1, true)

[jira] [Created] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

2015-06-18 Thread Ewan Leith (JIRA)
Ewan Leith created SPARK-8437: - Summary: Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles Key: SPARK-8437 URL: https://issues.apache.org/jira

[jira] [Commented] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

2015-06-18 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591615#comment-14591615 ] Ewan Leith commented on SPARK-8437: --- Thanks, I wasn't sure if it was Hadoop or Spark

[jira] [Comment Edited] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

2015-06-18 Thread Ewan Leith (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591615#comment-14591615 ] Ewan Leith edited comment on SPARK-8437 at 6/18/15 10:51 AM

RE: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Try putting a * on the end of xmlDir, i.e. xmlDir = hdfs:///abc/def/* rather than xmlDir = hdfs:///abc/def and see what happens. I don't know why, but that appears to be more reliable for me with S3 as the filesystem. I'm also using binaryFiles, but I've tried running the same command while

RE: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Can you do a simple sc.binaryFiles("hdfs:///path/to/files/*").count() in the spark-shell and verify that part works? Ewan -Original Message- From: Konstantinos Kougios [mailto:kostas.koug...@googlemail.com] Sent: 08 June 2015 15:40 To: Ewan Leith; user@spark.apache.org Subject: Re

RE: redshift spark

2015-06-05 Thread Ewan Leith
That project is for reading data in from Redshift table exports stored in s3 by running commands in redshift like this: unload ('select * from venue') to 's3://mybucket/tickit/unload/' http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html The path in the parameters below is

AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Hi all, I might be missing something, but does the new Spark 1.3 sqlContext save interface support using Avro as the schema structure when writing Parquet files, in a similar way to AvroParquetWriter (which I've got working)? I've seen how you can load an avro file and save it as parquet from

RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Thanks Cheng, that's brilliant, you've saved me a headache. Ewan From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: 19 May 2015 11:58 To: Ewan Leith; user@spark.apache.org Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces? That's right

RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Lian [mailto:lian.cs@gmail.com] Sent: 19 May 2015 11:01 To: Ewan Leith; user@spark.apache.org Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces? Hi Ewan, Different from AvroParquetWriter, in Spark SQL we uses StructType as the intermediate

Re: [nodejs] Unable to compile node v0.8 RC7 on ARM (Beaglebone)

2012-06-22 Thread Ewan Leith
any additional compiler flags and see what happens. Thanks, Ewan On Friday, 22 June 2012 14:51:29 UTC+1, Ben Noordhuis wrote: On Fri, Jun 22, 2012 at 3:39 PM, Ewan Leith wrote: Hi all, I'm trying to compile node v0.8 rc7 on my beaglebone, and the configure script it falling over. It does

Re: [nodejs] Unable to compile node v0.8 RC7 on ARM (Beaglebone)

2012-06-22 Thread Ewan Leith
/beaglebone http://archlinuxarm.org/packages - search for armv7 and nodejs. Currently it has v0.6.19 http://archlinuxarm.org/developers/building-packages On Fri, Jun 22, 2012 at 9:06 AM, Ewan Leith wrote: Thanks Ben, I added armv7 then it started complaining about arm_neon, so I've added

Suffix authentication in users file

2002-11-25 Thread Ewan Leith
I've just been trying to get freeradius working instead of citron radius, but I've run into a problem with the suffix parameter setting in /etc/raddb/users. My understanding of the Suffix was that: DEFAULT Suffix == NC, Auth-Type := System Service-Type = Framed-User,

Re: Suffix authentication in users file

2002-11-25 Thread Ewan Leith
Works perfectly, thanks; obvious when you think about it, I suppose :) Ewan Chris Parker wrote: Yes, so use the 'hints' file as the documentation at the beginning of the hints file tells you how to do exactly what you are looking for. -Chris -- - List info/subscribe/unsubscribe? See

[ADMIN] Moving a database

2001-12-10 Thread Ewan Leith
Hi all, we recently upgraded from 6.5.3 to 7.1.2, and at the same time tried to move the database to a new filesystem. However, while the upgrade was 100% successful using pg_dumpall, we see that postgres is still reading some files from the old file system (though only updating the new files).