Any way for users to help "stuck" JIRAs with pull requests for Spark 2.3 / future releases?

2017-12-21 Thread Ewan Leith
Hi all,

I was wondering, with the approach of Spark 2.3, if there's any way we "regular" 
users can help advance any of the JIRAs that could have made it into Spark 2.3 but 
are likely to miss it now, as their pull requests are awaiting detailed review.

For example:

https://issues.apache.org/jira/browse/SPARK-4502 - Spark SQL reads unnecessary 
nested fields from Parquet

Has a pull request from January 2017 with significant performance benefits for 
parquet reads.

https://issues.apache.org/jira/browse/SPARK-21657 - Spark has exponential time 
complexity to explode(array of structs)

Probably affects fewer users, but will be a real help for those users.

Both of these example tickets probably need more testing, but without them 
getting merged into the master branch and included in a release with a default 
config setting disabling them, the testing will be pretty limited.

Is there anything we users can do to help out with these kinds of tickets, or do 
they need to wait for some additional core developer time to free up? (I know 
that's in huge demand everywhere in the project!)

Thanks,
Ewan








RE: SparkUI via proxy

2016-11-25 Thread Ewan Leith
This is more of a question for the spark user’s list, but if you look at 
FoxyProxy and SSH tunnels it’ll get you going.

These instructions from AWS for accessing EMR are a good start

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-proxy.html

Ewan

From: Georg Heiler [mailto:georg.kf.hei...@gmail.com]
Sent: 24 November 2016 16:41
To: marco rocchi ; dev@spark.apache.org
Subject: Re: SparkUI via proxy

Port forwarding will help you out.
marco rocchi wrote on Thu, 24 Nov 2016 at 16:33:
Hi,
I'm working with Apache Spark in order to develop my master thesis. I'm new to 
Spark and to working with clusters. I searched the internet but didn't find a 
way to solve this.
My problem is the following: from my PC I can access the master node of a 
cluster only via a proxy.
To connect to the proxy and then to the master node, I have to set up an SSH 
tunnel, but from a practical point of view I have no idea how I can interact 
with the Spark Web UI this way.
Can anyone help me?
Thanks in advance





[Bug 1627769] Re: limits.conf not applied

2016-11-18 Thread Ewan Leith
I've found that editing /etc/systemd/user.conf and inserting lines such
as

DefaultLimitNOFILE=4

changes the ulimit for the graphical user sessions

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1627769

Title:
  limits.conf not applied

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lightdm/+bug/1627769/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Desktop-packages] [Bug 1627769] Re: limits.conf not applied

2016-11-18 Thread Ewan Leith
I've found that editing /etc/systemd/user.conf and inserting lines such
as

DefaultLimitNOFILE=4

changes the ulimit for the graphical user sessions

-- 
You received this bug notification because you are a member of Desktop
Packages, which is subscribed to lightdm in Ubuntu.
https://bugs.launchpad.net/bugs/1627769

Title:
  limits.conf not applied

Status in lightdm package in Ubuntu:
  Confirmed

Bug description:
  Since upgraded to 16.10 Yakkety, modifications in
  /etc/security/limits.conf are not taken into consideration when
  logging in the graphical interface.

  
  /etc/security/limits.conf:
  @audio   -  rtprio 99
  @audio   -  memlock    unlimited

  I tried the same settings in /etc/security/limits.d/audio.conf, to the
  same results.

  
  After logging in Unity, opening a console, the limits are not set:
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) 64
  real-time priority  (-r) 0

  
  Relogging to my user via bash DOES apply the limits:
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) 64
  real-time priority  (-r) 0
  blablack@ideaon:~$ su blablack
  Password: 
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) unlimited
  real-time priority  (-r) 95

  
  Switching to a console (ctrl+alt+f1) and logging in would apply the limits as 
well.



  The exact same setup used to work fine on Xenial 16.04 before upgrade.

  
  If you need any more information, please let me know.

  ProblemType: Bug
  DistroRelease: Ubuntu 16.10
  Package: lightdm 1.19.4-0ubuntu1
  ProcVersionSignature: Ubuntu 4.8.0-16.17-generic 4.8.0-rc7
  Uname: Linux 4.8.0-16-generic x86_64
  ApportVersion: 2.20.3-0ubuntu7
  Architecture: amd64
  CurrentDesktop: Unity
  Date: Mon Sep 26 17:27:10 2016
  SourcePackage: lightdm
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lightdm/+bug/1627769/+subscriptions

-- 
Mailing list: https://launchpad.net/~desktop-packages
Post to : desktop-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~desktop-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 1627769] Re: limits.conf not applied

2016-11-18 Thread Ewan Leith
I've found that editing /etc/systemd/user.conf and inserting lines such
as

DefaultLimitNOFILE=4

changes the ulimit for the graphical user sessions

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to lightdm in Ubuntu.
https://bugs.launchpad.net/bugs/1627769

Title:
  limits.conf not applied

Status in lightdm package in Ubuntu:
  Confirmed

Bug description:
  Since upgraded to 16.10 Yakkety, modifications in
  /etc/security/limits.conf are not taken into consideration when
  logging in the graphical interface.

  
  /etc/security/limits.conf:
  @audio   -  rtprio 99
  @audio   -  memlock    unlimited

  I tried the same settings in /etc/security/limits.d/audio.conf, to the
  same results.

  
  After logging in Unity, opening a console, the limits are not set:
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) 64
  real-time priority  (-r) 0

  
  Relogging to my user via bash DOES apply the limits:
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) 64
  real-time priority  (-r) 0
  blablack@ideaon:~$ su blablack
  Password: 
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) unlimited
  real-time priority  (-r) 95

  
  Switching to a console (ctrl+alt+f1) and logging in would apply the limits as 
well.



  The exact same setup used to work fine on Xenial 16.04 before upgrade.

  
  If you need any more information, please let me know.

  ProblemType: Bug
  DistroRelease: Ubuntu 16.10
  Package: lightdm 1.19.4-0ubuntu1
  ProcVersionSignature: Ubuntu 4.8.0-16.17-generic 4.8.0-rc7
  Uname: Linux 4.8.0-16-generic x86_64
  ApportVersion: 2.20.3-0ubuntu7
  Architecture: amd64
  CurrentDesktop: Unity
  Date: Mon Sep 26 17:27:10 2016
  SourcePackage: lightdm
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lightdm/+bug/1627769/+subscriptions

-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Touch-packages] [Bug 1627769] Re: limits.conf not applied

2016-11-15 Thread Ewan Leith
Still seems to be an issue for me, logging into the terminal I have the
modified ulimit values, but not in a lightdm session

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to lightdm in Ubuntu.
https://bugs.launchpad.net/bugs/1627769

Title:
  limits.conf not applied

Status in lightdm package in Ubuntu:
  Confirmed

Bug description:
  Since upgraded to 16.10 Yakkety, modifications in
  /etc/security/limits.conf are not taken into consideration when
  logging in the graphical interface.

  
  /etc/security/limits.conf:
  @audio   -  rtprio 99
  @audio   -  memlock    unlimited

  I tried the same settings in /etc/security/limits.d/audio.conf, to the
  same results.

  
  After logging in Unity, opening a console, the limits are not set:
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) 64
  real-time priority  (-r) 0

  
  Relogging to my user via bash DOES apply the limits:
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) 64
  real-time priority  (-r) 0
  blablack@ideaon:~$ su blablack
  Password: 
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) unlimited
  real-time priority  (-r) 95

  
  Switching to a console (ctrl+alt+f1) and logging in would apply the limits as 
well.



  The exact same setup used to work fine on Xenial 16.04 before upgrade.

  
  If you need any more information, please let me know.

  ProblemType: Bug
  DistroRelease: Ubuntu 16.10
  Package: lightdm 1.19.4-0ubuntu1
  ProcVersionSignature: Ubuntu 4.8.0-16.17-generic 4.8.0-rc7
  Uname: Linux 4.8.0-16-generic x86_64
  ApportVersion: 2.20.3-0ubuntu7
  Architecture: amd64
  CurrentDesktop: Unity
  Date: Mon Sep 26 17:27:10 2016
  SourcePackage: lightdm
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lightdm/+bug/1627769/+subscriptions

-- 
Mailing list: https://launchpad.net/~touch-packages
Post to : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp


[Bug 1627769] Re: limits.conf not applied

2016-11-15 Thread Ewan Leith
Still seems to be an issue for me, logging into the terminal I have the
modified ulimit values, but not in a lightdm session

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1627769

Title:
  limits.conf not applied

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lightdm/+bug/1627769/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Desktop-packages] [Bug 1627769] Re: limits.conf not applied

2016-11-15 Thread Ewan Leith
Still seems to be an issue for me, logging into the terminal I have the
modified ulimit values, but not in a lightdm session

-- 
You received this bug notification because you are a member of Desktop
Packages, which is subscribed to lightdm in Ubuntu.
https://bugs.launchpad.net/bugs/1627769

Title:
  limits.conf not applied

Status in lightdm package in Ubuntu:
  Confirmed

Bug description:
  Since upgraded to 16.10 Yakkety, modifications in
  /etc/security/limits.conf are not taken into consideration when
  logging in the graphical interface.

  
  /etc/security/limits.conf:
  @audio   -  rtprio 99
  @audio   -  memlock    unlimited

  I tried the same settings in /etc/security/limits.d/audio.conf, to the
  same results.

  
  After logging in Unity, opening a console, the limits are not set:
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) 64
  real-time priority  (-r) 0

  
  Relogging to my user via bash DOES apply the limits:
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) 64
  real-time priority  (-r) 0
  blablack@ideaon:~$ su blablack
  Password: 
  blablack@ideaon:~$ ulimit -l -r
  max locked memory   (kbytes, -l) unlimited
  real-time priority  (-r) 95

  
  Switching to a console (ctrl+alt+f1) and logging in would apply the limits as 
well.



  The exact same setup used to work fine on Xenial 16.04 before upgrade.

  
  If you need any more information, please let me know.

  ProblemType: Bug
  DistroRelease: Ubuntu 16.10
  Package: lightdm 1.19.4-0ubuntu1
  ProcVersionSignature: Ubuntu 4.8.0-16.17-generic 4.8.0-rc7
  Uname: Linux 4.8.0-16-generic x86_64
  ApportVersion: 2.20.3-0ubuntu7
  Architecture: amd64
  CurrentDesktop: Unity
  Date: Mon Sep 26 17:27:10 2016
  SourcePackage: lightdm
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lightdm/+bug/1627769/+subscriptions

-- 
Mailing list: https://launchpad.net/~desktop-packages
Post to : desktop-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~desktop-packages
More help   : https://help.launchpad.net/ListHelp


Re: Spark 2.0.1 release?

2016-09-16 Thread Ewan Leith
That's great news. Since it's that close, I'll get started on building and 
testing the branch myself.

Thanks,
Ewan

On 16 Sep 2016 19:23, Reynold Xin <r...@databricks.com> wrote:
2.0.1 is definitely coming soon.  Was going to tag a rc yesterday but ran into 
some issue. I will try to do it early next week for rc.


On Fri, Sep 16, 2016 at 11:16 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:

Hi all,

Apologies if I've missed anything, but is there likely to be a 2.0.1 bug fix 
release, or does a jump to 2.1.0 with additional features seem more probable?

The issues for 2.0.1 seem pretty much done here 
https://issues.apache.org/jira/browse/SPARK/fixforversion/12336857/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel

And there's a lot of good bugfixes in there that I'd love to be able to use 
without deploying our own builds.

Thanks,
Ewan





Spark 2.0.1 release?

2016-09-16 Thread Ewan Leith
Hi all,

Apologies if I've missed anything, but is there likely to be a 2.0.1 bug fix 
release, or does a jump to 2.1.0 with additional features seem more probable?

The issues for 2.0.1 seem pretty much done here 
https://issues.apache.org/jira/browse/SPARK/fixforversion/12336857/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel

And there's a lot of good bugfixes in there that I'd love to be able to use 
without deploying our own builds.

Thanks,
Ewan





[jira] [Commented] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()

2016-09-05 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464427#comment-15464427
 ] 

Ewan Leith commented on SPARK-13721:


Assuming Don's use case is the same as ours, we have to do odd-looking queries 
like this pseudo-code to get the full set of entries when using explode with 
records where the nested array is not always populated (with the .filter calls 
included to make it explicit what's happening):

import org.apache.spark.sql.functions.{col, explode, lit}

// rows where the nested array is populated: explode as normal
val df1 = df
  .filter("column.nested_array is not null")
  .withColumn("element", explode(col("column.nested_array")))
  .select("other_column", "element")

// rows where the nested array is missing: keep them, with an empty placeholder element
val df2 = df
  .filter("column.nested_array is null")
  .select(col("other_column"), lit("") as "element")

df1.unionAll(df2)



> Add support for LATERAL VIEW OUTER explode()
> 
>
> Key: SPARK-13721
> URL: https://issues.apache.org/jira/browse/SPARK-13721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Ian Hellstrom
>
> Hive supports the [LATERAL VIEW 
> OUTER|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews]
>  syntax to make sure that when an array is empty, the content from the outer 
> table is still returned. 
> Within Spark, this is currently only possible within the HiveContext and 
> executing HiveQL statements. It would be nice if the standard explode() 
> DataFrame method allows the same. A possible signature would be: 
> {code:scala}
> explode[A, B](inputColumn: String, outputColumn: String, outer: Boolean = 
> false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17313) Support spark-shell on cluster mode

2016-08-30 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450194#comment-15450194
 ] 

Ewan Leith commented on SPARK-17313:


I think Apache Zeppelin and Spark Notebook both cover this requirement better 
than the Spark shell ever will? The installation requirements for either are 
fairly minimal and give you all sorts of additional benefits over the raw shell.

> Support spark-shell on cluster mode
> ---
>
> Key: SPARK-17313
> URL: https://issues.apache.org/jira/browse/SPARK-17313
> Project: Spark
>  Issue Type: New Feature
>Reporter: Mahmoud Elgamal
>
> The main issue with the current spark shell is that the driver is running on 
> the user machine. If the driver resource requirement is beyond user machine 
> capacity, then spark shell will be useless. If we are to add the cluster 
> mode(Yarn or Mesos ) for spark shell via some sort of proxy where user 
> machine only hosts a rest client to the running driver at the cluster, the 
> shell will be more powerful



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17099) Incorrect result when HAVING clause is added to group by query

2016-08-17 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15424277#comment-15424277
 ] 

Ewan Leith commented on SPARK-17099:


I've done a quick test in Spark 1.6.1 and this produces the expected 4 rows 
with the same output above, so this appears to be a regression.

> Incorrect result when HAVING clause is added to group by query
> --
>
> Key: SPARK-17099
> URL: https://issues.apache.org/jira/browse/SPARK-17099
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Critical
> Fix For: 2.1.0
>
>
> Random query generation uncovered the following query which returns incorrect 
> results when run on Spark SQL. This wasn't the original query uncovered by 
> the generator, since I performed a bit of minimization to try to make it more 
> understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
> (-769, -244),
> (-800, -409),
> (940, 86),
> (-507, 304),
> (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>  ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>  COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, 
> t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four 
> rows. However, if I omit the {{HAVING}} clause I see that the group's rows 
> are being incorrectly filtered by the {{HAVING}} clause:
> {code}
> +-------------------------------------+--------------------------------------+
> | sum(coalesce(int_col_5, int_col_2)) | (coalesce(int_col_5, int_col_2) * 2) |
> +-------------------------------------+--------------------------------------+
> | -507                                | -1014                                |
> | 940                                 | 1880                                 |
> | -769                                | -1538                                |
> | -367                                | -734                                 |
> | -800                                | -1600                                |
> +-------------------------------------+--------------------------------------+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four 
> rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm 
> opening this bug to get help in triaging further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: How to resolve the SparkExecption : Size exceeds Integer.MAX_VALUE

2016-08-15 Thread Ewan Leith
I think this is more suited to the user mailing list than the dev one, but this 
almost always means you need to repartition your data into smaller partitions 
as one of the partitions is over 2GB.

When you create your dataset, put something like .repartition(1000) at the end 
of the command creating the initial dataframe or dataset.
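
For example, something roughly like this (the path, options and Spark 2.0 style 
reader here are just placeholder assumptions, not from the original mail):

val training = spark.read
  .option("header", "true")
  .csv("/data/kaggle/train.csv")   // placeholder path
  .repartition(1000)               // spread the rows across many smaller partitions, each well under 2GB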

Ewan

On 15 Aug 2016 17:46, Minudika Malshan  wrote:
Hi all,

I am trying to create and train a model for a Kaggle competition dataset using 
Apache Spark. The dataset has more than 10 million rows of data.
But when training the model, I get an exception "Size exceeds 
Integer.MAX_VALUE".

I found the same question has been raised on Stack Overflow but those answers 
didn't help much.

It would be great if you could help resolve this issue.

Thanks.
Minudika




Re: zip for pyspark

2016-08-08 Thread Ewan Leith
If you build a normal Python egg file with the dependencies, you can submit it 
just like you would a .py file, with --py-files.

Thanks,
Ewan

On 8 Aug 2016 3:44 p.m., pseudo oduesp  wrote:
Hi,
How can I export a whole PySpark project as a zip from my local session to the 
cluster and deploy it with spark-submit? I mean, I have a large project with 
all its dependencies and I want to create a zip containing all of the 
dependencies and deploy it on the cluster.



Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Ewan Leith
Looking at the encoders api documentation at

http://spark.apache.org/docs/latest/api/java/

== Java ==
Encoders are specified by calling static methods on Encoders.

List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> ds = context.createDataset(data, Encoders.STRING());

I think you should be calling

.as((Encoders.STRING(), Encoders.STRING()))

or similar
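
For what it's worth, here's a rough sketch in Scala of the same idea using an 
explicit tuple encoder rather than a custom Encoder implementation (the path and 
column names are just placeholders, not from the original question):

import org.apache.spark.sql.Encoders

// explicit encoder for a projection of two string columns; in Scala this would
// normally be supplied implicitly by import spark.implicits._
val pairEncoder = Encoders.tuple(Encoders.STRING, Encoders.STRING)

val ds = spark.read
  .option("header", "true")
  .csv("/path/to/file.csv")          // read the whole CSV (no column pruning at read time)
  .select("colA", "colB")            // keep only the columns you need
  .as[(String, String)](pairEncoder) // type the projection without a hand-written encoder

In the Java API the equivalent encoder would come from the same 
Encoders.tuple(Encoders.STRING(), Encoders.STRING()) call.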

Ewan

On 8 Aug 2016 06:10, Aseem Bansal  wrote:
Hi All

Has anyone done this with Java API?

On Fri, Aug 5, 2016 at 5:36 PM, Aseem Bansal wrote:
I need to use a few columns out of a CSV, but as there is no option to read only 
a few columns out of a CSV:
 1. I am reading the whole CSV using SparkSession.csv()
 2. selecting a few of the columns using DataFrame.select()
 3. applying a schema using the .as() function of Dataset. I tried to 
extend org.apache.spark.sql.Encoder as the input for the as function

But I am getting the following exception

Exception in thread "main" java.lang.RuntimeException: Only expression encoders 
are supported today

So my questions are -
1. Is it possible to read few columns instead of whole CSV? I cannot change the 
CSV as that is upstream data
2. How do I apply schema to few columns if I cannot write my encoder?




RE: how to save spark files as parquets efficiently

2016-07-29 Thread Ewan Leith
If you replace the df.write ….

With

df.count()

in your code you’ll see how much time is taken to process the full execution 
plan without the write output.

That code below looks perfectly normal for writing a parquet file, yes; there 
shouldn't be any tuning needed for "normal" performance.

Thanks,
Ewan

From: Sumit Khanna [mailto:sumit.kha...@askme.in]
Sent: 29 July 2016 13:41
To: Gourav Sengupta 
Cc: user 
Subject: Re: how to save spark files as parquets efficiently

Hey Gourav,

Well, I think it is my execution plan that is at fault. Basically df.write, 
being an action, will show up as a Spark job on localhost:4040 and will include 
the time taken for all the umpteen transformations before it, right? All I 
wanted to know is what env/config params are needed for something simple: read 
a dataframe from parquet and save it back as another parquet (meaning vanilla 
load/store, no transformation). Is it good enough to simply read and write in 
the format mentioned in the Spark tutorial docs, i.e.

df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ??

Thanks,

On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta wrote:
Hi,

The default write format in SPARK is parquet. And I have never faced any issues 
writing over a billion records in SPARK. Are you using virtualization by any 
chance or an obsolete hard disk or Intel Celeron may be?
Regards,
Gourav Sengupta

On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna wrote:
Hey,

master=yarn
mode=cluster

spark.executor.memory=8g
spark.rpc.netty.dispatcher.numThreads=2

All the POC is on a single-node cluster. The biggest bottleneck being:

1.8 hrs to save 500k records as a parquet file/dir executing this command:


df.write.format("parquet").mode("overwrite").save(hdfspathTemp)

No doubt, the whole execution plan gets triggered on this write / save action. 
But is it the right command / set of params to save a dataframe?

essentially I am doing an upsert by pulling in data from hdfs and then updating 
it with the delta changes of the current run. But not sure if write itself 
takes that much time or some optimization is needed for upsert. (I have that 
asked as another question altogether).

Thanks,
Sumit





Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-23 Thread Ewan Leith
OK cool, I didn't vote as I've done no real testing myself, and I think the 
window had already closed anyway.

I'm happy to wait for 2.0.1 for our systems.

Thanks,
Ewan

On 23 Jul 2016 07:07, Reynold Xin <r...@databricks.com> wrote:
Ewan not sure if you wanted to explicitly -1 so I didn’t include you in that.

I will document this as a known issue in the release notes. We have other bugs 
that we have fixed since RC5, and we can fix those together in 2.0.1.


On July 22, 2016 at 10:24:32 PM, Ewan Leith 
(ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>) wrote:

I think this new issue in JIRA blocks the release unfortunately?

https://issues.apache.org/jira/browse/SPARK-16664 - Persist call on data frames 
with more than 200 columns is wiping out the data

Otherwise there'll need to be 2.0.1 pretty much right after?

Thanks,
Ewan

On 23 Jul 2016 03:46, Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>> wrote:
+1

2016-07-22 19:32 GMT-07:00 Kousuke Saruta 
<saru...@oss.nttdata.co.jp<mailto:saru...@oss.nttdata.co.jp>>:

+1 (non-binding)

Tested on my cluster with three slave nodes.


On 2016/07/23 10:25, Suresh Thalamati wrote:
+1 (non-binding)

Tested data source api , and jdbc data sources.


On Jul 19, 2016, at 7:35 PM, Reynold Xin 
<r...@databricks.com<mailto:r...@databricks.com>> wrote:

Please vote on releasing the following candidate as Apache Spark version 2.0.0. 
The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.0
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.0-rc5 (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).

This release candidate resolves ~2500 issues: 
https://s.apache.org/spark-2.0.0-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/<http://people.apache.org/%7Epwendell/spark-releases/spark-2.0.0-rc5-bin/>

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1195/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/<http://people.apache.org/%7Epwendell/spark-releases/spark-2.0.0-rc5-docs/>


=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions from 1.x.

==
What justifies a -1 vote for this release?
==
Critical bugs impacting major functionalities.

Bugs already present in 1.x, missing features, or bugs related to new features 
will not necessarily block this release. Note that historically Spark 
documentation has been published on the website separately from the main 
release so we do not need to block the release due to documentation errors 
either.








Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Ewan Leith
I think this new issue in JIRA blocks the release unfortunately?

https://issues.apache.org/jira/browse/SPARK-16664 - Persist call on data frames 
with more than 200 columns is wiping out the data

Otherwise there'll need to be 2.0.1 pretty much right after?

Thanks,
Ewan

On 23 Jul 2016 03:46, Xiao Li  wrote:
+1

2016-07-22 19:32 GMT-07:00 Kousuke Saruta 
>:

+1 (non-binding)

Tested on my cluster with three slave nodes.


On 2016/07/23 10:25, Suresh Thalamati wrote:
+1 (non-binding)

Tested data source api , and jdbc data sources.


On Jul 19, 2016, at 7:35 PM, Reynold Xin 
> wrote:

Please vote on releasing the following candidate as Apache Spark version 2.0.0. 
The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.0
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.0-rc5 (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).

This release candidate resolves ~2500 issues: 
https://s.apache.org/spark-2.0.0-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1195/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/


=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions from 1.x.

==
What justifies a -1 vote for this release?
==
Critical bugs impacting major functionalities.

Bugs already present in 1.x, missing features, or bugs related to new features 
will not necessarily block this release. Note that historically Spark 
documentation has been published on the website separately from the main 
release so we do not need to block the release due to documentation errors 
either.







RE: Role-based S3 access outside of EMR

2016-07-21 Thread Ewan Leith
If you use S3A rather than S3N, it supports IAM roles.

I think you can have s3a handle s3:// style URLs, so it's consistent with your 
EMR paths, by adding this to your Hadoop config, probably in core-site.xml:

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A

And make sure the s3a jars are in your classpath
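
If editing core-site.xml isn't convenient, a sketch of the same settings applied 
at runtime on the SparkContext's Hadoop configuration (assuming an existing 
SparkContext sc; this is an illustration, not from the original mail):

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A")
hadoopConf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
// no access keys needed here: with an IAM role attached, s3a picks up the instance credentials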

Thanks,
Ewan

From: Everett Anderson [mailto:ever...@nuna.com.INVALID]
Sent: 21 July 2016 17:01
To: Gourav Sengupta 
Cc: Teng Qiu ; Andy Davidson 
; user 
Subject: Re: Role-based S3 access outside of EMR

Hey,

FWIW, we are using EMR, actually, in production.

The main case I have for wanting to access S3 with Spark outside of EMR is that 
during development, our developers tend to run EC2 sandbox instances that have 
all the rest of our code and access to some of the input data on S3. It'd be 
nice if S3 access "just worked" on these without storing the access keys in an 
exposed manner.

Teng -- when you say you use EMRFS, does that mean you copied AWS's EMRFS JAR 
from an EMR cluster and are using it outside? My impression is that AWS hasn't 
released the EMRFS implementation as part of the aws-java-sdk, so I'm wary of 
using it. Do you know if it's supported?


On Thu, Jul 21, 2016 at 2:32 AM, Gourav Sengupta 
> wrote:
Hi Teng,
This is totally a flashing news for me, that people cannot use EMR in 
production because its not open sourced, I think that even Werner is not aware 
of such a problem. Is EMRFS opensourced? I am curious to know what does HA 
stand for?
Regards,
Gourav

On Thu, Jul 21, 2016 at 8:37 AM, Teng Qiu 
> wrote:
there are several reasons that AWS users do (can) not use EMR, one
point for us is that security compliance problem, EMR is totally not
open sourced, we can not use it in production system. second is that
EMR do not support HA yet.

but to the original question from @Everett :

-> Credentials and Hadoop Configuration

as you said, best practice should be "rely on machine roles", they
called IAM roles.

we are using EMRFS impl for accessing s3, it supports IAM role-based
access control well. you can take a look here:
https://github.com/zalando/spark/tree/branch-1.6-zalando

or simply use our docker image (Dockerfile on github:
https://github.com/zalando/spark-appliance/tree/master/Dockerfile)

docker run -d --net=host \
   -e START_MASTER="true" \
   -e START_WORKER="true" \
   -e START_WEBAPP="true" \
   -e START_NOTEBOOK="true" \
   
registry.opensource.zalan.do/bi/spark:1.6.2-6


-> SDK and File System Dependencies

as mentioned above, using EMRFS libs solved this problem:
http://docs.aws.amazon.com//ElasticMapReduce/latest/ReleaseGuide/emr-fs.html


2016-07-21 8:37 GMT+02:00 Gourav Sengupta 
>:
> But that would mean you would be accessing data over internet increasing
> data read latency, data transmission failures. Why are you not using EMR?
>
> Regards,
> Gourav
>
> On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson 
> >
> wrote:
>>
>> Thanks, Andy.
>>
>> I am indeed often doing something similar, now -- copying data locally
>> rather than dealing with the S3 impl selection and AWS credentials issues.
>> It'd be nice if it worked a little easier out of the box, though!
>>
>>
>> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson
>> > wrote:
>>>
>>> Hi Everett
>>>
>>> I always do my initial data exploration and all our product development
>>> in my local dev env. I typically select a small data set and copy it to my
>>> local machine
>>>
>>> My main() has an optional command line argument ‘- - runLocal’ Normally I
>>> load data from either hdfs:/// or S3n:// . If the arg is set I read from
>>> file:///
>>>
>>> Sometime I use a CLI arg ‘- -dataFileURL’
>>>
>>> So in your case I would log into my data cluster and use “AWS s3 cp" to
>>> copy the data into my cluster and then use “SCP” to copy the data from the
>>> data center back to my local env.
>>>
>>> Andy
>>>
>>> From: Everett Anderson 
>>> >
>>> Date: Tuesday, July 19, 2016 at 2:30 PM
>>> To: "user @spark" >
>>> Subject: Role-based S3 access outside of EMR
>>>
>>> Hi,
>>>
>>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
>>> FileSystem implementation for s3:// URLs and 

RE: is dataframe.write() async? Streaming performance problem

2016-07-08 Thread Ewan Leith
Writing (or reading) small files from Spark to S3 can be seriously slow.

You'll get much higher throughput by doing a df.foreachPartition(partition => 
...) and, inside each partition, creating an AWS S3 client, then doing a 
partition.foreach and uploading the files using that S3 client with its own 
thread pool.

As long as you create the s3 client inside the foreachPartition, and close it 
after the partition.foreach(...) is done, you shouldn't have any issues.

Something roughly like this from the DStream docs:

  df.foreachPartition { partitionOfRecords =>
val connection = createNewConnection()
partitionOfRecords.foreach(record => connection.send(record))
connection.close()
  }
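
For the S3 case, that might look roughly like the sketch below. This is only an 
illustration: the bucket and key names are placeholders, it assumes the AWS Java 
SDK (v1) is on the classpath, and you'd want the partition id in the key to keep 
object names unique.

  df.toJSON.foreachPartition { partition =>
    // one client per partition, created on the executor rather than serialised from the driver
    val s3 = new com.amazonaws.services.s3.AmazonS3Client()   // uses the default credentials chain
    partition.zipWithIndex.foreach { case (json, i) =>
      val bytes = json.getBytes("UTF-8")
      val meta = new com.amazonaws.services.s3.model.ObjectMetadata()
      meta.setContentLength(bytes.length)
      // placeholder bucket/key naming
      s3.putObject("my-bucket", s"output/record-$i.json", new java.io.ByteArrayInputStream(bytes), meta)
    }
    s3.shutdown()
  }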

Hope this helps,
Ewan

-Original Message-
From: Cody Koeninger [mailto:c...@koeninger.org] 
Sent: 08 July 2016 15:31
To: Andy Davidson 
Cc: user @spark 
Subject: Re: is dataframe.write() async? Streaming performance problem

Maybe obvious, but what happens when you change the s3 write to a println of 
all the data?  That should identify whether it's the issue.

count() and read.json() will involve additional tasks (run through the items in 
the rdd to count them, likewise to infer the schema) but for
300 records that shouldn't be much of an issue.

On Thu, Jul 7, 2016 at 3:59 PM, Andy Davidson  
wrote:
> I am running Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0 and using 
> kafka direct stream approach. I am running into performance problems. 
> My processing time is > than my window size. Changing window sizes, 
> adding cores and executor memory does not change performance. I am 
> having a lot of trouble identifying the problem by at the metrics 
> provided for streaming apps in the spark application web UI.
>
> I think my performance problem has to with writing the data to S3.
>
> My app receives very complicated JSON. My program is simple, It sorts 
> the data into a small set of sets and writes each set as a separate S3 object.
> The mini batch data has at most 300 events so I do not think shuffle 
> is an issue.
>
> DataFrame rawDF = sqlContext.read().json(jsonRDD).cache();
>
> … Explode tagCol …
>
>
> DataFrame rulesDF = activityDF.select(tagCol).distinct();
>
> Row[] rows = rulesDF.select(tagCol).collect();
>
> List<String> tags = new ArrayList<String>(100);
>
> for (Row row : rows) {
>
> Object tag = row.get(0);
>
> tags.add(tag.toString());
>
> }
>
>
> I think the for loop bellow is where the bottle neck is. Is write async() ?
>
>
> If not is there an easy to to vectorize/parallelize this for loop or 
> do I have to create the threads my self?
>
>
> Is creating threads in spark a bad idea?
>
>
>
> for(String tag : tags) {
>
> DataFrame saveDF = 
> activityDF.filter(activityDF.col(tagCol).equalTo(tag));
>
> if (saveDF.count() >= 1) { // I do not think count() is an issue 
> performance is about 34 ms
>
> String dirPath = “s3n://myBucket" + File.separator + date + 
> File.separator + tag + File.separator +  milliSeconds;
>
> saveDF.write().json(dirPath);
>
> }
>
> }
>
>
> Any suggestions would be greatly appreciated
>
>
> Andy
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[jira] [Commented] (SPARK-16363) Spark-submit doesn't work with IAM Roles

2016-07-05 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363249#comment-15363249
 ] 

Ewan Leith commented on SPARK-16363:


I'm not sure this is a major issue, but try running with the filesystem path 
s3a:// as it looks like you're using the legacy Jets3t system, which I'm sure 
doesn't support IAM roles.

> Spark-submit doesn't work with IAM Roles
> 
>
> Key: SPARK-16363
> URL: https://issues.apache.org/jira/browse/SPARK-16363
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
> Environment: Spark Stand-Alone with EC2 instances configured with IAM 
> Roles. 
>Reporter: Ashic Mahtab
>
> When running Spark Stand-alone in EC2 boxes, 
> spark-submit --master spark://master-ip:7077 --class Foo 
> --deploy-mode cluster --verbose s3://bucket/dir/foo/jar
> fails to find the jar even if AWS IAM roles are configured to allow the EC2 
> boxes (that are running Spark master, and workers) access to the file in S3. 
> The exception is provided below. It's asking us to set keys, etc. when the 
> boxes are configured via IAM roles. 
> 16/07/04 11:44:09 ERROR ClientEndpoint: Exception from cluster was: 
> java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
> must be specified as the username or password (respectively) of a s3 URL, or 
> by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
> (respectively).
> java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
> must be specified as the username or password (respectively) of a s3 URL, or 
> by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
> (respectively).
> at 
> org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
> at 
> org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
> at com.sun.proxy.$Proxy5.initialize(Unknown Source)
> at 
> org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:77)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
> at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1686)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:598)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:395)
> at 
> org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
> at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark SQL Nested Array of JSON with empty field

2016-06-05 Thread Ewan Leith
The Spark JSON read is unforgiving of things like missing elements from some 
JSON records, or mixed types.

If you want to pass invalid JSON files through Spark, you're best doing an 
initial parse through the Jackson APIs using a defined schema first. Then you 
can set types like Option[String] where a column is optional, convert the 
validated records back into a new string variable, and read the string as a 
dataframe.
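
A rough sketch of that approach (the case class, field names and jsonLines RDD 
here are illustrative assumptions, and it needs the jackson-module-scala 
dependency):

import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class Person(firstname: String, middlename: Option[String], lastname: String)

// jsonLines is assumed to be an RDD[String] with one JSON record per element
val validated = jsonLines.mapPartitions { lines =>
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
  lines.flatMap { line =>
    try Some(mapper.writeValueAsString(mapper.readValue(line, classOf[Person])))   // re-serialise the clean record
    catch { case _: Exception => None }                                            // drop anything that doesn't parse
  }
}

val df = sqlContext.read.json(validated)   // every remaining record now fits one schema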

Thanks,
Ewan

On 3 Jun 2016 22:03, Jerry Wong  wrote:
Hi,

I've met a problem with an empty field in a nested JSON file with Spark SQL. 
For instance, there are two lines in the JSON file, as follows:

{
"firstname": "Jack",
"lastname": "Nelson",
"address": {
"state": "New York",
"city": "New York"
}
}{
"firstname": "Landy",
"middlename": "Ken",
"lastname": "Yong",
"address": {
"state": "California",
"city": "Los Angles"
}
}

I use Spark SQL to query the files like this:
val row = sqlContext.sql("SELECT firstname, middlename, lastname, 
address.state, address.city FROM jsontable")
This gives me an error for line 1: no "middlename".
How do I handle this case in the SQL?

Many thanks in advance!
Jerry




RE: Timed aggregation in Spark

2016-05-23 Thread Ewan Leith
Rather than opening a connection per record, if you do a DStream foreachRDD at 
the end of a 5-minute batch window

http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

then you can do an rdd.foreachPartition to get the RDD partitions. Open a 
connection to Vertica (or a pool of them) inside that foreachPartition, then do 
a partition.foreach to write each element from that partition to Vertica, 
before finally closing the pool of connections.
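
In rough outline (the stream name and connection helpers below are placeholders 
rather than a real Vertica API):

aggregatedStream.foreachRDD { rdd =>                           // fires once per batch window
  rdd.foreachPartition { partition =>
    val conn = createVerticaConnection()                       // placeholder: e.g. a JDBC connection or pool
    partition.foreach(record => writeToVertica(conn, record))  // placeholder write helper
    conn.close()
  }
}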

Hope this helps,
Ewan

From: Nikhil Goyal [mailto:nownik...@gmail.com]
Sent: 23 May 2016 21:55
To: Ofir Kerker 
Cc: user@spark.apache.org
Subject: Re: Timed aggregation in Spark

I don't think this is solving the problem. So here are the issues:
1) How do we push entire data to vertica. Opening a connection per record will 
be too costly
2) If a key doesn't come again, how do we push this to vertica
3) How do we schedule the dumping of data to avoid loading too much data in 
state.



On Mon, May 23, 2016 at 1:33 PM, Ofir Kerker 
> wrote:
Yes, check out mapWithState:
https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html

_
From: Nikhil Goyal >
Sent: Monday, May 23, 2016 23:28
Subject: Timed aggregation in Spark
To: >


Hi all,

I want to aggregate my data for 5-10 min and then flush the aggregated data to 
some database like vertica. updateStateByKey is not exactly helpful in this 
scenario as I can't flush all the records at once, neither can I clear the 
state. I wanted to know if anyone else has faced a similar issue and how did 
they handle it.

Thanks
Nikhil




Spark Streaming - Exception thrown while writing record: BlockAdditionEvent

2016-05-23 Thread Ewan Leith
As we increase the throughput on our Spark streaming application, we're finding 
we hit errors with the WriteAheadLog, with errors like this:

16/05/21 20:42:21 WARN scheduler.ReceivedBlockTracker: Exception thrown while 
writing record: 
BlockAdditionEvent(ReceivedBlockInfo(0,Some(10),None,WriteAheadLogBasedStoreResult(input-0-1463850002991,Some(10),FileBasedWriteAheadLogSegment(hdfs://x.x.x.x:8020/checkpoint/receivedData/0/log-1463863286930-1463863346930,625283,39790
 to the WriteAheadLog.
java.util.concurrent.TimeoutException: Futures timed out after [5000 
milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at 
org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:81)
 at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:232)
 at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:87)
 at 
org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:321)
 at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:500)
 at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
 at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:498)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
16/05/21 20:42:26 WARN scheduler.ReceivedBlockTracker: Exception thrown while 
writing record: 
BlockAdditionEvent(ReceivedBlockInfo(1,Some(10),None,WriteAheadLogBasedStoreResult(input-1-1462971836350,Some(10),FileBasedWriteAheadLogSegment(hdfs://x.x.x.x:8020/checkpoint/receivedData/1/log-1463863313080-1463863373080,455191,60798
 to the WriteAheadLog.

I've found someone else on StackOverflow with the same issue, who's suggested 
increasing the spark.streaming.driver.writeAheadLog.batchingTimeout setting, 
but we're not actually seeing significant performance issues on HDFS when the 
issue occurs.

http://stackoverflow.com/questions/34879092/reliability-issues-with-checkpointing-wal-in-spark-streaming-1-6-0
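
For reference, raising that timeout would look something like this when building 
the streaming context (a sketch only; the 5000 ms figure is the timeout visible 
in the stack trace above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-wal-test")                                      // placeholder app name
  .set("spark.streaming.driver.writeAheadLog.batchingTimeout", "15000")  // ms, up from 5000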

Has anyone else come across this, and any suggested areas we can look at?

Thanks,
Ewan


RE: Spark 1.6.0: substring on df.select

2016-05-12 Thread Ewan Leith
You could use a UDF pretty easily. Something like this should work; the 
lastElement function could be changed to do pretty much any string manipulation 
you want.

import org.apache.spark.sql.functions.udf

def lastElement(input: String) = input.split("/").last

val lastElementUdf = udf(lastElement(_:String))

df.select(lastElementUdf($"col1")).show()

Ewan


From: Bharathi Raja [mailto:raja...@yahoo.com.INVALID]
Sent: 12 May 2016 11:40
To: Raghavendra Pandey ; Bharathi Raja 

Cc: User 
Subject: RE: Spark 1.6.0: substring on df.select

Thanks Raghav.

I have 5+ million records. I feel creating multiple columns is not an optimal way.

Please suggest any other alternate solution.
Can’t we do any string operation in DF.Select?

Regards,
Raja

From: Raghavendra Pandey
Sent: 11 May 2016 09:04 PM
To: Bharathi Raja
Cc: User
Subject: Re: Spark 1.6.0: substring on df.select


You can create a column with count of /.  Then take max of it and create that 
many columns for every row with null fillers.

Raghav
On 11 May 2016 20:37, "Bharathi Raja" 
> wrote:
Hi,

I have a dataframe column col1 with values something like 
“/client/service/version/method”. The number of “/” are not constant.
Could you please help me to extract all methods from the column col1?

In Pig i used SUBSTRING with LAST_INDEX_OF(“/”).

Thanks in advance.
Regards,
Raja



RE: Parse Json in Spark

2016-05-09 Thread Ewan Leith
The simplest way is probably to use the sc.binaryFiles or sc.wholeTextFiles API 
to create an RDD containing the JSON files (maybe need a 
sc.wholeTextFiles(…).map(x => x._2) to drop off the filename column) then do a 
sqlContext.read.json(rddName)

That way, you don’t need to worry about combining lines.
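
For example (the path is a placeholder):

val jsonRdd = sc.wholeTextFiles("hdfs:///data/json/*.json")   // (filename, fileContent) pairs
  .map(x => x._2)                                             // keep just the file contents
val df = sqlContext.read.json(jsonRdd)                        // each whole file parsed as one JSON record
df.printSchema()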

Ewan

From: KhajaAsmath Mohammed [mailto:mdkhajaasm...@gmail.com]
Sent: 08 May 2016 23:20
To: user @spark 
Subject: Parse Json in Spark

Hi,

I am working on parsing JSON in Spark, but most of the information available 
online states that I need to have the entire JSON record on a single line.

In my case, the JSON file is delivered in a complex structure and not on a 
single line. Does anyone know how to process this in Spark?

I used the Jackson jar to process JSON and was able to do it when it is on a 
single line. Any ideas?

Thanks,
Asmath


RE: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ewan Leith
That’s what I thought, it’s a shame!

Thanks Saisai,

Ewan

From: Saisai Shao [mailto:sai.sai.s...@gmail.com]
Sent: 15 March 2016 09:22
To: Ewan Leith <ewan.le...@realitymine.com>
Cc: user <user@spark.apache.org>
Subject: Re: Spark streaming - update configuration while retaining write ahead 
log data?

Currently configuration is a part of checkpoint data, and when recovering from 
failure, Spark Streaming will fetch the configuration from checkpoint data, so 
even if you change the configuration file, recovered Spark Streaming 
application will not use it. So from my understanding currently there's no way 
to handle your situation.

Thanks
Saisai

On Tue, Mar 15, 2016 at 5:12 PM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
Has anyone seen a way of updating the Spark streaming job configuration while 
retaining the existing data in the write ahead log?

e.g. if you’ve launched a job without enough executors and a backlog has built 
up in the WAL, can you increase the number of executors without losing the WAL 
data?

Thanks,
Ewan



Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ewan Leith
Has anyone seen a way of updating the Spark streaming job configuration while 
retaining the existing data in the write ahead log?

e.g. if you've launched a job without enough executors and a backlog has built 
up in the WAL, can you increase the number of executors without losing the WAL 
data?

Thanks,
Ewan


[jira] [Created] (SPARK-13623) Relaxed mode for querying Dataframes, so columns that don't exist or have an incompatible schema return null rather than error

2016-03-02 Thread Ewan Leith (JIRA)
Ewan Leith created SPARK-13623:
--

 Summary: Relaxed mode for querying Dataframes, so columns that 
don't exist or have an incompatible schema return null rather than error
 Key: SPARK-13623
 URL: https://issues.apache.org/jira/browse/SPARK-13623
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Ewan Leith
Priority: Minor


Currently when querying a dataframe, if one record of many from a select 
statement is missing or has an invalid schema, then an error is raised such as:

{{org.apache.spark.sql.AnalysisException: cannot resolve 'data.stuff.onetype' 
due to data type mismatch: argument 2 requires integral type, however, 
'onetype' is of string type.;}}

Ideally, when doing ad-hoc querying of data, there would be an option for a 
relaxed mode where any missing or incompatible records in the selected columns 
are returned as a {{null}} instead of an error being raised for the whole set 
of data.







Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
Thanks, I'll create the JIRA for it. Happy to help contribute to a patch if we 
can; I'm not sure my own Scala skills will be up to it, but perhaps one of my 
colleagues' will :)

Ewan

I don't think that exists right now, but it's definitely a good option to have. 
I myself have run into this issue a few times.

Can you create a JIRA ticket so we can track it? Would be even better if you 
are interested in working on a patch! Thanks.


On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:

Hi Reynold, yes that would be perfect for our use case.

I assume it doesn't exist though, otherwise I really need to go re-read the 
docs!

Thanks to both of you for replying by the way, I know you must be hugely busy.

Ewan

Are you looking for a "relaxed" mode that simply returns nulls for fields that 
don't exist or have an incompatible schema?


On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:

Thanks Michael, it's not a great example really, as the data I'm working with 
has some source files that do fit the schema, and some that don't (out of 
millions that do work, perhaps 10 might not).

In an ideal world for us the select would probably return the valid records 
only.

We're trying out the new dataset APIs to see if we can do some pre-filtering 
that way.

Thanks,
Ewan

-dev +user

StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
 StructField(name,StringType,true)),true),true), 
StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
 StructField(id,LongType,true)),true),true)),true),true)),true),true))

It's not a great error message, but as the schema above shows, stuff is an 
array, not a struct.  So, you need to pick a particular element (using []) 
before you can pull out a specific field.  It would be easier to see this if 
you ran sqlContext.read.json(s1Rdd).printSchema(), which gives you a tree view. 
 Try the following.

sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")

On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
When you create a dataframe using the sqlContext.read.schema() API, if you pass 
in a schema that’s compatible with some of the records, but incompatible with 
others, it seems you can’t do a .select on the problematic columns, instead you 
get an AnalysisException error.

I know loading the wrong data isn’t good behaviour, but if you’re reading data 
from (for example) JSON files, there’s going to be malformed files along the 
way. I think it would be nice to handle this error in a nicer way, though I 
don’t know the best way to approach it.

Before I raise a JIRA ticket about it, would people consider this to be a bug 
or expected behaviour?

I’ve attached a couple of sample JSON files and the steps below to reproduce 
it, by taking the inferred schema from the simple1.json file, and applying it 
to a union of simple1.json and simple2.json. You can visually see the data has 
been parsed as I think you’d want if you do a .select on the parent column and 
print out the output, but when you do a select on the problem column you 
instead get an exception.

scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)
s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at 
:27

scala> val s1schema = sqlContext.read.json(s1Rdd).schema
s1schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
 StructField(name,StringType,true)),true),true), 
StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
 StructField(id,LongType,true)),true),true)),true),true)),true),true))

scala> 
sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)
[WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don Joeh]),null], 
[null,WrappedArray([ACME,2])]))]
[WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], 
[WrappedArray([2,null]),null]))]

scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")
org.apache.spark.sql.AnalysisException: cannot resolve 'data.stuff[onetype]' 
due to data type mismatch: argument 2 requires integral type, however, 
'onetype' is of string type.;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at 
org.apache.spark.sql.catalys

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
Hi Reynold, yes that would be perfect for our use case.

I assume it doesn't exist though, otherwise I really need to go re-read the 
docs!

Thanks to both of you for replying by the way, I know you must be hugely busy.

Ewan

Are you looking for a "relaxed" mode that simply returns nulls for fields that 
don't exist or have an incompatible schema?


On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:

Thanks Michael, it's not a great example really, as the data I'm working with 
has some source files that do fit the schema, and some that don't (out of 
millions that do work, perhaps 10 might not).

In an ideal world for us the select would probably return the valid records 
only.

We're trying out the new dataset APIs to see if we can do some pre-filtering 
that way.

Thanks,
Ewan

-dev +user

StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
 StructField(name,StringType,true)),true),true), 
StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
 StructField(id,LongType,true)),true),true)),true),true)),true),true))

It's not a great error message, but as the schema above shows, stuff is an 
array, not a struct.  So, you need to pick a particular element (using []) 
before you can pull out a specific field.  It would be easier to see this if 
you ran sqlContext.read.json(s1Rdd).printSchema(), which gives you a tree view. 
 Try the following.

sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")

On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
When you create a dataframe using the sqlContext.read.schema() API, if you pass 
in a schema that’s compatible with some of the records, but incompatible with 
others, it seems you can’t do a .select on the problematic columns, instead you 
get an AnalysisException error.

I know loading the wrong data isn’t good behaviour, but if you’re reading data 
from (for example) JSON files, there’s going to be malformed files along the 
way. I think it would be nice to handle this error in a nicer way, though I 
don’t know the best way to approach it.

Before I raise a JIRA ticket about it, would people consider this to be a bug 
or expected behaviour?

I’ve attached a couple of sample JSON files and the steps below to reproduce 
it, by taking the inferred schema from the simple1.json file, and applying it 
to a union of simple1.json and simple2.json. You can visually see the data has 
been parsed as I think you’d want if you do a .select on the parent column and 
print out the output, but when you do a select on the problem column you 
instead get an exception.

scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)
s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at 
:27

scala> val s1schema = sqlContext.read.json(s1Rdd).schema
s1schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
 StructField(name,StringType,true)),true),true), 
StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
 StructField(id,LongType,true)),true),true)),true),true)),true),true))

scala> 
sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)
[WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don Joeh]),null], 
[null,WrappedArray([ACME,2])]))]
[WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], 
[WrappedArray([2,null]),null]))]

scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")
org.apache.spark.sql.AnalysisException: cannot resolve 'data.stuff[onetype]' 
due to data type mismatch: argument 2 requires integral type, however, 
'onetype' is of string type.;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)

(The full exception is attached too).

What do people think, is this a bug?

Thanks,
Ewan



Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Ewan Leith
The Apache Commons library will let you access files on an SFTP server via a 
Java library, no local file handling involved

https://commons.apache.org/proper/commons-vfs/filesystems.html
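
As a rough sketch, reading on the driver (so only really sensible for files 
that comfortably fit in driver memory; the host, credentials and path below are 
illustrative):

import java.util.zip.GZIPInputStream
import org.apache.commons.vfs2.VFS
import scala.io.Source

val remote = VFS.getManager().resolveFile("sftp://user:password@sftp.example.com/data/cars.csv.gz")
val lines = Source.fromInputStream(new GZIPInputStream(remote.getContent.getInputStream)).getLines().toList
remote.close()
val rdd = sc.parallelize(lines)   // spark-csv (or a manual split) can build a dataframe from here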

Hope this helps,
Ewan

I wonder if anyone has opened a SFTP connection to open a remote GZIP CSV file? 
I am able to download the file first locally using the SFTP Client in the 
spark-sftp package. Then, I load the file into a dataframe using the spark-csv 
package, which automatically decompresses the file. I just want to remove the 
"downloading file to local" step and directly have the remote file 
decompressed, read, and loaded. Can someone give me any hints?

Thanks,
Ben






Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
Thanks Michael, it's not a great example really, as the data I'm working with 
has some source files that do fit the schema, and some that don't (out of 
millions that do work, perhaps 10 might not).

In an ideal world for us the select would probably return the valid records 
only.

We're trying out the new dataset APIs to see if we can do some pre-filtering 
that way.

Thanks,
Ewan

-dev +user

StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
 StructField(name,StringType,true)),true),true), 
StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
 StructField(id,LongType,true)),true),true)),true),true)),true),true))

It's not a great error message, but as the schema above shows, stuff is an 
array, not a struct.  So, you need to pick a particular element (using []) 
before you can pull out a specific field.  It would be easier to see this if 
you ran sqlContext.read.json(s1Rdd).printSchema(), which gives you a tree view. 
 Try the following.

sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff[0].onetype")

On Wed, Mar 2, 2016 at 1:44 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
When you create a dataframe using the sqlContext.read.schema() API, if you pass 
in a schema that's compatible with some of the records, but incompatible with 
others, it seems you can't do a .select on the problematic columns, instead you 
get an AnalysisException error.

I know loading the wrong data isn't good behaviour, but if you're reading data 
from (for example) JSON files, there's going to be malformed files along the 
way. I think it would be nice to handle this error in a nicer way, though I 
don't know the best way to approach it.

Before I raise a JIRA ticket about it, would people consider this to be a bug 
or expected behaviour?

I've attached a couple of sample JSON files and the steps below to reproduce 
it, by taking the inferred schema from the simple1.json file, and applying it 
to a union of simple1.json and simple2.json. You can visually see the data has 
been parsed as I think you'd want if you do a .select on the parent column and 
print out the output, but when you do a select on the problem column you 
instead get an exception.

scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)
s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at 
:27

scala> val s1schema = sqlContext.read.json(s1Rdd).schema
s1schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
 StructField(name,StringType,true)),true),true), 
StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
 StructField(id,LongType,true)),true),true)),true),true)),true),true))

scala> 
sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)
[WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don Joeh]),null], 
[null,WrappedArray([ACME,2])]))]
[WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], 
[WrappedArray([2,null]),null]))]

scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")
org.apache.spark.sql.AnalysisException: cannot resolve 'data.stuff[onetype]' 
due to data type mismatch: argument 2 requires integral type, however, 
'onetype' is of string type.;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)

(The full exception is attached too).

What do people think, is this a bug?

Thanks,
Ewan





Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Ewan Leith
When you create a dataframe using the sqlContext.read.schema() API, if you pass 
in a schema that's compatible with some of the records, but incompatible with 
others, it seems you can't do a .select on the problematic columns, instead you 
get an AnalysisException error.

I know loading the wrong data isn't good behaviour, but if you're reading data 
from (for example) JSON files, there's going to be malformed files along the 
way. I think it would be nice to handle this error in a nicer way, though I 
don't know the best way to approach it.

Before I raise a JIRA ticket about it, would people consider this to be a bug 
or expected behaviour?

I've attached a couple of sample JSON files and the steps below to reproduce 
it, by taking the inferred schema from the simple1.json file, and applying it 
to a union of simple1.json and simple2.json. You can visually see the data has 
been parsed as I think you'd want if you do a .select on the parent column and 
print out the output, but when you do a select on the problem column you 
instead get an exception.

scala> val s1Rdd = sc.wholeTextFiles("/tmp/simple1.json").map(x => x._2)
s1Rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[171] at map at 
:27

scala> val s1schema = sqlContext.read.json(s1Rdd).schema
s1schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(data,ArrayType(StructType(StructField(stuff,ArrayType(StructType(StructField(onetype,ArrayType(StructType(StructField(id,LongType,true),
 StructField(name,StringType,true)),true),true), 
StructField(othertype,ArrayType(StructType(StructField(company,StringType,true),
 StructField(id,LongType,true)),true),true)),true),true)),true),true))

scala> 
sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff").take(2).foreach(println)
[WrappedArray(WrappedArray([WrappedArray([1,John Doe], [2,Don Joeh]),null], 
[null,WrappedArray([ACME,2])]))]
[WrappedArray(WrappedArray([null,WrappedArray([null,1], [null,2])], 
[WrappedArray([2,null]),null]))]

scala> sqlContext.read.schema(s1schema).json(s2Rdd).select("data.stuff.onetype")
org.apache.spark.sql.AnalysisException: cannot resolve 'data.stuff[onetype]' 
due to data type mismatch: argument 2 requires integral type, however, 
'onetype' is of string type.;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)

(The full exception is attached too).

What do people think, is this a bug?

Thanks,
Ewan


simple1.json
Description: simple1.json


simple2.json
Description: simple2.json
org.apache.spark.sql.AnalysisException: cannot resolve 'data.stuff[onetype]' 
due to data type mismatch: argument 2 requires integral type, however, 
'onetype' is of string type.;
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at 

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2016-01-26 Thread Ewan Leith
I’ve just tried running this using a normal stdin redirect:

~/spark/bin/spark-shell < simple.scala

Which worked: it started spark-shell, executed the script, then stopped the 
shell.
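
Any self-contained script should work here; as a sketch, something like this 
quick word count would do (the input path is just an example), and the shell 
exits on its own when it reaches the end of stdin:

val counts = sc.textFile("/tmp/input.txt").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.take(10).foreach(println)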

Thanks,
Ewan

From: Iulian Dragoș [mailto:iulian.dra...@typesafe.com]
Sent: 26 January 2016 15:00
To: fernandrez1987 
Cc: user 
Subject: Re: how to correctly run scala script using spark-shell through stdin 
(spark v1.0.0)


I don’t see -i in the output of spark-shell --help. Moreover, in master I get 
an error:

$ bin/spark-shell -i test.scala

bad option: '-i'

iulian
​

On Tue, Jan 26, 2016 at 3:47 PM, fernandrez1987 
> wrote:
spark-shell -i file.scala is not working for me in Spark 1.6.0. Was this
removed, or is there something else I have to take into account? The script
does not get run at all. What could be happening?








--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-correctly-run-scala-script-using-spark-shell-through-stdin-spark-v1-0-0-tp12972p26071.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




--

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com



RE: Write to S3 with server side encryption in KMS mode

2016-01-26 Thread Ewan Leith
Hi Nisrina, I’m not aware of any support for KMS keys in s3n, s3a or the EMR 
specific EMRFS s3 driver.

If you’re using EMRFS with Amazon’s EMR, you can use KMS keys with client-side 
encryption

http://docs.aws.amazon.com/kms/latest/developerguide/services-emr.html#emrfs-encrypt

If this has changed, I’d love to know, but I’m pretty sure it hasn’t.

The alternative is to write to HDFS, then copy the data across in bulk.
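
Roughly (the path is illustrative):

// Write the job output to HDFS first...
df.write.parquet("hdfs:///tmp/job-output")
// ...then copy it to S3 in bulk with a tool that supports SSE-KMS,
// for example the AWS CLI's "aws s3 cp --sse aws:kms" options.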

Thanks,
Ewan



From: Nisrina Luthfiyati [mailto:nisrina.luthfiy...@gmail.com]
Sent: 26 January 2016 10:42
To: user 
Subject: Write to S3 with server side encryption in KMS mode

Hi all,

I'm trying to save a spark application output to a bucket in S3. The data is 
supposed to be encrypted with S3's server side encryption using KMS mode, which 
typically (using java api/cli) would require us to pass the sse-kms key when 
writing the data. I currently have not found a way to do this using spark 
hadoop config. Would anyone have any idea how this can be done or whether this 
is possible?

Thanks,
Nisrina.



RE: Spark 1.6.1

2016-01-25 Thread Ewan Leith
Hi Brandon,

It's relatively straightforward to try out different type options for this in 
the spark-shell; try pasting the attached code into spark-shell before you make 
a normal Postgres JDBC connection.

You can then experiment with the mappings without recompiling Spark or anything 
like that, and you can embed the same code in your own packages, outside of the 
main Spark releases.
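
For reference, once the attached dialect is registered, a read along these 
lines should pick it up (the connection details here are just illustrative):

val df = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql://localhost/mydb?user=postgres&password=postgres",
  "dbtable" -> "mytable")).load()
df.printSchema()   // ArrayType columns should now come through the registered dialect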

Thanks,
Ewan

-Original Message-
From: BrandonBradley [mailto:bradleytas...@gmail.com] 
Sent: 22 January 2016 14:29
To: dev@spark.apache.org
Subject: Re: Spark 1.6.1

I'd like more complete Postgres JDBC support for ArrayType before the next 
release. Some of them are still broken in 1.6.0. It would save me much time.

Please see SPARK-12747 @ https://issues.apache.org/jira/browse/SPARK-12747

Cheers!
Brandon Bradley



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-6-1-tp16009p16082.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


import java.sql.{Connection, Types}

import org.apache.spark.sql.types._
import org.apache.spark.sql.jdbc
import org.apache.spark.sql.jdbc.JdbcDialects
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.jdbc.JdbcType

def toCatalystType(typeName: String): Option[DataType] = typeName match {
case "bool" => Some(BooleanType)
case "bit" => Some(BinaryType)
case "int2" => Some(ShortType)
case "int4" => Some(IntegerType)
case "int8" | "oid" => Some(LongType)
case "float4" => Some(FloatType)
case "money" | "float8" => Some(DoubleType)
case "text" | "varchar" | "char" | "cidr" | "inet" | "json" | "jsonb" | 
"uuid" =>
  Some(StringType)
case "bytea" => Some(BinaryType)
case "timestamp" | "timestamptz" | "time" | "timetz" => Some(TimestampType)
case "date" => Some(DateType)
case "numeric" => Some(DecimalType.SYSTEM_DEFAULT)
case _ => None
}

case object PostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = 
url.startsWith("jdbc:postgresql")

  override def getCatalystType(
  sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): 
Option[DataType] = {
if (sqlType == Types.BIT && typeName.equals("bit") && size != 1) {
  Some(BinaryType)
} else if (sqlType == Types.OTHER) {
  toCatalystType(typeName).filter(_ == StringType)
} else if (sqlType == Types.ARRAY && typeName.length > 1 && typeName(0) == 
'_') {
  toCatalystType(typeName.drop(1)).map(ArrayType(_))
} else None
  }
}

JdbcDialects.registerDialect(PostgresDialect)

[jira] [Comment Edited] (SPARK-12764) XML Column type is not supported

2016-01-12 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093836#comment-15093836
 ] 

Ewan Leith edited comment on SPARK-12764 at 1/12/16 12:53 PM:
--

What are you expecting it to do, output the XML as a string, or something else?

I doubt this will work, but you might try adding this code before the initial 
references to JDBC:

case object PostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = 
url.startsWith("jdbc:postgresql")
  override def getCatalystType(
sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): 
Option[DataType] = {
if (typeName.contains("xml")) {
Some(StringType)
} else None
  }
}

JdbcDialects.registerDialect(PostgresDialect)



was (Author: ewanleith):
What are you expecting it to do, output the XML as a string, or something else?

> XML Column type is not supported
> 
>
> Key: SPARK-12764
> URL: https://issues.apache.org/jira/browse/SPARK-12764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
> Environment: Mac Os X El Capitan
>Reporter: Rajeshwar Gaini
>
> Hi All,
> I am using PostgreSQL database. I am using the following jdbc call to access 
> a customer table (customer_id int, event text, country text, content xml) in 
> my database.
> {code}
> val dataframe1 = sqlContext.load("jdbc", Map("url" -> 
> "jdbc:postgresql://localhost/customerlogs?user=postgres=postgres", 
> "dbtable" -> "customer"))
> {code}
> When i run above command in spark-shell i receive the following error.
> {code}
> java.sql.SQLException: Unsupported type 
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>   at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:34)
>   at $iwC$$iwC$$iwC$$iwC.(:36)
>   at $iwC$$iwC$$iwC.(:38)
>   at $iwC$$iwC.(:40)
>   at $iwC.(:42)
>   at (:44)
>   at .(:48)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>   a

[jira] [Commented] (SPARK-12764) XML Column type is not supported

2016-01-12 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093836#comment-15093836
 ] 

Ewan Leith commented on SPARK-12764:


What are you expecting it to do, output the XML as a string, or something else?

> XML Column type is not supported
> 
>
> Key: SPARK-12764
> URL: https://issues.apache.org/jira/browse/SPARK-12764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
> Environment: Mac Os X El Capitan
>Reporter: Rajeshwar Gaini
>
> Hi All,
> I am using PostgreSQL database. I am using the following jdbc call to access 
> a customer table (customer_id int, event text, country text, content xml) in 
> my database.
> {code}
> val dataframe1 = sqlContext.load("jdbc", Map("url" -> 
> "jdbc:postgresql://localhost/customerlogs?user=postgres=postgres", 
> "dbtable" -> "customer"))
> {code}
> When i run above command in spark-shell i receive the following error.
> {code}
> java.sql.SQLException: Unsupported type 
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>   at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:34)
>   at $iwC$$iwC$$iwC$$iwC.(:36)
>   at $iwC$$iwC$$iwC.(:38)
>   at $iwC$$iwC.(:40)
>   at $iwC.(:42)
>   at (:44)
>   at .(:48)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main

[jira] [Comment Edited] (SPARK-12764) XML Column type is not supported

2016-01-12 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093836#comment-15093836
 ] 

Ewan Leith edited comment on SPARK-12764 at 1/12/16 12:57 PM:
--

What are you expecting it to do, output the XML as a string, or something else?

I doubt this will work, but you might try adding this code before the initial 
references to JDBC:
{noformat}
import org.apache.spark.sql.jdbc.JdbcDialects
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types._

case object PostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = 
url.startsWith("jdbc:postgresql")
  override def getCatalystType(
sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): 
Option[DataType] = {
if (typeName.contains("xml")) {
Some(StringType)
} else None
  }
}

JdbcDialects.registerDialect(PostgresDialect)
{noformat}


was (Author: ewanleith):
What are you expecting it to do, output the XML as a string, or something else?

I doubt this will work, but you might try adding this code before the initial 
references to JDBC:

case object PostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = 
url.startsWith("jdbc:postgresql")
  override def getCatalystType(
sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): 
Option[DataType] = {
if (typeName.contains("xml")) {
Some(StringType)
} else None
  }
}

JdbcDialects.registerDialect(PostgresDialect)


> XML Column type is not supported
> 
>
> Key: SPARK-12764
> URL: https://issues.apache.org/jira/browse/SPARK-12764
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
> Environment: Mac Os X El Capitan
>Reporter: Rajeshwar Gaini
>
> Hi All,
> I am using PostgreSQL database. I am using the following jdbc call to access 
> a customer table (customer_id int, event text, country text, content xml) in 
> my database.
> {code}
> val dataframe1 = sqlContext.load("jdbc", Map("url" -> 
> "jdbc:postgresql://localhost/customerlogs?user=postgres=postgres", 
> "dbtable" -> "customer"))
> {code}
> When i run above command in spark-shell i receive the following error.
> {code}
> java.sql.SQLException: Unsupported type 
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>   at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:34)
>   at $iwC$$iwC$$iwC$$iwC.(:36)
>   at $iwC$$iwC$$iwC.(:38)
>   at $iwC$$iwC.(:40)
>   at $iwC.(:42)
>   at (:44)
>   at .(:48)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apa

RE: Out of memory issue

2016-01-06 Thread Ewan Leith
Hi Muthu, this could be related to a known issue in the release notes

http://spark.apache.org/releases/spark-release-1-6-0.html

Known issues

SPARK-12546 -  Save DataFrame/table as Parquet with dynamic partitions may 
cause OOM; this can be worked around by decreasing the memory used by both 
Spark and Parquet using spark.memory.fraction (for example, 0.4) and 
parquet.memory.pool.ratio (for example, 0.3, in Hadoop configuration, e.g. 
setting it in core-site.xml).

It's definitely worth setting spark.memory.fraction and 
parquet.memory.pool.ratio and trying again.
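
For example, something along these lines (untested, and the values are just the 
examples given in the release notes):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().set("spark.memory.fraction", "0.4")
val sc = new SparkContext(conf)
// parquet.memory.pool.ratio is a Hadoop-side setting; the release notes suggest
// core-site.xml, but setting it on the job's Hadoop configuration may also work:
sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.3")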

Ewan

-Original Message-
From: babloo80 [mailto:bablo...@gmail.com] 
Sent: 06 January 2016 03:44
To: user@spark.apache.org
Subject: Out of memory issue

Hello there,

I have a spark job that reads 7 parquet files (8 GB, 3 x 16 GB, 3 x 14 GB) in 
different stages of execution and creates a result parquet of 9 GB (about 27 
million rows containing 165 columns; some columns are map based, containing at 
most 200-value histograms). The stages involve:
Step 1: Read the data using the dataframe API.
Step 2: Transform the dataframe to an RDD (as some of the columns are 
transformed into histograms, using an empirical distribution to cap the number 
of keys, and some of them run like a UDAF during the reduce-by-key step) and 
perform some transformations.
Step 3: Reduce the result by key so that it can be used in the next stage for a 
join.
Step 4: Perform a left outer join of this result, which runs through similar 
Steps 1 thru 3.
Step 5: The results are further reduced and written to parquet.

With Apache Spark 1.5.2, I am able to run the job with no issues.
Current env uses 8 nodes running a total of  320 cores, 100 GB executor memory 
per node with driver program using 32 GB. The approximate execution time is 
about 1.2 hrs. The parquet files are stored in another HDFS cluster for read 
and eventual write of the result.

When the same job is executed using Apache Spark 1.6.0, some of the executor 
nodes' JVMs get restarted (with a new executor id). On turning on GC stats on 
the executors, the perm-gen seems to get maxed out and ends up showing the 
symptoms of an out-of-memory error.

Please advise on where to start investigating this issue.

Thanks,
Muthu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-issue-tp25888.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: How to accelerate reading json file?

2016-01-06 Thread Ewan Leith
If you already know the schema, then you can run the read with the schema 
parameter like this:


import org.apache.spark.sql.types._

val path = "examples/src/main/resources/jsonfile"

val jsonSchema = StructType(
  StructField("id", StringType, true) ::
  StructField("reference", LongType, true) ::
  StructField("details", detailsSchema, true) ::  // detailsSchema stands in for whatever nested StructType your "details" field needs
  StructField("value", StringType, true) :: Nil)

val people = sqlContext.read.schema(jsonSchema).json(path)

If you have the schema defined as a separate small JSON file, then you can load 
it by running something like this line to load it directly:

val jsonSchema = sqlContext.read.json("path/to/schema").schema

Thanks,
Ewan

From: Gavin Yue [mailto:yue.yuany...@gmail.com]
Sent: 06 January 2016 07:14
To: user 
Subject: How to accelerate reading json file?

I am trying to read json files following the example:

val path = "examples/src/main/resources/jsonfile"

val people = sqlContext.read.json(path)

I have 1 Tb size files in the path.  It took 1.2 hours to finish the reading to 
infer the schema.

But I already know the schema. Could I make this process short?

Thanks a lot.





Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
Hi all,

I'm sure this must have been solved already, but I can't see anything obvious.

Using Spark Streaming, I'm trying to execute a transform function on a DStream 
at short batch intervals (e.g. 1 second), but only write the resulting data to 
disk using saveAsTextFiles in a larger batch after a longer delay (say 60 
seconds).

I thought the ReceiverInputDStream window function might be a good help here, 
but instead, applying it to a transformed DStream causes the transform function 
to only execute at the end of the window too.

Has anyone got a solution to this?

Thanks,
Ewan





RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
Yeah, it's awkward; the transforms being done are fairly time sensitive, so I 
don't want them to wait 60 seconds or more.

I might have to move the code from a transform into a custom receiver instead, 
so they'll be processed outside the window length. A buffered writer is a good 
idea too, thanks.
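
If we do end up keeping the transform plus window approach, the rough pattern I 
have in mind is something like this (an untested sketch; the stream, function 
and path names are made up):

import org.apache.spark.streaming.Seconds

val transformed = inputStream.map(timeSensitiveTransform) // per-record work we want done every 1s batch
transformed.persist()                                     // keep each small batch so the window can reuse it
transformed.foreachRDD(rdd => rdd.foreach(_ => ()))       // no-op output forces the transform to run each batch
transformed.window(Seconds(60), Seconds(60)).saveAsTextFiles("hdfs:///tmp/streaming-output") // write once a minute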

Thanks,
Ewan

From: Ashic Mahtab [mailto:as...@live.com]
Sent: 31 December 2015 13:50
To: Ewan Leith <ewan.le...@realitymine.com>; Apache Spark 
<user@spark.apache.org>
Subject: RE: Batch together RDDs for Streaming output, without delaying 
execution of map or transform functions

Hi Ewan,
Transforms are definitions of what needs to be done - they don't execute until 
an action is triggered. For what you want, I think you might need to have an 
action that writes out RDDs to some sort of buffered writer.

-Ashic.

From: ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Batch together RDDs for Streaming output, without delaying execution 
of map or transform functions
Date: Thu, 31 Dec 2015 11:35:37 +
Hi all,

I'm sure this must have been solved already, but I can't see anything obvious.

Using Spark Streaming, I'm trying to execute a transform function on a DStream 
at short batch intervals (e.g. 1 second), but only write the resulting data to 
disk using saveAsTextFiles in a larger batch after a longer delay (say 60 
seconds).

I thought the ReceiverInputDStream window function might be a good help here, 
but instead, applying it to a transformed DStream causes the transform function 
to only execute at the end of the window too.

Has anyone got a solution to this?

Thanks,
Ewan





[jira] [Commented] (SPARK-11948) Permanent UDF not work

2015-12-14 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056155#comment-15056155
 ] 

Ewan Leith commented on SPARK-11948:


I think this is a duplicate of SPARK-11609 ?

> Permanent UDF not work
> --
>
> Key: SPARK-11948
> URL: https://issues.apache.org/jira/browse/SPARK-11948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Weizhong
>Priority: Minor
>
> We create a function,
> {noformat}
> add jar /home/test/smartcare-udf-0.0.1-SNAPSHOT.jar;
> create function arr_greater_equal as 
> 'smartcare.dac.hive.udf.UDFArrayGreaterEqual';
> {noformat}
>  but "show functions" don't display, and when we create the same function 
> again, it throw exception as below:
> {noformat}
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.FunctionTask. 
> AlreadyExistsException(message:Function arr_greater_equal already exists) 
> (state=,code=0)
> {noformat}
> But if we use this function, it throw exception as below:
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: undefined function 
> arr_greater_equal; line 1 pos 119 (state=,code=0)
> {noformat}






RE: Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Ewan Leith
How big do you expect the file to be? Spark has issues with single blocks over 
2GB (see https://issues.apache.org/jira/browse/SPARK-1476 and 
https://issues.apache.org/jira/browse/SPARK-6235 for example)

If you don’t know, try running

df.repartition(100).write.format…

to get an idea of how big it would be; I assume it’s over 2 GB.
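
As a sketch, reusing the spark-csv options from your snippet (the bucket name 
is illustrative):

df.repartition(100).write.format("com.databricks.spark.csv").option("header", "true").save("s3://my-bucket/newcars.csv")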

From: Zhang, Jingyu [mailto:jingyu.zh...@news.com.au]
Sent: 16 November 2015 10:17
To: user 
Subject: Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1


I am using spark-csv to save files in S3, and it shows a "Size exceeds 
Integer.MAX_VALUE" error. Please let me know how to fix it. Thanks.

df.write()

.format("com.databricks.spark.csv")

.option("header", "true")

.save("s3://newcars.csv");

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:860)

at 
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)

at 
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)

at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)

at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)

at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)

at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:511)

at 
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:429)

at org.apache.spark.storage.BlockManager.get(BlockManager.scala:617)

at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)

at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)

at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:70)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)




RE: Dataframe nested schema inference from Json without type conflicts

2015-10-23 Thread Ewan Leith
Hi all,

It’s taken us a while, but one of my colleagues has made the pull request on 
github for our proposed solution to this,

https://issues.apache.org/jira/browse/SPARK-10947
https://github.com/apache/spark/pull/9249

It adds a parameter to the Json read otpions to force all primitives as a 
String type:

val jsonDf = sqlContext.read.option("primitivesAsString", 
"true").json(sampleJsonFile)

scala> jsonDf.printSchema()
root
|-- bigInteger: string (nullable = true)
|-- boolean: string (nullable = true)
|-- double: string (nullable = true)
|-- integer: string (nullable = true)
|-- long: string (nullable = true)
|-- null: string (nullable = true)
|-- string: string (nullable = true)

Thanks,
Ewan

From: Yin Huai [mailto:yh...@databricks.com]
Sent: 01 October 2015 23:54
To: Ewan Leith <ewan.le...@realitymine.com>
Cc: r...@databricks.com; dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts

Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:

We could, but if a client sends some unexpected records in the schema (which 
happens more than I'd like, as our schema seems to constantly evolve), it's 
fantastic how Spark picks up on that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.



Thanks,

Ewan



-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we’re using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it’ll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we’d rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don’t think there’s currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan




RE: Spark Streaming - use the data in different jobs

2015-10-19 Thread Ewan Leith
Storing the data in HBase, Cassandra, or similar is possibly the right answer; 
the other option that can work well is re-publishing the data back into a 
second queue on RabbitMQ, to be read again by the next job.
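
If re-publishing appeals, a rough sketch of the output side might look like 
this (the stream variable, host, queue name and serialisation are illustrative, 
and you'd want proper error handling and connection reuse in practice):

import com.rabbitmq.client.ConnectionFactory

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val factory = new ConnectionFactory()
    factory.setHost("rabbitmq.example.com")
    val connection = factory.newConnection()
    val channel = connection.createChannel()
    channel.queueDeclare("downstream-jobs", true, false, false, null)   // durable queue for the follow-on jobs
    records.foreach(r => channel.basicPublish("", "downstream-jobs", null, r.toString.getBytes("UTF-8")))
    channel.close()
    connection.close()
  }
}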

Thanks,
Ewan

From: Oded Maimon [mailto:o...@scene53.com]
Sent: 18 October 2015 12:49
To: user 
Subject: Spark Streaming - use the data in different jobs

Hi,
we've built a spark streaming process that gets data from a pub/sub (rabbitmq 
in our case).

now we want the streamed data to be used in different spark jobs (also in 
realtime if possible)

what options do we have for doing that ?


  *   can the streaming process and different spark jobs share/access the same 
RDD's?
  *   can the streaming process create a sparkSQL table and other jobs read/use 
it?
  *   can a spark streaming process trigger other spark jobs and send the the 
data (in memory)?
  *   can a spark streaming process cache the data in memory and other 
scheduled jobs access same rdd's?
  *   should we keep the data to hbase and read it from other jobs?
  *   other ways?

I believe that the answer will be using external db/storage..  hoping to have a 
different solution :)

Thanks.


Regards,
Oded Maimon
Scene53.




RE: Should I convert json into parquet?

2015-10-19 Thread Ewan Leith
As Jörn says, Parquet and ORC will get you really good compression and can be 
much faster. There are also some nice additions around predicate pushdown, 
which can be great if you've got wide tables.

Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is 
described here 
http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/
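
The conversion itself is only a couple of lines, something like this (the paths 
are illustrative), and the daily updates can append new Parquet files alongside 
the existing ones:

val events = sqlContext.read.json("s3://my-bucket/events/2015-10-17/")
events.write.mode("append").parquet("s3://my-bucket/events-parquet/")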

Thanks,
Ewan

-Original Message-
From: Jörn Franke [mailto:jornfra...@gmail.com] 
Sent: 19 October 2015 06:32
To: Gavin Yue 
Cc: user 
Subject: Re: Should I convert json into parquet?



Good formats are Parquet or ORC. Both can be used with compression, such as 
Snappy. They are much faster than JSON. However, the table structure is up to 
you and depends on your use case.

> On 17 Oct 2015, at 23:07, Gavin Yue  wrote:
> 
> I have json files which contains timestamped events.  Each event associate 
> with a user id. 
> 
> Now I want to group by user id. So converts from
> 
> Event1 -> UserIDA;
> Event2 -> UserIDA;
> Event3 -> UserIDB;
> 
> To intermediate storage. 
> UserIDA -> (Event1, Event2...)
> UserIDB-> (Event3...)
> 
> Then I will label positives and featurize the Events Vector in many different 
> ways, fit each of them into the Logistic Regression. 
> 
> I want to save intermediate storage permanently since it will be used many 
> times.  And there will new events coming every day. So I need to update this 
> intermediate storage every day. 
> 
> Right now I store intermediate storage using Json files.  Should I use 
> Parquet instead?  Or is there better solutions for this use case?
> 
> Thanks a lot !
> 
> 
> 
> 
> 
> 




[jira] [Created] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

2015-10-06 Thread Ewan Leith (JIRA)
Ewan Leith created SPARK-10947:
--

 Summary: With schema inference from JSON into a Dataframe, add 
option to infer all primitive object types as strings
 Key: SPARK-10947
 URL: https://issues.apache.org/jira/browse/SPARK-10947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.1
Reporter: Ewan Leith
Priority: Minor


Currently, when a schema is inferred from a JSON file using 
sqlContext.read.json, the primitive object types are inferred as string, long, 
boolean, etc.

However, if the inferred type is too specific (JSON obviously does not enforce 
types itself), this causes issues with merging dataframe schemas.

Instead, we would like an option in the JSON inferField function to treat all 
primitive objects as strings.

We'll create and submit a pull request for this for review.
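
For illustration only (the final option name will be settled in the pull
request), the intended usage is along these lines:

// hypothetical option name: treat every primitive JSON value as a string
val df = sqlContext.read
  .option("primitivesAsString", "true")
  .json("s3n://bucket/incoming/*.json")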



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




Re: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
Thanks Yin, I'll put together a JIRA and a PR tomorrow.


Ewan


-- Original message--

From: Yin Huai

Date: Mon, 5 Oct 2015 17:39

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


Hello Ewan,

Adding a JSON-specific option makes sense. Can you open a JIRA for this? Also, 
sending out a PR will be great. For JSONRelation, I think we can pass all 
user-specific options to it (see 
org.apache.spark.sql.execution.datasources.json.DefaultSource's createRelation) 
just like what we do for ParquetRelation. Then, inside JSONRelation, we figure 
out what kind of options that have been specified.

Thanks,

Yin

On Mon, Oct 5, 2015 at 9:04 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
I've done some digging today and, as a quick and ugly fix, altering the case 
statement of the JSON inferField function in InferSchema.scala

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala

to have

case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE | 
VALUE_FALSE => StringType

rather than the rules for each type works as we'd want.

If we were to wrap this up in a configuration setting in JSONRelation like the 
samplingRatio setting, with the default being to behave as it currently works, 
does anyone think a pull request would plausibly get into the Spark main 
codebase?

Thanks,
Ewan



From: Ewan Leith 
[mailto:ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>]
Sent: 02 October 2015 01:57
To: yh...@databricks.com<mailto:yh...@databricks.com>

Cc: r...@databricks.com<mailto:r...@databricks.com>; 
dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Dataframe nested schema inference from Json without type conflicts


Exactly, that's a much better way to put it.



Thanks,

Ewan



-- Original message--

From: Yin Huai

Date: Thu, 1 Oct 2015 23:54

To: Ewan Leith;

Cc: 
r...@databricks.com;dev@spark.apache.org<mailto:r...@databricks.com;dev@spark.apache.org>;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:

We could, but if a client sends some unexpected records in the schema (which 
happens more than I'd like, our schema seems to constantly evolve), it's 
fantastic how Spark picks up on that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.



Thanks,

Ewan



-- Original message------

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan





RE: Dataframe nested schema inference from Json without type conflicts

2015-10-05 Thread Ewan Leith
I've done some digging today and, as a quick and ugly fix, altering the case 
statement of the JSON inferField function in InferSchema.scala

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala

to have

case VALUE_STRING | VALUE_NUMBER_INT | VALUE_NUMBER_FLOAT | VALUE_TRUE | 
VALUE_FALSE => StringType

rather than the rules for each type works as we'd want.

If we were to wrap this up in a configuration setting in JSONRelation like the 
samplingRatio setting, with the default being to behave as it currently works, 
does anyone think a pull request would plausibly get into the Spark main 
codebase?

Thanks,
Ewan



From: Ewan Leith [mailto:ewan.le...@realitymine.com]
Sent: 02 October 2015 01:57
To: yh...@databricks.com
Cc: r...@databricks.com; dev@spark.apache.org
Subject: Re: Dataframe nested schema inference from Json without type conflicts


Exactly, that's a much better way to put it.



Thanks,

Ewan



-- Original message--

From: Yin Huai

Date: Thu, 1 Oct 2015 23:54

To: Ewan Leith;

Cc: 
r...@databricks.com;dev@spark.apache.org<mailto:r...@databricks.com;dev@spark.apache.org>;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:

We could, but if a client sends some unexpected records in the schema (which 
happens more than I'd like, our schema seems to constantly evolve), it's 
fantastic how Spark picks up on that data and includes it.



Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.



Thanks,

Ewan



-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan




Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan


Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
We could, but if a client sends some unexpected records in the schema (which 
happens more than I'd like, our schema seems to constantly evolve), it's 
fantastic how Spark picks up on that data and includes it.


Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.


Thanks,

Ewan


-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan



Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Ewan Leith
Exactly, that's a much better way to put it.


Thanks,

Ewan


-- Original message--

From: Yin Huai

Date: Thu, 1 Oct 2015 23:54

To: Ewan Leith;

Cc: r...@databricks.com;dev@spark.apache.org;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


Hi Ewan,

For your use case, you only need the schema inference to pick up the structure 
of your data (basically you want spark sql to infer the type of complex values 
like arrays and structs but keep the type of primitive values as strings), 
right?

Thanks,

Yin

On Thu, Oct 1, 2015 at 2:27 PM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:

We could, but if a client sends some unexpected records in the schema (which 
happens more than I'd like, our schema seems to constantly evolve), it's 
fantastic how Spark picks up on that data and includes it.


Passing in a fixed schema loses that nice additional ability, though it's what 
we'll probably have to adopt if we can't come up with a way to keep the 
inference working.


Thanks,

Ewan


-- Original message--

From: Reynold Xin

Date: Thu, 1 Oct 2015 22:12

To: Ewan Leith;

Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>;

Subject:Re: Dataframe nested schema inference from Json without type conflicts


You can pass the schema into json directly, can't you?

On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
Hi all,

We really like the ability to infer a schema from JSON contained in an RDD, but 
when we're using Spark Streaming on small batches of data, we sometimes find 
that Spark infers a more specific type than it should use, for example if the 
json in that small batch only contains integer values for a String field, it'll 
class the field as an Integer type on one Streaming batch, then a String on the 
next one.

Instead, we'd rather match every value as a String type, then handle any 
casting to a desired type later in the process.

I don't think there's currently any simple way to avoid this that I can see, 
but we could add the functionality in the JacksonParser.scala file, probably in 
convertField.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Does anyone know an easier and cleaner way to do this?

Thanks,
Ewan




RE: Need for advice - performance improvement and out of memory resolution

2015-09-30 Thread Ewan Leith
Try reducing the number of workers to 2, and increasing their memory up to 6GB.

However I've seen mention of a bug in the pyspark API for when calling head() 
on a dataframe in spark 1.5.0 and 1.4, it's got a big performance hit.

https://issues.apache.org/jira/browse/SPARK-10731

It's fixed in spark 1.5.1 which was released yesterday, so maybe try upgrading.

Ewan


-Original Message-
From: camelia [mailto:came...@chalmers.se] 
Sent: 30 September 2015 10:51
To: user@spark.apache.org
Subject: Need for advice - performance improvement and out of memory resolution

Hello,

I am working on a machine learning project, currently using
spark-1.4.1-bin-hadoop2.6 in local mode on a laptop (Ubuntu 14.04 OS running on 
a Dell laptop with i7-5600@2.6 GHz * 4 cores, 15.6 GB RAM). I also mention 
working in Python from an IPython notebook.
 

I face the following problem: when working with a Dataframe created from a CSV 
file (2.7 GB) with schema inferred (1900 features), the time it takes for Spark 
to count the 145231 rows is 3:30 minutes using 4 cores. Longer times are 
recorded for computing one feature's statistics, for example:


START AT: 2015-09-21
08:56:41.136947
 
+---+--+
|summary|  VAR_1933|
+---+--+
|  count|145231|
|   mean| 8849.839111484464|
| stddev|3175.7863998269395|
|min| 0|
|max|  |
+---+--+

 
FINISH AT: 2015-09-21
09:02:49.452260



So, my first question would be what configuration parameters to set in order to 
improve this performance?

I tried some explicit configuration in the IPython notebook, but specifying 
resources explicitly when creating the Spark configuration resulted in worse 
performance; I mean :

config = SparkConf().setAppName("cleankaggle").setMaster("local[4]").set("spark.jars", jar_path)

ran about twice as fast as:

config = SparkConf().setAppName("cleankaggle").setMaster("local[4]").set("spark.jars", jar_path).set("spark.driver.memory", "2g").set("spark.python.worker.memory", "3g")





Secondly, when I do the one hot encoding (I tried two different ways of keeping 
results) I don't arrive at showing the head(1) of the resulted dataframe. We 
have the function :

def OHE_transform(categ_feat, df_old):
    outputcolname = categ_feat + "_ohe_index"
    outputcolvect = categ_feat + "_ohe_vector"
    stringIndexer = StringIndexer(inputCol=categ_feat, outputCol=outputcolname)
    indexed = stringIndexer.fit(df_old).transform(df_old)
    encoder = OneHotEncoder(inputCol=outputcolname, outputCol=outputcolvect)
    encoded = encoder.transform(indexed)
    return encoded


The two manners for keeping results are depicted below:

1)

result = OHE_transform(list_top_feat[0], df_top_categ)
for item in list_top_feat[1:]:
    result = OHE_transform(item, result)
result.head(1)


2)

df_result = OHE_transform("VAR_A", df_top_categ)
df_result_1 = OHE_transform("VAR_B", df_result)
df_result_2 = OHE_transform("VAR_C", df_result_1) ...
df_result_12 = OHE_transform("VAR_X", df_result_11)
df_result_12.head(1)

In the first approach, at the third iteration (in the for loop), when it was 
supposed to print the head(1), the IPython notebook  remained in the state 
"Kernel busy" for several hours and then I interrupted the kernel.
The second approach managed to go through all transformations (please note that 
here I eliminated the intermediary prints of the head(1)), but it gave an "out 
of memory" error at the only (final result) head(1),  that I paste below :

===

 

df_result_12.head(1)

---
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 df_result_12.head(1)

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in 
head(self, n)
649 rs = self.head(1)
650 return rs[0] if rs else None
--> 651 return self.take(n)
652 
653 @ignore_unicode_prefix

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in 
take(self, num)
305 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
306 """
--> 307 return self.limit(num).collect()
308 
309 @ignore_unicode_prefix

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in
collect(self)
279 """
280 with SCCallSiteSync(self._sc) as css:
--> 281 port =
self._sc._jvm.PythonRDD.collectAndServe(self._jdf.javaToPython().rdd())
282 rs = list(_load_from_socket(port,
BatchedSerializer(PickleSerializer(
283 cls = _create_cls(self.schema)

/home/camelia/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
in __call__(self, 

RE: Converting a DStream to schemaRDD

2015-09-29 Thread Ewan Leith
Something like:

dstream.foreachRDD { rdd =>
  val df =  sqlContext.read.json(rdd)
  df.select(…)
}

https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams


Might be the place to start, it’ll convert each batch of dstream into an RDD 
then let you work it as if it were a standard RDD dataset.

Ewan


From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
Sent: 29 September 2015 15:03
To: user 
Subject: Converting a DStream to schemaRDD

Hi,
I have a DStream which is a stream of RDD[String].

How can I pass a DStream to sqlContext.jsonRDD and work with it as a DF ?

Thank you.
Daniel



SQLContext.read().json() inferred schema - force type to strings?

2015-09-25 Thread Ewan Leith
Hi all,

We're uising SQLContext.read.json to read in a stream of JSON datasets, but 
sometimes the inferred schema contains for the same value a LongType, and 
sometimes a DoubleType.

This obviously causes problems with merging the schema, so does anyone know a 
way of forcing the inferred schema to always mark every type as a StringType, 
then we can handle the type checking ourselves?

I know we could use a specified schema, but that loses some of the flexibility 
we're getting at the moment.
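
To be concrete, the fixed-schema fallback we're trying to avoid looks roughly
like this (field names here are just placeholders for our real structure):

import org.apache.spark.sql.types._

// every leaf forced to StringType, so LongType vs DoubleType conflicts can't arise
val schema = StructType(Seq(
  StructField("userid", StringType, nullable = true),
  StructField("value", StringType, nullable = true)
))

val df = sqlContext.read.schema(schema).json("s3n://bucket/incoming/*.json")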

Thanks,
Ewan


Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-19 Thread Ewan Leith
yarn-client still runs the executor tasks on the cluster; the main difference 
is where the driver process runs.


Thanks,

Ewan


-- Original message--

From: shahab

Date: Fri, 18 Sep 2015 13:11

To: Aniket Bhatnagar;

Cc: user@spark.apache.org;

Subject:Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected 
yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not 
supported directly by SparkContext. Please use spark-submit.


It works using yarn-client but I want to make it running on cluster. Is there 
any way to do so?

best,
/Shahab

On Fri, Sep 18, 2015 at 12:54 PM, Aniket Bhatnagar 
> wrote:

Can you try yarn-client mode?

On Fri, Sep 18, 2015, 3:38 PM shahab 
> wrote:
Hi,

Probably I have wrong zeppelin  configuration, because I get the following 
error when I execute spark statements in Zeppelin:

org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running 
on a cluster. Deployment to YARN is not supported directly by SparkContext. 
Please use spark-submit.


Anyone knows What's the solution to this?

best,
/Shahab



Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-10 Thread Ewan Leith
The last time I checked, if you launch EMR 4 with only Spark selected as an 
application, HDFS isn't correctly installed.


Did you select another application like Hive at launch time as well as Spark? 
If not, try that.


Thanks,

Ewan


-- Original message--

From: Dean Wampler

Date: Wed, 9 Sep 2015 22:29

To: shahab;

Cc: user@spark.apache.org;

Subject:Re: [Spark on Amazon EMR] : File does not exist: 
hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar


If you log into the cluster, do you see the file if you type:

hdfs dfs -ls 
hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

(with the correct server address for "ipx-x-x-x"). If not, is the server 
address correct and routable inside the cluster. Recall that EC2 instances have 
both public and private host names & IP addresses.

Also, is the port number correct for HDFS in the cluster?

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd 
Edition (O'Reilly)
Typesafe
@deanwampler
http://polyglotprogramming.com

On Wed, Sep 9, 2015 at 9:28 AM, shahab 
> wrote:
Hi,
I am using Spark on Amazon EMR. So far I have not succeeded to submit the 
application successfully, not sure what's problem. In the log file I see the 
followings.
java.io.FileNotFoundException: File does not exist: 
hdfs://ipx-x-x-x:8020/user/hadoop/.sparkStaging/application_123344567_0018/spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

However, even putting spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar in the fat 
jar file didn't solve the problem. I am out of clue now.
I want to submit a spark application, using aws web console, as a step. I 
submit the application as : spark-submit --deploy-mode cluster --class 
mypack.MyMainClass --master yarn-cluster s3://mybucket/MySparkApp.jar Is there 
any one who has similar problem with EMR?

best,
/Shahab



RE: NOT IN in Spark SQL

2015-09-04 Thread Ewan Leith
Spark SQL doesn’t support “NOT IN”, but I think HiveQL does, so give using the 
HiveContext a try rather than SQLContext. Here’s the spark 1.2 docs on it, but 
it’s basically identical to running the SQLContext

https://spark.apache.org/docs/1.2.0/sql-programming-guide.html#tab_scala_6
https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/sql/hive/HiveContext.html
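
A minimal sketch of switching over (whether the NOT IN subquery itself parses
will depend on the Hive version your Spark build bundles, so the LEFT OUTER
JOIN rewrite quoted below is the safer bet):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// NOT IN via HiveQL, if the bundled Hive version supports it
val notIn = hiveContext.sql("SELECT id FROM A WHERE id NOT IN (SELECT id FROM B)")

// equivalent LEFT OUTER JOIN rewrite
val viaJoin = hiveContext.sql("SELECT A.id FROM A LEFT OUTER JOIN B ON (B.id = A.id) WHERE B.id IS NULL")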

Thanks,
Ewan

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 04 September 2015 13:12
To: Pietro Gentile 
Cc: user@spark.apache.org
Subject: Re: NOT IN in Spark SQL

I think spark doesn't support NOT IN clauses, but you can do the same with a 
LEFT OUTER JOIN, Something like:

SELECT A.id FROM A LEFT OUTER JOIN B ON (B.id = A.id) WHERE B.id IS null

Thanks
Best Regards

On Thu, Sep 3, 2015 at 8:46 PM, Pietro Gentile 
>
 wrote:
Hi all,

How can I do to use the "NOT IN" clause in Spark SQL 1.2 ??

He continues to give me syntax errors. But the question is correct in SQL.

Thanks in advance,
Best regards,

Pietro.



spark-csv package - output to filename.csv?

2015-09-03 Thread Ewan Leith
Using the spark-csv package or outputting to text files, you end up with files 
named:

test.csv/part-00

rather than a more user-friendly "test.csv", even if there's only 1 part file.

We can merge the files using the Hadoop merge command with something like this 
code from http://deploymentzone.com/2015/01/30/spark-and-merged-csv-files/


import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

def merge(sc: SparkContext, srcPath: String, dstPath: String): Unit = {
  val srcFileSystem = FileSystem.get(new URI(srcPath), sc.hadoopConfiguration)
  val dstFileSystem = FileSystem.get(new URI(dstPath), sc.hadoopConfiguration)
  dstFileSystem.delete(new Path(dstPath), true)
  FileUtil.copyMerge(srcFileSystem, new Path(srcPath), dstFileSystem, new Path(dstPath),
    true, sc.hadoopConfiguration, null)
}

but does anyone know a way without dropping down to Hadoop.fs code?
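
For reference, a typical way of wiring that together would be something like
this (paths and options are placeholders):

df.coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/tmp/test_csv_dir")

merge(sc, "/tmp/test_csv_dir", "/tmp/test.csv")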

Thanks,
Ewan


RE: How to Take the whole file as a partition

2015-09-03 Thread Ewan Leith
Have a look at the sparkContext.binaryFiles, it works like wholeTextFiles but 
returns a PortableDataStream per file. It might be a workable solution though 
you'll need to handle the binary to UTF-8 or equivalent conversion
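
A rough sketch of that route (the path is a placeholder; for multi-GB files you
may prefer to consume the PortableDataStream incrementally rather than call
toArray):

val files = sc.binaryFiles("hdfs:///data/input/*")

// one record per file: (path, contents decoded as UTF-8)
val wholeFiles = files.map { case (path, stream) =>
  (path, new String(stream.toArray(), "UTF-8"))
}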

Thanks,
Ewan

From: Shuai Zheng [mailto:szheng.c...@gmail.com]
Sent: 03 September 2015 15:22
To: user@spark.apache.org
Subject: How to Take the whole file as a partition

Hi All,

I have 1000 files, from 500M to 1-2GB at this moment. I want Spark to read them 
as one partition per file, which means turning the FileSplit behaviour off.

I know there are some solutions, but not very good in my case:
1, I can't use WholeTextFiles method, because my file is too big, I don't want 
to risk the performance.
2, I try to use newAPIHadoopFile and turnoff the file split:

lines = ctx.newAPIHadoopFile(inputPath, NonSplitableTextInputFormat.class,
        LongWritable.class, Text.class, hadoopConf)
    .values()
    .map(new Function<Text, String>() {
        @Override
        public String call(Text arg0) throws Exception {
            return arg0.toString();
        }
    });

This works for some cases, but it truncates some lines (I am not sure why, but 
it looks like there is a limit on this file reading). I have a feeling that 
Spark truncates the file at around 2GB. Anyway it happens (the same data has no 
issue when I use MapReduce to read the input); Spark sometimes truncates very 
big files when it tries to read all of them.

3, Another way is to distribute the file names as the input to Spark and open a 
stream inside the function to read each file directly. This is what I am 
planning to do, but I think it is ugly. Does anyone have a better solution?

BTW: the files are currently in text format, but they might be in Parquet 
format later; that is also a reason I don't like my third option.

Regards,

Shuai


Re: Problem while loading saved data

2015-09-03 Thread Ewan Leith
From that, I'd guess that HDFS isn't set up between the nodes, or for some 
reason writes are defaulting to file:///path/ rather than hdfs:///path/




-- Original message--

From: Amila De Silva

Date: Thu, 3 Sep 2015 17:12

To: Ewan Leith;

Cc: user@spark.apache.org;

Subject:Re: Problem while loading saved data


Hi Ewan,

Yes, 'people.parquet' is from the first attempt and in that attempt it tried to 
save the same people.json.

It seems that the same folder is created on both the nodes and contents of the 
files are distributed between the two servers.

On the master node(this is the same node which runs IPython Notebook) this is 
what I have:

people.parquet
└── _SUCCESS

On the slave I get,
people.parquet
└── _temporary
└── 0
├── task_201509030057_4699_m_00
│   └── part-r-0-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
├── task_201509030057_4699_m_01
│   └── part-r-1-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
└── _temporary

I have zipped and attached both the folders.

On Thu, Sep 3, 2015 at 5:58 PM, Ewan Leith 
<ewan.le...@realitymine.com<mailto:ewan.le...@realitymine.com>> wrote:
Your error log shows you attempting to read from 'people.parquet2' not 
‘people.parquet’ as you’ve put below, is that just from a different attempt?

Otherwise, it’s an odd one! There aren’t _SUCCESS, _common_metadata and 
_metadata files under people.parquet that you’ve listed below, which would 
normally be created when the write completes, can you show us your write output?


Thanks,
Ewan



From: Amila De Silva [mailto:jaa...@gmail.com<mailto:jaa...@gmail.com>]
Sent: 03 September 2015 05:44
To: Guru Medasani <gdm...@gmail.com<mailto:gdm...@gmail.com>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Problem while loading saved data

Hi Guru,

Thanks for the reply.

Yes, I checked if the file exists. But instead of a single file what I found 
was a directory having the following structure.

people.parquet
└── _temporary
└── 0
├── task_201509030057_4699_m_00
│   └── part-r-0-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
├── task_201509030057_4699_m_01
│   └── part-r-1-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
└── _temporary


On Thu, Sep 3, 2015 at 7:13 AM, Guru Medasani 
<gdm...@gmail.com<mailto:gdm...@gmail.com>> wrote:
Hi Amila,

Error says that the ‘people.parquet’ file does not exist. Can you manually 
check to see if that file exists?


Py4JJavaError: An error occurred while calling o53840.parquet.

: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet 
data file or summary file found under file:/home/ubuntu/ipython/people.parquet2.



Guru Medasani
gdm...@gmail.com<mailto:gdm...@gmail.com>



On Sep 2, 2015, at 8:25 PM, Amila De Silva 
<jaa...@gmail.com<mailto:jaa...@gmail.com>> wrote:

Hi All,

I have a two node spark cluster, to which I'm connecting using IPython notebook.
To see how data saving/loading works, I simply created a dataframe using 
people.json using the Code below;

df = sqlContext.read.json("examples/src/main/resources/people.json")

Then called the following to save the dataframe as a parquet.
df.write.save("people.parquet")

Tried loading the saved dataframe using;
df2 = sqlContext.read.parquet('people.parquet');

But this simply fails giving the following exception


---

Py4JJavaError Traceback (most recent call last)

 in ()

> 1 df2 = sqlContext.read.parquet('people.parquet2');



/srv/spark/python/pyspark/sql/readwriter.pyc in parquet(self, *path)

154 [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 
'int')]

155 """

--> 156 return 
self._df(self._jreader.parquet(_to_seq(self._sqlContext._sc, path)))

157

158 @since(1.4)



/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
__call__(self, *args)

536 answer = self.gateway_client.send_command(command)

537 return_value = get_return_value(answer, self.gateway_client,

--> 538 self.target_id, self.name)

539

540 for temp_arg in temp_args:



/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)

298 raise Py4JJavaError(

299 'An error occurred while calling {0}{1}{2}.\n'.

--> 300 format(target_id, '.', name), value)

301 else:

302 raise Py4JError(



Py4JJavaError: An error occurred while calling o53840.parquet.

: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet 
data

RE: Problem while loading saved data

2015-09-03 Thread Ewan Leith
Your error log shows you attempting to read from 'people.parquet2' not 
‘people.parquet’ as you’ve put below, is that just from a different attempt?

Otherwise, it’s an odd one! There aren’t _SUCCESS, _common_metadata and 
_metadata files under people.parquet that you’ve listed below, which would 
normally be created when the write completes, can you show us your write output?


Thanks,
Ewan



From: Amila De Silva [mailto:jaa...@gmail.com]
Sent: 03 September 2015 05:44
To: Guru Medasani 
Cc: user@spark.apache.org
Subject: Re: Problem while loading saved data

Hi Guru,

Thanks for the reply.

Yes, I checked if the file exists. But instead of a single file what I found 
was a directory having the following structure.

people.parquet
└── _temporary
└── 0
├── task_201509030057_4699_m_00
│   └── part-r-0-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
├── task_201509030057_4699_m_01
│   └── part-r-1-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
└── _temporary


On Thu, Sep 3, 2015 at 7:13 AM, Guru Medasani 
> wrote:
Hi Amila,

Error says that the ‘people.parquet’ file does not exist. Can you manually 
check to see if that file exists?


Py4JJavaError: An error occurred while calling o53840.parquet.

: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet 
data file or summary file found under file:/home/ubuntu/ipython/people.parquet2.



Guru Medasani
gdm...@gmail.com



On Sep 2, 2015, at 8:25 PM, Amila De Silva 
> wrote:

Hi All,

I have a two node spark cluster, to which I'm connecting using IPython notebook.
To see how data saving/loading works, I simply created a dataframe using 
people.json using the Code below;

df = sqlContext.read.json("examples/src/main/resources/people.json")

Then called the following to save the dataframe as a parquet.
df.write.save("people.parquet")

Tried loading the saved dataframe using;
df2 = sqlContext.read.parquet('people.parquet');

But this simply fails giving the following exception


---

Py4JJavaError Traceback (most recent call last)

 in ()

> 1 df2 = sqlContext.read.parquet('people.parquet2');



/srv/spark/python/pyspark/sql/readwriter.pyc in parquet(self, *path)

154 [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 
'int')]

155 """

--> 156 return 
self._df(self._jreader.parquet(_to_seq(self._sqlContext._sc, path)))

157

158 @since(1.4)



/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
__call__(self, *args)

536 answer = self.gateway_client.send_command(command)

537 return_value = get_return_value(answer, self.gateway_client,

--> 538 self.target_id, self.name)

539

540 for temp_arg in temp_args:



/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)

298 raise Py4JJavaError(

299 'An error occurred while calling {0}{1}{2}.\n'.

--> 300 format(target_id, '.', name), value)

301 else:

302 raise Py4JError(



Py4JJavaError: An error occurred while calling o53840.parquet.

: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet 
data file or summary file found under file:/home/ubuntu/ipython/people.parquet2.

   at scala.Predef$.assert(Predef.scala:179)

   at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:429)

   at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)

   at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)

   at scala.Option.orElse(Option.scala:257)

   at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)

   at 
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:126)

   at 
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:124)

   at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)

   at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)

   at scala.Option.getOrElse(Option.scala:120)

   at 

RE: spark 1.4.1 saveAsTextFile (and Parquet) is slow on emr-4.0.0

2015-09-03 Thread Ewan Leith
For those who have similar issues on EMR writing Parquet files, if you update 
mapred-site.xml with the following lines:

<property><name>mapred.output.direct.EmrFileSystem</name><value>true</value></property>
<property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
<property><name>parquet.enable.summary-metadata</name><value>false</value></property>
<property><name>spark.sql.parquet.output.committer.class</name><value>org.apache.spark.sql.parquet.DirectParquetOutputCommitter</value></property>

Then you get Parquet files writing direct to S3 without use of temporary files 
too, and the disabled summary-metadata files which can cause a performance hit 
with writing large Parquet datasets on S3

The easiest way to add them across the cluster is via the --configurations flag 
on the “aws emr create-cluster” command

Thanks,
Ewan


From: Alexander Pivovarov [mailto:apivova...@gmail.com]
Sent: 03 September 2015 00:12
To: Neil Jonkers 
Cc: user@spark.apache.org
Subject: Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

Hi Neil

Yes! it helps!!! I do  not see _temporary in console output anymore.  
saveAsTextFile is fast now.
2015-09-02 23:07:00,022 INFO  [task-result-getter-0] scheduler.TaskSetManager 
(Logging.scala:logInfo(59)) - Finished task 18.0 in stage 0.0 (TID 18) in 4398 
ms on ip-10-0-24-103.ec2.internal (1/24)
2015-09-02 23:07:01,887 INFO  [task-result-getter-2] scheduler.TaskSetManager 
(Logging.scala:logInfo(59)) - Finished task 5.0 in stage 0.0 (TID 5) in 6282 ms 
on ip-10-0-26-14.ec2.internal (24/24)
2015-09-02 23:07:01,888 INFO  [dag-scheduler-event-loop] scheduler.DAGScheduler 
(Logging.scala:logInfo(59)) - ResultStage 0 (saveAsTextFile at :22) 
finished in 6.319 s
2015-09-02 23:07:02,123 INFO  [main] s3n.Jets3tNativeFileSystemStore 
(Jets3tNativeFileSystemStore.java:storeFile(141)) - s3.putObject foo-bar 
tmp/test40_141_24_406/_SUCCESS 0

Thank you!

On Wed, Sep 2, 2015 at 12:54 AM, Neil Jonkers 
> wrote:
Hi,
Can you set the following parameters in your mapred-site.xml file please:

<property><name>mapred.output.direct.EmrFileSystem</name><value>true</value></property>
<property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
You can also config this at cluster launch time with the following 
Classification via EMR console:

classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true]


Thank you

On Wed, Sep 2, 2015 at 6:02 AM, Alexander Pivovarov 
> wrote:
I checked previous emr config (emr-3.8)
mapred-site.xml has the following setting
 
<property><name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value></property>


On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov 
> wrote:
Should I use DirectOutputCommitter?
spark.hadoop.mapred.output.committer.class  
com.appsflyer.spark.DirectOutputCommitter



On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov 
> wrote:
I run spark 1.4.1 in amazom aws emr 4.0.0

For some reason spark saveAsTextFile is very slow on emr 4.0.0 in comparison to 
emr 3.8  (was 5 sec, now 95 sec)

Actually saveAsTextFile says that it's done in 4.356 sec but after that I see 
lots of INFO messages with 404 error from com.amazonaws.latency logger for next 
90 sec

spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" + 
"A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")

2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop] scheduler.DAGScheduler 
(Logging.scala:logInfo(59)) - ResultStage 5 (saveAsTextFile at :22) 
finished in 4.356 s
2015-09-01 21:16:17,637 INFO  [task-result-getter-2] cluster.YarnScheduler 
(Logging.scala:logInfo(59)) - Removed TaskSet 5.0, whose tasks have all 
completed, from pool
2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler 
(Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at :22, 
took 4.547829 s
2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem 
(S3NativeFileSystem.java:listStatus(896)) - listStatus 
s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency 
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], 
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found 
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 
3B2F06FD11682D22), S3 Extended Request ID: 
C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ], 
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], 
AWSRequestID=[3B2F06FD11682D22], 
ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, 
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, 
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923], 
HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544], 
RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency 
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], 
ServiceName=[Amazon S3], 

[jira] [Created] (SPARK-10419) Add SQLServer JdbcDialect support for datetimeoffset types

2015-09-02 Thread Ewan Leith (JIRA)
Ewan Leith created SPARK-10419:
--

 Summary: Add SQLServer JdbcDialect support for datetimeoffset types
 Key: SPARK-10419
 URL: https://issues.apache.org/jira/browse/SPARK-10419
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1, 1.5.0
Reporter: Ewan Leith
Priority: Minor


Running JDBC connections against Microsoft SQL Server database tables, when a 
table contains a datetimeoffset column type, the following error is received:

sqlContext.read.jdbc("jdbc:sqlserver://127.0.0.1:1433;DatabaseName=testdb", 
"sampletable", prop)
java.sql.SQLException: Unsupported type -155
at 
org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:137)
at org.apache.spark.sql.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:136)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:200)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:130)

Based on the JdbcDialect code for DB2 and the Microsoft SQL Server 
documentation, we should probably treat datetimeoffset types as Strings 

https://technet.microsoft.com/en-us/library/bb630289%28v=sql.105%29.aspx

We've created a small addition to JdbcDialects.scala to do this conversion, 
I'll create a pull request for it.
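
The change is roughly along these lines (a sketch of the kind of dialect
addition, not the actual patch):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

case object MsSqlServerDialect extends JdbcDialect {

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getCatalystType(sqlType: Int, typeName: String, size: Int,
      md: MetadataBuilder): Option[DataType] = {
    // -155 is the vendor-specific JDBC type code SQL Server uses for datetimeoffset
    if (typeName.contains("datetimeoffset")) Some(StringType) else None
  }
}

JdbcDialects.registerDialect(MsSqlServerDialect)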



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




RE: How to increase the Json parsing speed

2015-08-28 Thread Ewan Leith
Can you post roughly what you’re running as your Spark code? One issue I’ve 
seen before is that passing a directory full of files as a path 
“/path/to/files/” can be slow, while “/path/to/files/*” runs fast.

Also, if you’ve not seen it, have a look at the binaryFiles call

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext

which can be a really handy way of manipulating files without reading them all 
into memory first – it returns a PortableDataStream which you can then handle 
in your java InputStreamReader code
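
On a related note, if the per-record ObjectMapper construction in the Scala
test quoted below is part of the cost, one common pattern is to build a single
mapper per partition with mapPartitions (a sketch, with a placeholder path):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val parsed = sc.textFile("hdfs:///data/json/*").mapPartitions { lines =>
  // one mapper per partition instead of one per record
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  lines.map(line => mapper.readTree(line))
}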

Ewan




From: Gavin Yue [mailto:yue.yuany...@gmail.com]
Sent: 28 August 2015 08:06
To: Sabarish Sasidharan sabarish.sasidha...@manthan.com
Cc: user user@spark.apache.org
Subject: Re: How to increase the Json parsing speed

500 each with 8GB memory.
I did the test again on the cluster.
I have 6000 files, which generates 6000 tasks. Each task takes 1.5 min to 
finish based on the stats.
So theoretically it should take roughly 15 mins. With some additional overhead, 
it takes 18 mins in total.

Based on the local file parsing test, it seems that simply parsing the JSON is 
fast, taking only 7 secs.

So I wonder where the additional 1 min is coming from.
Thanks again.


On Thu, Aug 27, 2015 at 11:44 PM, Sabarish Sasidharan 
sabarish.sasidha...@manthan.commailto:sabarish.sasidha...@manthan.com wrote:
How many executors are you using when using Spark SQL?

On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan 
sabarish.sasidha...@manthan.commailto:sabarish.sasidha...@manthan.com wrote:
I see that you are not reusing the same mapper instance in the Scala snippet.

Regards
Sab

On Fri, Aug 28, 2015 at 9:38 AM, Gavin Yue 
yue.yuany...@gmail.commailto:yue.yuany...@gmail.com wrote:
Just did some tests.
I have 6000 files, each has 14K records with 900Mb file size.  In spark sql, it 
would take one task roughly 1 min to parse.
On the local machine, using the same Jackson lib inside Spark lib. Just parse 
it.

FileInputStream fstream = new FileInputStream(testfile);
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
Long begin = System.currentTimeMillis();
while ((strLine = br.readLine()) != null) {
    JsonNode s = mapper.readTree(strLine);
}
System.out.println(System.currentTimeMillis() - begin);

In JDK8, it took 6270ms.

Same code in Scala, it would take 7486ms:

val begin = java.lang.System.currentTimeMillis()
for (line <- Source.fromFile(testfile).getLines()) {
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  val s = mapper.readTree(line)
}
println(java.lang.System.currentTimeMillis() - begin)

One JSON record contains two fields: ID and List[Event].
I am guessing that putting all the events into the List takes the remaining time.
Any solution to speed this up?
Thanks a lot!


On Thu, Aug 27, 2015 at 7:45 PM, Sabarish Sasidharan 
sabarish.sasidha...@manthan.commailto:sabarish.sasidha...@manthan.com wrote:

For your jsons, can you tell us what is your benchmark when running on a single 
machine using just plain Java (without Spark and Spark sql)?

Regards
Sab
On 28-Aug-2015 7:29 am, Gavin Yue 
yue.yuany...@gmail.commailto:yue.yuany...@gmail.com wrote:
Hey

I am using the Json4s-Jackson parser coming with spark and parsing roughly 80m 
records with totally size 900mb.

But the speed is slow.  It took my 50 nodes(16cores cpu,100gb mem) roughly 
30mins to parse Json to use spark sql.

Jackson has the benchmark saying parsing should be ms level.

Any way to increase speed?

I am using spark 1.4 on Hadoop 2.7 with Java 8.

Thanks a lot !




--

Architect - Big Data
Ph: +91 99805 99458tel:%2B91%2099805%2099458

Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
India ICT)
+++



--

Architect - Big Data
Ph: +91 99805 99458tel:%2B91%2099805%2099458

Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
India ICT)
+++



RE: correct use of DStream foreachRDD

2015-08-28 Thread Ewan Leith
I think what you’ll want is to carry out the .map functions before the 
foreachRDD, something like:

val lines = ssc.textFileStream("/stream").map(Sensor.parseSensor).map(Sensor.convertToPut)

lines.foreachRDD { rdd =>
  // the parsing and conversion to Put objects have already happened in the maps above,
  // so each batch only needs writing out
  rdd.saveAsHadoopDataset(jobConfig)
}

Will perform the bulk of the work in the distributed processing, before the 
results are returned to the driver for writing to HBase.

Thanks,
Ewan

From: Carol McDonald [mailto:cmcdon...@maprtech.com]
Sent: 28 August 2015 15:30
To: user user@spark.apache.org
Subject: correct use of DStream foreachRDD

I would like to make sure that I am using the DStream foreachRDD operation 
correctly. I would like to read from a DStream, transform the input, and write 
to HBase. The code below works, but I became confused when I read "Note that 
the function func is executed in the driver process"?


val lines = ssc.textFileStream("/stream")

lines.foreachRDD { rdd =>
  // parse the line of data into sensor objects
  val sensorRDD = rdd.map(Sensor.parseSensor)

  // convert sensor data to Put objects and write to the HBase table's "data" column family
  new PairRDDFunctions(sensorRDD.map(Sensor.convertToPut))
    .saveAsHadoopDataset(jobConfig)
}


RE: Driver running out of memory - caused by many tasks?

2015-08-27 Thread Ewan Leith
Are you using the Kryo serializer? If not, have a look at it, it can save a lot 
of memory during shuffles

https://spark.apache.org/docs/latest/tuning.html
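
A minimal sketch of turning it on (the registered classes here are just the
key/value classes from your job below):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[KeyClass], classOf[ValueClass]))

val sc = new SparkContext(conf)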

I did a similar task and had various issues with the volume of data being 
parsed in one go, but that helped a lot. The main difference between your job 
and mine is that my input classes were just a string and a byte array, which I 
then processed once they were read into the RDD; maybe your classes are more 
memory heavy?


Thanks,
Ewan

-Original Message-
From: andrew.row...@thomsonreuters.com 
[mailto:andrew.row...@thomsonreuters.com] 
Sent: 27 August 2015 11:53
To: user@spark.apache.org
Subject: Driver running out of memory - caused by many tasks?

I have a spark v.1.4.1 on YARN job where the first stage has ~149,000 tasks 
(it’s reading a few TB of data). The job itself is fairly simple - it’s just 
getting a list of distinct values:

val days = spark
  .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
  .sample(withReplacement = false, fraction = 0.01)
  .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
  .distinct()
  .collect()

The cardinality of the ‘day’ is quite small - there’s only a handful. However, 
I’m frequently running into OutOfMemory issues on the driver. I’ve had it fail 
with 24GB RAM, and am currently nudging it upwards to find out where it works. 
The ratio between input and shuffle write in the distinct stage is about 
3TB:7MB. On a smaller dataset, it works without issue on a smaller (4GB) heap. 
In YARN cluster mode, I get a failure message similar to:

Container 
[pid=36844,containerID=container_e15_1438040390147_4982_01_01] is running 
beyond physical memory limits. Current usage: 27.6 GB of 27 GB physical memory 
used; 29.5 GB of 56.7 GB virtual memory used. Killing container.


Is the driver running out of memory simply due to the number of tasks, or is 
there something about the job program that’s causing it to put a lot of data 
into the driver heap and go oom? If the former, is there any general guidance 
about the amount of memory to give to the driver as a function of how many 
tasks there are?

Andrew


Selecting different levels of nested data records during one select?

2015-08-27 Thread Ewan Leith
Hello,

I'm trying to query a nested data record of the form:

root
|-- userid: string (nullable = true)
|-- datarecords: array (nullable = true)
||-- element: struct (containsNull = true)
|||-- name: string (nullable = true)
|||-- system: boolean (nullable = true)
|||-- time: string (nullable = true)
|||-- title: string (nullable = true)

Where for each userid record, there are many datarecords elements.

I'd like to be able to run the SQL equivalent of:

select userid, name, system, time, title

and get 1 output row per nested row, each one containing the matching userid 
for that row (if that makes sense!).

the explode function seemed like the place to start, but it seems I have to 
call it individually for each nested column, then I end up with a huge number 
of results based on a Cartesian join?

Is anyone able to point me in the right direction?
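
To make the intent concrete, this is the kind of single pass I'm hoping exists 
(a sketch, I haven't confirmed it behaves as described):

import org.apache.spark.sql.functions.explode

val flattened = df
  .select(df("userid"), explode(df("datarecords")).as("rec"))
  .select("userid", "rec.name", "rec.system", "rec.time", "rec.title")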

Thanks,
Ewan




RE: Create column in nested structure?

2015-08-13 Thread Ewan Leith
Never mind me, I've found an email to this list from Raghavendra Pandey which 
got me what I needed

val nestedCol = struct(df("nested2.column1"), df("nested2.column2"), df("flatcolumn"))
val df2 = df.select(df("nested1"), nestedCol as "nested2")

Thanks,
Ewan

From: Ewan Leith
Sent: 13 August 2015 15:44
To: user@spark.apache.org
Subject: Create column in nested structure?

Has anyone used withColumn (or another method) to add a column to an existing 
nested dataframe?

If I call:

df.withColumn("nested.newcolumn", df("oldcolumn"))

then it just creates a new column with a "." in its name, not under the nested 
structure.

Thanks,
Ewan


Create column in nested structure?

2015-08-13 Thread Ewan Leith
Has anyone used withColumn (or another method) to add a column to an existing 
nested dataframe?

If I call:

df.withColumn("nested.newcolumn", df("oldcolumn"))

then it just creates a new column with a "." in its name, not under the nested 
structure.

Thanks,
Ewan


Parquet file organisation for 100GB+ dataframes

2015-08-12 Thread Ewan Leith
Hi all,

Can anyone share their experiences working with storing and organising larger 
datasets with Spark?

I've got a dataframe stored in Parquet on Amazon S3 (using EMRFS) which has a 
fairly complex nested schema (based on JSON files), which I can query in Spark, 
but the initial setup takes a few minutes, as we've got roughly 5000 partitions 
and 150GB of compressed parquet part files.

Generally things work, but we seem to be hitting various limitations now we're 
working with 100+GB of data, such as the 2GB block size limit in Spark which 
means we need a large number of partitions, slow startup due to partition 
discovery, etc.

Storing data in one big dataframe has worked well so far, but do we need to 
look at breaking it out into multiple dataframes?

Has anyone got any advice on how to structure this?

Thanks,
Ewan



RE: Specifying the role when launching an AWS spark cluster using spark_ec2

2015-08-07 Thread Ewan Leith
You'll have a lot less hassle using the AWS EMR instances with Spark 1.4.1 for 
now, until the spark_ec2.py scripts move to Hadoop 2.7.1; at the moment I'm 
pretty sure they're only using Hadoop 2.4.

The EMR setup with Spark lets you use s3:// URIs with IAM roles

Ewan

-Original Message-
From: SK [mailto:skrishna...@gmail.com] 
Sent: 06 August 2015 18:27
To: user@spark.apache.org
Subject: Specifying the role when launching an AWS spark cluster using spark_ec2

Hi,

I need to access data on S3 from another account and I have been given the IAM 
role information to access that S3 bucket. From what I understand, AWS allows 
us to attach a role to a resource at the time it is created. However, I don't 
see an option for specifying the role using the spark_ec2.py script. 
So I created a spark cluster using the default role, but I was not able to 
change its IAM role after creation through AWS console.

I see a ticket for this issue:
https://github.com/apache/spark/pull/6962 and the status is closed. 

If anyone knows how I can specify the role using spark_ec2.py, please let me 
know. I am using spark 1.4.1.

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Specifying-the-role-when-launching-an-AWS-spark-cluster-using-spark-ec2-tp24154.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: Help accessing protected S3

2015-07-23 Thread Ewan Leith
I think the standard S3 driver used in Spark from the Hadoop project (S3n) 
doesn't support IAM role based authentication.

However, S3a should support it. If you're running Hadoop 2.6 via the spark-ec2 
scripts (I'm not sure what it launches with by default) try accessing your 
bucket via s3a:// URLs instead of s3n://
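
i.e. something like this, where the bucket and path are placeholders:

val data = sc.textFile("s3a://my-protected-bucket/path/to/data/*")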

http://wiki.apache.org/hadoop/AmazonS3

https://issues.apache.org/jira/browse/HADOOP-10400

Thanks,
Ewan



-Original Message-
From: Greg Anderson [mailto:gregory.ander...@familysearch.org] 
Sent: 22 July 2015 18:00
To: user@spark.apache.org
Subject: Help accessing protected S3

I have a protected s3 bucket that requires a certain IAM role to access.  When 
I start my cluster using the spark-ec2 script, everything works just fine until 
I try to read from that part of s3.  Here is the command I am using:

./spark-ec2 -k KEY -i KEY_FILE.pem --additional-security-group=IAM_ROLE 
--copy-aws-credentials --zone=us-east-1e -t m1.large --worker-instances=3 
--hadoop-major-version=2.7.1 --user-data=test.sh launch my-cluster

I have read through this article: 
http://apache-spark-user-list.1001560.n3.nabble.com/S3-Bucket-Access-td16303.html

The problem seems to be very similar, but I wasn't able to find a solution in 
it for me.  I'm not sure what else to provide here, just let me know what you 
need.  Thanks in advance!
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org





RE: coalesce on dataFrame

2015-07-01 Thread Ewan Leith
It's in spark 1.4.0, or should be at least:

https://issues.apache.org/jira/browse/SPARK-6972
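
On 1.4 it's a method on the DataFrame itself, e.g. (a minimal sketch,
hypothetical output path):

    // shrink to a single partition before writing; DataFrame.coalesce only takes
    // the partition count, so use repartition(1) if you need a full shuffle
    df.coalesce(1).write.parquet("hdfs:///tmp/single-output/")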

Ewan

-Original Message-
From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com] 
Sent: 01 July 2015 08:23
To: user@spark.apache.org
Subject: coalesce on dataFrame

How can we use coalesce(1, true) on dataFrame?


Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-on-dataFrame-tp23564.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org





[jira] [Created] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

2015-06-18 Thread Ewan Leith (JIRA)
Ewan Leith created SPARK-8437:
-

 Summary: Using directory path without wildcard for filename slow 
for large number of files with wholeTextFiles and binaryFiles
 Key: SPARK-8437
 URL: https://issues.apache.org/jira/browse/SPARK-8437
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0, 1.3.1
 Environment: Ubuntu 15.04 + local filesystem
Amazon EMR + S3 + HDFS
Reporter: Ewan Leith
Priority: Minor


When calling wholeTextFiles or binaryFiles with a directory path with 10,000s 
of files in it, Spark hangs for a few minutes before processing the files.

If you add a * to the end of the path, there is no delay.

This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS, and 
on S3.

To reproduce, create a directory with 50,000 files in it, then run:


val a = sc.binaryFiles("file:/path/to/files/")
a.count()

val b = sc.binaryFiles("file:/path/to/files/*")
b.count()

and monitor the different startup times.

For example, in the spark-shell these commands are pasted in together, so the
delay at f.count() is from 10:11:08 to 10:13:29 to output "Total input paths to
process : 4", then until 10:15:42 to begin processing files:

scala> val f = sc.binaryFiles("file:/home/ewan/large/")
15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with 
curMem=0, maxMem=278019440
15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 156.9 KB, free 265.0 MB)
15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with 
curMem=160616, maxMem=278019440
15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 16.9 KB, free 265.0 MB)
15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
localhost:40430 (size: 16.9 KB, free: 265.1 MB)
15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at 
console:21
f: org.apache.spark.rdd.RDD[(String, 
org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ 
BinaryFileRDD[0] at binaryFiles at console:21

scala> f.count()
15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 4
15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 4
15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node 
allocation with : CompletedNodes: 1, size left: 0
15/06/18 10:15:42 INFO SparkContext: Starting job: count at console:24
15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at console:24) with 
4 output partitions (allowLocal=false)
15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at 
console:24)
15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List()

Adding a * to the end of the path removes the delay:


scala> val f = sc.binaryFiles("file:/home/ewan/large/*")
15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with 
curMem=0, maxMem=278019440
15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 156.9 KB, free 265.0 MB)
15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with 
curMem=160616, maxMem=278019440
15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 16.9 KB, free 265.0 MB)
15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 
localhost:42825 (size: 16.9 KB, free: 265.1 MB)
15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at 
console:21
f: org.apache.spark.rdd.RDD[(String, 
org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/* 
BinaryFileRDD[0] at binaryFiles at console:21

scala> f.count()
15/06/18 10:08:32 INFO FileInputFormat: Total input paths to process : 4
15/06/18 10:08:33 INFO FileInputFormat: Total input paths to process : 4
15/06/18 10:08:35 INFO CombineFileInputFormat: DEBUG: Terminated node 
allocation with : CompletedNodes: 1, size left: 0
15/06/18 10:08:35 INFO SparkContext: Starting job: count at console:24
15/06/18 10:08:35 INFO DAGScheduler: Got job 0 (count at console:24) with 
4 output partitions 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

2015-06-18 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591615#comment-14591615
 ] 

Ewan Leith commented on SPARK-8437:
---

Thanks, I wasn't sure if it was Hadoop or Spark specific, initially I thought 
it was S3 related but it happens all over.

If it is Hadoop, I don't know if it would be feasible for Spark to check if a 
directory has been given and add a wildcard in the background, that might not 
give the desired effect, but otherwise there's various doc changes to make.

 Using directory path without wildcard for filename slow for large number of 
 files with wholeTextFiles and binaryFiles
 -

 Key: SPARK-8437
 URL: https://issues.apache.org/jira/browse/SPARK-8437
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.3.1, 1.4.0
 Environment: Ubuntu 15.04 + local filesystem
 Amazon EMR + S3 + HDFS
Reporter: Ewan Leith
Priority: Minor

 When calling wholeTextFiles or binaryFiles with a directory path with 10,000s 
 of files in it, Spark hangs for a few minutes before processing the files.
 If you add a * to the end of the path, there is no delay.
 This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS, and 
 on S3.
 To reproduce, create a directory with 50,000 files in it, then run:
 val a = sc.binaryFiles(file:/path/to/files/)
 a.count()
 val b = sc.binaryFiles(file:/path/to/files/*)
 b.count()
 and monitor the different startup times.
 For example, in the spark-shell these commands are pasted in together, so the 
 delay at f.count() is from 10:11:08 t- 10:13:29 to output Total input paths 
 to process : 4, then until 10:15:42 to being processing files:
 scala val f = sc.binaryFiles(file:/home/ewan/large/)
 15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with 
 curMem=0, maxMem=278019440
 15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 156.9 KB, free 265.0 MB)
 15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with 
 curMem=160616, maxMem=278019440
 15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.9 KB, free 265.0 MB)
 15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:40430 (size: 16.9 KB, free: 265.1 MB)
 15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at 
 console:21
 f: org.apache.spark.rdd.RDD[(String, 
 org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ 
 BinaryFileRDD[0] at binaryFiles at console:21
 scala f.count()
 15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node 
 allocation with : CompletedNodes: 1, size left: 0
 15/06/18 10:15:42 INFO SparkContext: Starting job: count at console:24
 15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at console:24) with 
 4 output partitions (allowLocal=false)
 15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at 
 console:24)
 15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List()
 Adding a * to the end of the path removes the delay:
 scala val f = sc.binaryFiles(file:/home/ewan/large/*)
 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with 
 curMem=0, maxMem=278019440
 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 156.9 KB, free 265.0 MB)
 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with 
 curMem=160616, maxMem=278019440
 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.9 KB, free 265.0 MB)
 15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:42825 (size: 16.9 KB, free: 265.1 MB)
 15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at 
 console:21
 f: org.apache.spark.rdd.RDD[(String, 
 org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/* 
 BinaryFileRDD[0] at binaryFiles at console:21
 scala f.count()
 15/06/18 10:08:32 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:08:33 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:08:35 INFO CombineFileInputFormat: DEBUG: Terminated node 
 allocation with : CompletedNodes: 1, size left: 0
 15/06/18 10:08:35 INFO SparkContext: Starting job: count at console:24
 15/06/18 10:08:35 INFO DAGScheduler: Got job 0 (count at console:24) with 
 4 output partitions 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332

[jira] [Comment Edited] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

2015-06-18 Thread Ewan Leith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591615#comment-14591615
 ] 

Ewan Leith edited comment on SPARK-8437 at 6/18/15 10:51 AM:
-

Thanks, I wasn't sure if it was Hadoop or Spark specific, initially I thought 
it was S3 related but it happens all over.

If it is Hadoop, I don't know if it would be feasible for Spark to check if a 
directory has been given and add a wildcard in the background, that might not 
give the desired effect, but otherwise there's various doc changes to make.

I've just tried this with textFile("/path/to/files/") and got the same issue,
so I assume it is a Hadoop thing, and documentation changes might be the best
option.


was (Author: ewanleith):
Thanks, I wasn't sure if it was Hadoop or Spark specific, initially I thought 
it was S3 related but it happens all over.

If it is Hadoop, I don't know if it would be feasible for Spark to check if a 
directory has been given and add a wildcard in the background, that might not 
give the desired effect, but otherwise there's various doc changes to make.

 Using directory path without wildcard for filename slow for large number of 
 files with wholeTextFiles and binaryFiles
 -

 Key: SPARK-8437
 URL: https://issues.apache.org/jira/browse/SPARK-8437
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.3.1, 1.4.0
 Environment: Ubuntu 15.04 + local filesystem
 Amazon EMR + S3 + HDFS
Reporter: Ewan Leith
Priority: Minor

 When calling wholeTextFiles or binaryFiles with a directory path with 10,000s 
 of files in it, Spark hangs for a few minutes before processing the files.
 If you add a * to the end of the path, there is no delay.
 This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS, and 
 on S3.
 To reproduce, create a directory with 50,000 files in it, then run:
 val a = sc.binaryFiles(file:/path/to/files/)
 a.count()
 val b = sc.binaryFiles(file:/path/to/files/*)
 b.count()
 and monitor the different startup times.
 For example, in the spark-shell these commands are pasted in together, so the 
 delay at f.count() is from 10:11:08 t- 10:13:29 to output Total input paths 
 to process : 4, then until 10:15:42 to being processing files:
 scala val f = sc.binaryFiles(file:/home/ewan/large/)
 15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with 
 curMem=0, maxMem=278019440
 15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 156.9 KB, free 265.0 MB)
 15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with 
 curMem=160616, maxMem=278019440
 15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.9 KB, free 265.0 MB)
 15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:40430 (size: 16.9 KB, free: 265.1 MB)
 15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at 
 console:21
 f: org.apache.spark.rdd.RDD[(String, 
 org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ 
 BinaryFileRDD[0] at binaryFiles at console:21
 scala f.count()
 15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node 
 allocation with : CompletedNodes: 1, size left: 0
 15/06/18 10:15:42 INFO SparkContext: Starting job: count at console:24
 15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at console:24) with 
 4 output partitions (allowLocal=false)
 15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at 
 console:24)
 15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List()
 Adding a * to the end of the path removes the delay:
 scala val f = sc.binaryFiles(file:/home/ewan/large/*)
 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with 
 curMem=0, maxMem=278019440
 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 156.9 KB, free 265.0 MB)
 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with 
 curMem=160616, maxMem=278019440
 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.9 KB, free 265.0 MB)
 15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:42825 (size: 16.9 KB, free: 265.1 MB)
 15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at 
 console:21
 f: org.apache.spark.rdd.RDD[(String, 
 org.apache.spark.input.PortableDataStream)] = file:/home

RE: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Try putting a * on the end of xmlDir, i.e.

xmlDir = "hdfs:///abc/def/*"

Rather than

xmlDir = "hdfs:///abc/def"

and see what happens. I don't know why, but that appears to be more reliable 
for me with S3 as the filesystem.

I'm also using binaryFiles, but I've tried running the same command with
wholeTextFiles and had the same error.

Ewan

-Original Message-
From: Kostas Kougios [mailto:kostas.koug...@googlemail.com] 
Sent: 08 June 2015 15:02
To: user@spark.apache.org
Subject: spark timesout maybe due to binaryFiles() with more than 1 million 
files in HDFS

I am reading millions of xml files via

val xmls = sc.binaryFiles(xmlDir)

The operation runs fine locally but on yarn it fails with:

 client token: N/A
 diagnostics: Application application_1433491939773_0012 failed 2 times due to 
ApplicationMaster for attempt appattempt_1433491939773_0012_02 timed out. 
Failing the application.
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1433750951883
 final status: FAILED
 tracking URL:
http://controller01:8088/cluster/app/application_1433491939773_0012
 user: ariskk
Exception in thread main org.apache.spark.SparkException: Application 
finished with failed status at 
org.apache.spark.deploy.yarn.Client.run(Client.scala:622)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:647)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

On hadoops/userlogs logs I am frequently getting these messages:

15/06/08 09:15:38 WARN util.AkkaUtils: Error sending message [message = 
Heartbeat(1,[Lscala.Tuple2;@2b4f336b,BlockManagerId(1,
controller01.stratified, 58510))] in 2 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at 
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)

I run my spark job via spark-submit and it works for an other HDFS directory 
that contains only 37k files. Any ideas how to resolve this?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-timesout-maybe-due-to-binaryFiles-with-more-than-1-million-files-in-HDFS-tp23208.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org





RE: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Can you do a simple

sc.binaryFiles("hdfs:///path/to/files/*").count()

in the spark-shell and verify that part works?

Ewan



-Original Message-
From: Konstantinos Kougios [mailto:kostas.koug...@googlemail.com] 
Sent: 08 June 2015 15:40
To: Ewan Leith; user@spark.apache.org
Subject: Re: spark timesout maybe due to binaryFiles() with more than 1 million 
files in HDFS

No luck I am afraid. After giving the namenode 16GB of RAM, I am still getting 
an out of mem exception, kind of different one:

15/06/08 15:35:52 ERROR yarn.ApplicationMaster: User class threw
exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
 at
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1351)
 at
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1413)
 at
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1524)
 at
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1533)
 at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:557)
 at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
 at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy10.getListing(Unknown Source)
 at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
 at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
 at
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:724)
 at
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
 at
org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
 at
org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
 at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
 at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
 at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1644)
 at
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:292)
 at
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
 at
org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:47)
 at
org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:43)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)


and on the 2nd retry of spark, a similar exception:

java.lang.OutOfMemoryError: GC overhead limit exceeded
 at
com.google.protobuf.LiteralByteString.toString(LiteralByteString.java:148)
 at com.google.protobuf.ByteString.toStringUtf8(ByteString.java:572)
 at
org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$HdfsFileStatusProto.getOwner(HdfsProtos.java:21558)
 at
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1413)
 at
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1524)
 at
org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1533)
 at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:557)
 at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
 at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy10.getListing(Unknown Source)
 at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1969)
 at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1952)
 at
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:724)
 at
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
 at
org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755

RE: redshift spark

2015-06-05 Thread Ewan Leith
That project is for reading data in from Redshift table exports stored in S3 by
running commands in Redshift like this:

unload ('select * from venue')   
to 's3://mybucket/tickit/unload/'

http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html

The path in the parameters below is the s3 bucket path.
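
So, plugging in the unload location above, the call would look roughly like
this (a sketch; whether you use s3://, s3n:// or s3a:// depends on your Hadoop
S3 filesystem setup):

    import com.databricks.spark.redshift.RedshiftInputFormat

    // point the path at the S3 prefix the UNLOAD command wrote to
    val records = sc.newAPIHadoopFile(
      "s3://mybucket/tickit/unload/",
      classOf[RedshiftInputFormat],
      classOf[java.lang.Long],
      classOf[Array[String]])
    // each element pairs a record offset with the row's fields as strings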

Hope this helps,
Ewan

-Original Message-
From: Hafiz Mujadid [mailto:hafizmujadi...@gmail.com] 
Sent: 05 June 2015 15:25
To: user@spark.apache.org
Subject: redshift spark

Hi All,

I want to read and write data to AWS Redshift. I found the spark-redshift
project at the following address.
https://github.com/databricks/spark-redshift

In its documentation the following code is given:
import com.databricks.spark.redshift.RedshiftInputFormat

val records = sc.newAPIHadoopFile(
  path,
  classOf[RedshiftInputFormat],
  classOf[java.lang.Long],
  classOf[Array[String]])

I am unable to understand its parameters. Can somebody explain how to use
this? What is meant by path in this case?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/redshift-spark-tp23175.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org





AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Hi all,

I might be missing something, but does the new Spark 1.3 sqlContext save 
interface support using Avro as the schema structure when writing Parquet 
files, in a similar way to AvroParquetWriter (which I've got working)?

I've seen how you can load an avro file and save it as parquet from 
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html,
 but not using the 2 together.

Thanks, and apologies if I've missed something obvious!

Ewan


RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Thanks Cheng, that's brilliant, you've saved me a headache.

Ewan

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: 19 May 2015 11:58
To: Ewan Leith; user@spark.apache.org
Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or 
createDataFrame Interfaces?

That's right. Also, Spark SQL can automatically infer schema from JSON 
datasets. You don't need to specify an Avro schema:

   sqlContext.jsonFile("json/path").saveAsParquetFile("parquet/path")

or with the new reader/writer API introduced in 1.4-SNAPSHOT:

   sqlContext.read.json("json/path").write.parquet("parquet/path")

Cheng
On 5/19/15 6:07 PM, Ewan Leith wrote:
Thanks Cheng, that makes sense.

So for new dataframe creation (not conversion from Avro but from JSON or CSV 
inputs) in Spark we shouldn't worry about using Avro at all, just use the Spark 
SQL StructType when building new Dataframes? If so, that will be a lot simpler!

Thanks,
Ewan

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: 19 May 2015 11:01
To: Ewan Leith; user@spark.apache.org
Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or 
createDataFrame Interfaces?

Hi Ewan,

Different from AvroParquetWriter, in Spark SQL we use StructType as the
intermediate schema format. So when converting Avro files to Parquet files, we
internally convert the Avro schema to a Spark SQL StructType first, and then
convert the StructType to a Parquet schema.

Cheng
On 5/19/15 4:42 PM, Ewan Leith wrote:
Hi all,

I might be missing something, but does the new Spark 1.3 sqlContext save 
interface support using Avro as the schema structure when writing Parquet 
files, in a similar way to AvroParquetWriter (which I've got working)?

I've seen how you can load an avro file and save it as parquet from 
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html,
 but not using the 2 together.

Thanks, and apologies if I've missed something obvious!

Ewan




RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Thanks Cheng, that makes sense.

So for new dataframe creation (not conversion from Avro but from JSON or CSV 
inputs) in Spark we shouldn't worry about using Avro at all, just use the Spark 
SQL StructType when building new Dataframes? If so, that will be a lot simpler!
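
Something along these lines, presumably (a minimal sketch, hypothetical
two-field schema):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // build the schema directly as a StructType, no Avro involved
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))

    val rows = sc.parallelize(Seq(Row(1L, "alice"), Row(2L, "bob")))
    val df = sqlContext.createDataFrame(rows, schema)
    df.saveAsParquetFile("parquet/path")  // or df.write.parquet(...) on 1.4+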

Thanks,
Ewan

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: 19 May 2015 11:01
To: Ewan Leith; user@spark.apache.org
Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or 
createDataFrame Interfaces?

Hi Ewan,

Different from AvroParquetWriter, in Spark SQL we use StructType as the
intermediate schema format. So when converting Avro files to Parquet files, we
internally convert the Avro schema to a Spark SQL StructType first, and then
convert the StructType to a Parquet schema.

Cheng
On 5/19/15 4:42 PM, Ewan Leith wrote:
Hi all,

I might be missing something, but does the new Spark 1.3 sqlContext save 
interface support using Avro as the schema structure when writing Parquet 
files, in a similar way to AvroParquetWriter (which I've got working)?

I've seen how you can load an avro file and save it as parquet from 
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html,
 but not using the 2 together.

Thanks, and apologies if I've missed something obvious!

Ewan



Re: [nodejs] Unable to compile node v0.8 RC7 on ARM (Beaglebone)

2012-06-22 Thread Ewan Leith
Thanks Ben, I added armv7 then it started complaining about arm_neon, so 
I've added that too, to deps/v8/build/common.gypi

I've added these 2 lines at the top, just inside the variables section:

'armv7%':'1',
'arm_neon%':'1',

configure completes now, so I'll try a straight make without any additional 
compiler flags and see what happens.

Thanks,
Ewan

On Friday, 22 June 2012 14:51:29 UTC+1, Ben Noordhuis wrote:

 On Fri, Jun 22, 2012 at 3:39 PM, Ewan Leith wrote: 
  Hi all, I'm trying to compile node v0.8 rc7 on my beaglebone, and the 
  configure script is falling over. It does work fine on the 0.6 branch. 
 Has 
  anyone built node v0.8 or v0.7 on ARM? 
  
  Running configure (with or without --dest-cpu=arm) gives the following 
 output: 
  
  ./configure --dest-cpu=arm 
  { 'target_defaults': { 'cflags': [], 
 'default_configuration': 'Release', 
 'defines': [], 
 'include_dirs': [], 
 'libraries': []}, 
'variables': { 'host_arch': 'arm', 
   'node_install_npm': 'true', 
   'node_install_waf': 'true', 
   'node_prefix': '', 
   'node_shared_openssl': 'false', 
   'node_shared_v8': 'false', 
   'node_shared_zlib': 'false', 
   'node_use_dtrace': 'false', 
   'node_use_etw': 'false', 
   'node_use_openssl': 'true', 
   'strict_aliasing': 'true', 
   'target_arch': 'arm', 
   'v8_use_snapshot': 'true'}} 
  creating  ./config.gypi 
  creating  ./config.mk 
  Traceback (most recent call last): 
File tools/gyp_node, line 58, in module 
  run_gyp(gyp_args) 
File tools/gyp_node, line 18, in run_gyp 
  rc = gyp.main(args) 
File ./tools/gyp/pylib/gyp/__init__.py, line 471, in main 
  options.circular_check) 
File ./tools/gyp/pylib/gyp/__init__.py, line 111, in Load 
  depth, generator_input_info, check, circular_check) 
File ./tools/gyp/pylib/gyp/input.py, line 2289, in Load 
  depth, check) 
File ./tools/gyp/pylib/gyp/input.py, line 433, in 
 LoadTargetBuildFile 
  includes, depth, check) 
File ./tools/gyp/pylib/gyp/input.py, line 387, in 
 LoadTargetBuildFile 
  build_file_path) 
File ./tools/gyp/pylib/gyp/input.py, line 984, in 
  ProcessVariablesAndConditionsInDict 
  ProcessConditionsInDict(the_dict, is_late, variables, build_file) 
File ./tools/gyp/pylib/gyp/input.py, line 861, in 
  ProcessConditionsInDict 
  variables, build_file) 
File ./tools/gyp/pylib/gyp/input.py, line 1010, in 
  ProcessVariablesAndConditionsInDict 
  build_file) 
File ./tools/gyp/pylib/gyp/input.py, line 1025, in 
  ProcessVariablesAndConditionsInList 
  ProcessVariablesAndConditionsInDict(item, is_late, variables, 
  build_file) 
File ./tools/gyp/pylib/gyp/input.py, line 1010, in 
  ProcessVariablesAndConditionsInDict 
  build_file) 
File ./tools/gyp/pylib/gyp/input.py, line 1025, in 
  ProcessVariablesAndConditionsInList 
  ProcessVariablesAndConditionsInDict(item, is_late, variables, 
  build_file) 
File ./tools/gyp/pylib/gyp/input.py, line 984, in 
  ProcessVariablesAndConditionsInDict 
  ProcessConditionsInDict(the_dict, is_late, variables, build_file) 
File ./tools/gyp/pylib/gyp/input.py, line 861, in 
  ProcessConditionsInDict 
  variables, build_file) 
File ./tools/gyp/pylib/gyp/input.py, line 984, in 
  ProcessVariablesAndConditionsInDict 
  ProcessConditionsInDict(the_dict, is_late, variables, build_file) 
File ./tools/gyp/pylib/gyp/input.py, line 842, in 
  ProcessConditionsInDict 
  if eval(ast_code, {'__builtins__': None}, variables): 
File string, line 1, in module 
  NameError: name 'armv7' is not defined while evaluating condition 
 'armv7==1' 
  in /tmp/node-v0.8.0/deps/v8/tools/gyp/v8.gyp while loading dependencies 
 of 
  /tmp/node-v0.8.0/node.gyp while trying to load /tmp/node-v0.8.0/node.gyp 
  
  
  Hacking deps/v8/tools/gyp/v8.gyp to just remove the condition that 
 it's 
  complaining about at line 997, lets the build continue and complete, but 
 it 
  then crashes when built (probably because I'm excluding relevant 
 switches to 
  gcc by changing the configure output). 
  
  The section of v8.gyp is around line 149: 
  
'conditions': [ 
  ['armv7==1', { 
  
  Has anyone built node v0.8 or v0.7 on ARM or have some ideas of how to 
  resolve this? 

 Gah, I'm still waiting for my BeagleBone. 

 Does it work when you add `'armv7%':'1'` (sans backticks) to the 
 variables section of config.gypi or common.gypi? 


-- 
Job Board: http://jobs.nodejs.org/
Posting guidelines: 
https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups nodejs group.
To post

Re: [nodejs] Unable to compile node v0.8 RC7 on ARM (Beaglebone)

2012-06-22 Thread Ewan Leith
The compile completes with those 2 additional lines to 
deps/v8/build/common.gypi but running make test raises various errors like

[00:44|%   7|+  27|-   4]: release test-tls-npn-server-client*** glibc 
detected *** out/Release/node: free(): invalid pointer: 0x00717308 ***


so I'll take a look and see if I can work out what's going on.

Thanks,
Ewan




On Friday, 22 June 2012 15:44:59 UTC+1, Tim Caswell wrote:

 Also, I've recently noticed that archlinux ARM has nodejs in their 
 repository.  It's usually quite up to date.  They support many arm devices 
 including the beaglebone and the raspberry pi.  Even if you don't want to 
 switch to archlinux or just want a newer node than the one they are 
 packaging, their PKGBUILD https://wiki.archlinux.org/index.php/PKGBUILD 
 should 
 be up somewhere to see how it's done for each platform.

 http://archlinuxarm.org/platforms/armv7/beaglebone
 http://archlinuxarm.org/packages - search for armv7 and nodejs. 
  Currently it has v0.6.19
 http://archlinuxarm.org/developers/building-packages

 On Fri, Jun 22, 2012 at 9:06 AM, Ewan Leith wrote:

 Thanks Ben, I added armv7 then it started complaining about arm_neon, so 
 I've added that too, to deps/v8/build/common.gypi

 I've added these 2 lines at the top, just inside the variables section:

 'armv7%':'1',
 'arm_neon%':'1',

 configure completes now, so I'll try a straight make without any 
 additional compiler flags and see what happens.

 Thanks,
 Ewan

 On Friday, 22 June 2012 14:51:29 UTC+1, Ben Noordhuis wrote:

 On Fri, Jun 22, 2012 at 3:39 PM, Ewan Leith wrote: 
  Hi all, I'm trying to compile node v0.8 rc7 on my beaglebone, and the 
  configure script is falling over. It does work fine on the 0.6 branch. 
 Has 
  anyone built node v0.8 or v0.7 on ARM? 
  
  Running configure (with or without --dest-cpu=arm) gives the following 
 output: 
  
  ./configure --dest-cpu=arm 
  { 'target_defaults': { 'cflags': [], 
 'default_configuration': 'Release', 
 'defines': [], 
 'include_dirs': [], 
 'libraries': []}, 
'variables': { 'host_arch': 'arm', 
   'node_install_npm': 'true', 
   'node_install_waf': 'true', 
   'node_prefix': '', 
   'node_shared_openssl': 'false', 
   'node_shared_v8': 'false', 
   'node_shared_zlib': 'false', 
   'node_use_dtrace': 'false', 
   'node_use_etw': 'false', 
   'node_use_openssl': 'true', 
   'strict_aliasing': 'true', 
   'target_arch': 'arm', 
   'v8_use_snapshot': 'true'}} 
  creating  ./config.gypi 
  creating  ./config.mk 
  Traceback (most recent call last): 
File tools/gyp_node, line 58, in module 
  run_gyp(gyp_args) 
File tools/gyp_node, line 18, in run_gyp 
  rc = gyp.main(args) 
    File ./tools/gyp/pylib/gyp/__init__.py, line 471, in main 
      options.circular_check) 
    File ./tools/gyp/pylib/gyp/__init__.py, line 111, in Load 
      depth, generator_input_info, check, circular_check) 
    File ./tools/gyp/pylib/gyp/input.py, line 2289, in Load 
      depth, check) 
    File ./tools/gyp/pylib/gyp/input.py, line 433, in 
   LoadTargetBuildFile 
      includes, depth, check) 
    File ./tools/gyp/pylib/gyp/input.py, line 387, in 
   LoadTargetBuildFile 
      build_file_path) 
    File ./tools/gyp/pylib/gyp/input.py, line 984, in 
     ProcessVariablesAndConditionsInDict 
      ProcessConditionsInDict(the_dict, is_late, variables, 
   build_file) 
    File ./tools/gyp/pylib/gyp/input.py, line 861, in 
     ProcessConditionsInDict 
      variables, build_file) 
    File ./tools/gyp/pylib/gyp/input.py, line 1010, in 
     ProcessVariablesAndConditionsInDict 
      build_file) 
    File ./tools/gyp/pylib/gyp/input.py, line 1025, in 
     ProcessVariablesAndConditionsInList 
      ProcessVariablesAndConditionsInDict(item, is_late, variables, 
      build_file) 
    File ./tools/gyp/pylib/gyp/input.py, line 1010, in 
     ProcessVariablesAndConditionsInDict 
      build_file) 
    File ./tools/gyp/pylib/gyp/input.py, line 1025, in 
     ProcessVariablesAndConditionsInList 
      ProcessVariablesAndConditionsInDict(item, is_late, variables, 
      build_file) 
    File ./tools/gyp/pylib/gyp/input.py, line 984, in 
     ProcessVariablesAndConditionsInDict 
      ProcessConditionsInDict(the_dict, is_late, variables, 
   build_file) 
    File ./tools/gyp/pylib/gyp/input.py, line 861, in 
     ProcessConditionsInDict 
      variables, build_file) 
    File ./tools/gyp/pylib/gyp/input.py, line 984, in 
     ProcessVariablesAndConditionsInDict 
      ProcessConditionsInDict(the_dict, is_late, variables, 
   build_file) 
    File ./tools/gyp/pylib/gyp/input.py, line 842, in 
     ProcessConditionsInDict 
      if eval(ast_code, {'__builtins__': None}, variables

Suffix authentication in users file

2002-11-25 Thread Ewan Leith
I've just been trying to get FreeRADIUS working instead of Cistron RADIUS, 
but I've run into a problem with the Suffix parameter setting in 
/etc/raddb/users.

My understanding of the Suffix was that:

DEFAULT Suffix == NC, Auth-Type := System
 Service-Type = Framed-User,
 Framed-Protocol = PPP,
 Framed-IP-Address = 255.255.255.254,
 Framed-IP-Netmask = 255.0.0.0,
 Framed-Routing = Broadcast-Listen,
 Framed-Filter-Id = std.ppp,
 Framed-Compression = Van-Jacobsen-TCP-IP

would authenticate against the system with the username minus the NC
suffix, but this doesn't seem to be happening, and the username is being
passed in its entirety. I've found a Strip-User-Name setting but that
just seems to exist for the hints files.

Running radiusd -X I get
rad_recv: Access-Request packet from host 127.0.0.1:62037, id=39, length=61
 User-Name = testNC
 User-Password = password
 NAS-IP-Address = 255.255.255.255
 NAS-Port = 15
modcall: entering group authorize
   modcall[authorize]: module preprocess returns ok
rlm_chap: Could not find proper Chap-Password attribute in request
   modcall[authorize]: module chap returns noop
   modcall[authorize]: module mschap returns notfound
 rlm_realm: No '@' in User-Name = testNC, looking up realm NULL
 rlm_realm: No such realm NULL
   modcall[authorize]: module suffix returns noop
 users: Matched DEFAULT at 150
   modcall[authorize]: module files returns ok
modcall: group authorize returns ok
   rad_check_password:  Found Auth-Type System
auth: type System
modcall: entering group authenticate
   modcall[authenticate]: module unix returns notfound
modcall: group authenticate returns notfound
auth: Failed to validate the user.

Have I missed something? I'm sure this used to work as I expect in
Cistron RADIUS. Do I need to alter something to pass only the username
and not the suffix, or is it even possible?

I'm running AIX 4.3.3 and FreeRADIUS 0.80

Thanks,
Ewan


- 
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html


Re: Suffix authentication in users file

2002-11-25 Thread Ewan Leith
Works perfectly, thanks. Obvious when you think about it, I suppose :)

Ewan

Chris Parker wrote:


Yes, so use the 'hints' file as the documentation at the beginning of
the hints file tells you how to do exactly what you are looking for.

-Chris
--



- 
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html


[ADMIN] Moving a database

2001-12-10 Thread Ewan Leith

Hi all,

we recently upgraded from 6.53 to 7.1.2, and at the same time tried to move
the database to a new filesystem.

However, while the upgrade was 100% successful using pg_dumpall, we see that
postgres is still reading some files from the old file system (though only
updating the new files).

An example is pg_shadow which is read on both file systems whenever someone
seems to authenticate, but only updated on the new file system.

Does anyone have any ideas? Is it possible to move the location of Postgres
in this manner?

Thanks,
Ewan

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org