[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker

2021-11-02 Thread mathieu longtin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437585#comment-17437585
 ] 

mathieu longtin commented on SPARK-37185:
-

It seems to optimize for simple queries but not for more complex ones. That 
kind of makes sense for "select * from t", but any where clause can make it 
quite restrictive.

It looks like it scans the first partition, doesn't find enough data, then 
scans four partitions, then decides to scan everything. That's nice, but 
meanwhile I already have 20 workers reserved, so it wouldn't cost anything 
more to just go ahead right away.
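
If the slow ramp-up itself is the problem, one possible knob is the limit scale-up factor; a minimal sketch (assuming the {{spark.sql.limit.scaleUpFactor}} setting is available and tunable at runtime on this version):
{code:java}
# Sketch: raise the factor by which Spark grows the number of scanned
# partitions between take()/limit attempts (the default is 4), so a very
# selective filter goes wide sooner instead of scanning 1, then 4 partitions.
spark.conf.set("spark.sql.limit.scaleUpFactor", 20)
spark.sql("select * from t where x = 99").take(10)
{code}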

Timing: the table is not cached and consists of 69 csv.gz files ranging from 
1MB to 2.2GB of data:
{code:java}
In [1]: %time spark.sql("select * from t where x = 99").take(10)
CPU times: user 83.9 ms, sys: 112 ms, total: 196 ms
Wall time: 6min 44s
...
In [2]: %time spark.sql("select * from t where x = 99").limit(10).rdd.collect()
CPU times: user 45.7 ms, sys: 73.9 ms, total: 120 ms
Wall time: 3min 59s
...


{code}
I ran the two tests a few times to make sure there was no OS-level caching 
effect; the timings didn't change much.

If I cache the table first, then "take(10)" is faster than 
"limit(10).rdd.collect()".

> DataFrame.take() only uses one worker
> -
>
> Key: SPARK-37185
> URL: https://issues.apache.org/jira/browse/SPARK-37185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
> Environment: CentOS 7
>Reporter: mathieu longtin
>Priority: Major
>
> Say you have this query:
> {code:java}
> >>> df = spark.sql("select * from mytable where x = 99"){code}
> Now, out of billions of rows, there are only ten rows where x is 99.
> If I do:
> {code:java}
> >>> df.limit(10).collect()
> [Stage 1:>  (0 + 1) / 1]{code}
> It only uses one worker. This takes a really long time since one CPU is 
> reading billions of rows.
> However, if I do this:
> {code:java}
> >>> df.limit(10).rdd.collect()
> [Stage 1:>  (0 + 10) / 22]{code}
> All the workers are running.
> I think there's an optimization issue with DataFrame.take(...).
> This did not use to be an issue, but I'm not sure whether it worked in 3.0 
> or 2.4.






[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker

2021-11-01 Thread mathieu longtin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437009#comment-17437009
 ] 

mathieu longtin commented on SPARK-37185:
-

Additional note: if there's a "group by" in the query, this is not an issue.
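
A minimal sketch of that observation, reusing the table from the timings above:
{code:java}
# The aggregate forces a shuffle, so the scan below runs on all partitions
# even though only a handful of rows survive the filter.
spark.sql("select x, count(*) as n from t where x = 99 group by x").take(10)
{code}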

> DataFrame.take() only uses one worker
> -
>
> Key: SPARK-37185
> URL: https://issues.apache.org/jira/browse/SPARK-37185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
> Environment: CentOS 7
>Reporter: mathieu longtin
>Priority: Major
>
> Say you have this query:
> {code:java}
> >>> df = spark.sql("select * from mytable where x = 99"){code}
> Now, out of billions of rows, there are only ten rows where x is 99.
> If I do:
> {code:java}
> >>> df.limit(10).collect()
> [Stage 1:>  (0 + 1) / 1]{code}
> It only uses one worker. This takes a really long time since one CPU is 
> reading billions of rows.
> However, if I do this:
> {code:java}
> >>> df.limit(10).rdd.collect()
> [Stage 1:>  (0 + 10) / 22]{code}
> All the workers are running.
> I think there's an optimization issue with DataFrame.take(...).
> This did not use to be an issue, but I'm not sure whether it worked in 3.0 
> or 2.4.






[jira] [Created] (SPARK-37185) DataFrame.take() only uses one worker

2021-11-01 Thread mathieu longtin (Jira)
mathieu longtin created SPARK-37185:
---

 Summary: DataFrame.take() only uses one worker
 Key: SPARK-37185
 URL: https://issues.apache.org/jira/browse/SPARK-37185
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.1.1
 Environment: CentOS 7
Reporter: mathieu longtin


Say you have this query:
{code:java}
>>> df = spark.sql("select * from mytable where x = 99"){code}
Now, out of billions of rows, there are only ten rows where x is 99.

If I do:
{code:java}
>>> df.limit(10).collect()
[Stage 1:>  (0 + 1) / 1]{code}
It only uses one worker. This takes a really long time since one CPU is 
reading billions of rows.

However, if I do this:
{code:java}
>>> df.limit(10).rdd.collect()
[Stage 1:>  (0 + 10) / 22]{code}
All the workers are running.

I think there's an optimization issue with DataFrame.take(...).

This did not use to be an issue, but I'm not sure whether it worked in 3.0 or 
2.4.






[jira] [Comment Edited] (SPARK-24753) bad backslash parsing in SQL statements

2018-07-10 Thread mathieu longtin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539023#comment-16539023
 ] 

mathieu longtin edited comment on SPARK-24753 at 7/10/18 5:59 PM:
--

Thanks for the response. Yes, it does work with escapedStringLiterals.

However, this behavior is inconsistent. In the doc example 
([https://spark.apache.org/docs/2.3.0/api/sql/index.html#rlike]):
{code:java}
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
{code}
The examples are totally wrong; in fact, they produce an error. To reproduce 
the example using the *spark-sql* command with _escapedStringLiterals=False_, 
I need this:

 
{code:java}
When spark.sql.parser.escapedStringLiterals is disabled (default):
> SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*'
true{code}
Notice the double and quadruple backslashes. Somehow, the right side of rlike 
gets decoded, then passed to the rlike function, which decodes it again.

BTW, from spark-sql, single backslashes are no good:
{code:java}
> SELECT '%SystemDrive%\Users\John' ;
%SystemDrive%UsersJohn
{code}
Oops, the backslashes get swallowed.
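
For completeness, a minimal sketch of the escapedStringLiterals workaround mentioned above (assuming the setting can be toggled at runtime on this version):
{code:java}
# With escaped string literals enabled, the SQL parser leaves backslashes
# alone, so the documented rlike example works as written.
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
spark.sql(r"SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'").show()
{code}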

 

 


was (Author: mathieulongtin):
Thanks for the response. Yes, it does work with escapedStringLiterals.

However, this behavior is inconsistent. In the doc example 
(https://spark.apache.org/docs/2.3.0/api/sql/index.html#rlike):
{code:java}
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
{code}
The examples are totally wrong; in fact, they produce an error. To reproduce 
the example using the *spark-sql* command with _escapedStringLiterals=False_, 
I need this:

 
{code:java}
When spark.sql.parser.escapedStringLiterals is disabled (default):
> SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*'
true{code}
Notice the double and quadruple backslashes. Somehow, the right side of rlike 
gets decoded, then passed to the rlike function, which decodes it again.

BTW, from spark-sql:
{code:java}
> SELECT '%SystemDrive%\Users\John' ;
%SystemDrive%UsersJohn
{code}
Oops, the backslashes get swallowed.

 

 

> bad backslash parsing in SQL statements
> --
>
> Key: SPARK-24753
> URL: https://issues.apache.org/jira/browse/SPARK-24753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment:       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
>       /_/
> Using Python version 2.7.12 (default, Jul 15 2016 11:23:12)
>Reporter: mathieu longtin
>Priority: Minor
>
> When putting backslashes in SQL code, you need to double them (or rather, 
> double them twice).
> The code is in Python, but I verified the problem is the same in Scala.
> Line [3] should return the row, and line [4] shouldn't.
>  
> {code:java}
> In [1]: df = spark.createDataFrame([("abc def ghi",)], schema=["s"])
> In [2]: df.filter(df.s.rlike('\\bdef\\b')).show()
> +-----------+
> |          s|
> +-----------+
> |abc def ghi|
> +-----------+
> In [3]: df.filter("s rlike '\\bdef\\b'").show()
> +-----------+
> |          s|
> +-----------+
> +-----------+
> In [4]: df.filter("s rlike '\\\\bdef\\\\b'").show()
> +-----------+
> |          s|
> +-----------+
> |abc def ghi|
> +-----------+
> {code}
>  






[jira] [Commented] (SPARK-24753) bad backslash parsing in SQL statements

2018-07-10 Thread mathieu longtin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539023#comment-16539023
 ] 

mathieu longtin commented on SPARK-24753:
-

Thanks for the response. Yes, it does work with escapedStringLiterals.

However, this behavior is inconsistent. In the doc example 
(https://spark.apache.org/docs/2.3.0/api/sql/index.html#rlike):
{code:java}
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
{code}
The examples are totally wrong; in fact, they produce an error. To reproduce 
the example using the *spark-sql* command with _escapedStringLiterals=False_, 
I need this:

 
{code:java}
When spark.sql.parser.escapedStringLiterals is disabled (default):
> SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*'
true{code}
Notice the double and quadruple backslashes. Somehow, the right side of rlike 
gets decoded, then passed to the rlike function, which decodes it again.

BTW, from spark-sql:
{code:java}
> SELECT '%SystemDrive%\Users\John' ;
%SystemDrive%UsersJohn
{code}
Oops, the backslashes get swallowed.

 

 

> bad backslash parsing in SQL statements
> --
>
> Key: SPARK-24753
> URL: https://issues.apache.org/jira/browse/SPARK-24753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment:       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
>       /_/
> Using Python version 2.7.12 (default, Jul 15 2016 11:23:12)
>Reporter: mathieu longtin
>Priority: Minor
>
> When putting backslashes in SQL code, you need to double them (or rather, 
> double them twice).
> The code is in Python, but I verified the problem is the same in Scala.
> Line [3] should return the row, and line [4] shouldn't.
>  
> {code:java}
> In [1]: df = spark.createDataFrame([("abc def ghi",)], schema=["s"])
> In [2]: df.filter(df.s.rlike('\\bdef\\b')).show()
> +-----------+
> |          s|
> +-----------+
> |abc def ghi|
> +-----------+
> In [3]: df.filter("s rlike '\\bdef\\b'").show()
> +-----------+
> |          s|
> +-----------+
> +-----------+
> In [4]: df.filter("s rlike '\\\\bdef\\\\b'").show()
> +-----------+
> |          s|
> +-----------+
> |abc def ghi|
> +-----------+
> {code}
>  






[jira] [Created] (SPARK-24753) bad backslash parsing in SQL statements

2018-07-06 Thread mathieu longtin (JIRA)
mathieu longtin created SPARK-24753:
---

 Summary: bad backslash parsing in SQL statements
 Key: SPARK-24753
 URL: https://issues.apache.org/jira/browse/SPARK-24753
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment:       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
       /_/

Using Python version 2.7.12 (default, Jul 15 2016 11:23:12)
Reporter: mathieu longtin


When putting backslashes in SQL code, you need to double them (or rather, 
double them twice).

The code is in Python, but I verified the problem is the same in Scala.

Line [3] should return the row, and line [4] shouldn't.

 
{code:java}
In [1]: df = spark.createDataFrame([("abc def ghi",)], schema=["s"])
In [2]: df.filter(df.s.rlike('\\bdef\\b')).show()
+-----------+
|          s|
+-----------+
|abc def ghi|
+-----------+

In [3]: df.filter("s rlike '\\bdef\\b'").show()
+-----------+
|          s|
+-----------+
+-----------+

In [4]: df.filter("s rlike '\\\\bdef\\\\b'").show()
+-----------+
|          s|
+-----------+
|abc def ghi|
+-----------+
{code}
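
A sketch of the workaround this implies on the Python side: either quadruple the backslashes or use a raw string so that only the SQL-level doubling remains.
{code:java}
# Both filters below should return the "abc def ghi" row with default settings:
# Python strips one level of backslashes and the SQL parser strips another.
df.filter(r"s rlike '\\bdef\\b'").show()      # raw string: SQL sees \\b, the regex sees \b
df.filter("s rlike '\\\\bdef\\\\b'").show()   # same thing without a raw string
{code}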
 






[jira] [Commented] (SPARK-15531) spark-class tries to use too much memory when running Launcher

2016-05-25 Thread mathieu longtin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300986#comment-15300986
 ] 

mathieu longtin commented on SPARK-15531:
-

Correct, on a 128G server, just running {{java}} with no argument will try to 
allocate 32G, regardless of ulimit. It's "expected behavior" according to 
Oracle.

> spark-class tries to use too much memory when running Launcher
> --
>
> Key: SPARK-15531
> URL: https://issues.apache.org/jira/browse/SPARK-15531
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1, 2.0.0
> Environment: Linux running in Univa or Sun Grid Engine
>Reporter: mathieu longtin
>Priority: Minor
>  Labels: launcher
>
> When running Java on a server with a lot of memory but a rather small virtual 
> memory ulimit, Java will try to allocate a large memory pool and fail:
> {code}
> # System has 128GB of Ram but ulimit set to 7.5G
> $ ulimit -v
> 7812500
> $ java -client
> Error occurred during initialization of VM
> Could not reserve enough space for object heap
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.
> {code}
> This is a known issue with Java, but unlikely to get fixed.
> As a result, when starting various Spark processes (spark-submit, master or 
> workers), they fail when {{spark-class}} tries to run 
> {{org.apache.spark.launcher.Main}}.
> To fix this, add {{-Xmx128m}} to this line:
> {code}
> "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@"
> {code}
> (https://github.com/apache/spark/blob/master/bin/spark-class#L71)
> We've been using 128m and that works in our setup. Considering all the 
> launcher does is analyze the arguments and env vars and spit out some 
> command, it should be plenty. All other calls to Java seem to include some 
> value for -Xmx, so it is not an issue elsewhere.
> I don't mind submitting a PR, but I'm sure somebody has opinions on the 128m 
> (bigger, smaller, configurable, ...), so I'd rather it be discussed first.






[jira] [Commented] (SPARK-15531) spark-class tries to use too much memory when running Launcher

2016-05-25 Thread mathieu longtin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300666#comment-15300666
 ] 

mathieu longtin commented on SPARK-15531:
-

The VM that spark-class launches afterwards has an -Xmx argument; it comes 
from --executor-memory, --driver-memory, or somewhere else.

This is only a problem when letting Java decide what -Xmx should be. By 
default, it's a quarter of the physical memory, and it tries to reserve it 
right away, regardless of actual need.
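
The rough arithmetic behind that, as a sketch (the exact default-heap heuristic varies by JVM version and flags):
{code:java}
# A quarter of 128 GB of RAM is a 32 GB default max heap, while `ulimit -v`
# allows roughly 8 GB of virtual address space, so the reservation fails.
physical_gb = 128
default_heap_gb = physical_gb / 4        # ~32 GB requested by the JVM
ulimit_kb = 7812500                      # from `ulimit -v` (kilobytes)
ulimit_gb = ulimit_kb * 1024 / 1e9       # ~8 GB
print(default_heap_gb > ulimit_gb)       # True -> "Could not reserve enough space"
{code}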

> spark-class tries to use too much memory when running Launcher
> --
>
> Key: SPARK-15531
> URL: https://issues.apache.org/jira/browse/SPARK-15531
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.1, 2.0.0
> Environment: Linux running in Univa or Sun Grid Engine
>Reporter: mathieu longtin
>Priority: Minor
>  Labels: launcher
>
> When running Java on a server with a lot of memory but a rather small virtual 
> memory ulimit, Java will try to allocate a large memory pool and fail:
> {code}
> # System has 128GB of Ram but ulimit set to 7.5G
> $ ulimit -v
> 7812500
> $ java -client
> Error occurred during initialization of VM
> Could not reserve enough space for object heap
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.
> {code}
> This is a known issue with Java, but unlikely to get fixed.
> As a result, when starting various Spark processes (spark-submit, master or 
> workers), they fail when {{spark-class}} tries to run 
> {{org.apache.spark.launcher.Main}}.
> To fix this, add {{-Xmx128m}} to this line:
> {code}
> "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@"
> {code}
> (https://github.com/apache/spark/blob/master/bin/spark-class#L71)
> We've been using 128m and that works in our setup. Considering all the 
> launcher does is analyze the arguments and env vars and spit out some 
> command, it should be plenty. All other calls to Java seem to include some 
> value for -Xmx, so it is not an issue elsewhere.
> I don't mind submitting a PR, but I'm sure somebody has opinions on the 128m 
> (bigger, smaller, configurable, ...), so I'd rather it be discussed first.






[jira] [Created] (SPARK-15531) spark-class tries to use too much memory when running Launcher

2016-05-25 Thread mathieu longtin (JIRA)
mathieu longtin created SPARK-15531:
---

 Summary: spark-class tries to use too much memory when running 
Launcher
 Key: SPARK-15531
 URL: https://issues.apache.org/jira/browse/SPARK-15531
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.6.1, 2.0.0
 Environment: Linux running in Univa or Sun Grid Engine
Reporter: mathieu longtin
Priority: Minor


When running Java on a server with a lot of memory but a rather small virtual 
memory ulimit, Java will try to allocate a large memory pool and fail:

{code}
# System has 128GB of Ram but ulimit set to 7.5G
$ ulimit -v
7812500
$ java -client
Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
{code}

This is a known issue with Java, but unlikely to get fixed.

As a result, when starting various Spark processes (spark-submit, master or 
workers), they fail when {{spark-class}} tries to run 
{{org.apache.spark.launcher.Main}}.

To fix this, add {{-Xmx128m}} to this line:
{code}
"$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
{code}
(https://github.com/apache/spark/blob/master/bin/spark-class#L71)

We've been using 128m and that works in our setup. Considering all the launcher 
does is analyze the arguments and env vars and spit out some command, it should 
be plenty. All other calls to Java seem to include some value for -Xmx, so it 
is not an issue elsewhere.

I don't mind submitting a PR, but I'm sure somebody has opinions on the 128m 
(bigger, smaller, configurable, ...), so I'd rather it be discussed first.






[jira] [Issue Comment Deleted] (SPARK-13266) Python DataFrameReader converts None to "None" instead of null

2016-02-22 Thread mathieu longtin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mathieu longtin updated SPARK-13266:

Comment: was deleted

(was: https://github.com/apache/spark/pull/11305)

> Python DataFrameReader converts None to "None" instead of null
> --
>
> Key: SPARK-13266
> URL: https://issues.apache.org/jira/browse/SPARK-13266
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
> Environment: Linux standalone but probably applies to all
>Reporter: mathieu longtin
>  Labels: easyfix, patch
>
> If you do something like this:
> {code:none}
> tsv_loader = sqlContext.read.format('com.databricks.spark.csv')
> tsv_loader.options(quote=None, escape=None)
> {code}
> The loader sees the string "None" as the _quote_ and _escape_ options. The 
> loader should get a _null_.
> An easy fix is to modify *python/pyspark/sql/readwriter.py* near the top, 
> correcting the _to_str_ function. Here's the patch:
> {code:none}
> diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
> index a3d7eca..ba18d13 100644
> --- a/python/pyspark/sql/readwriter.py
> +++ b/python/pyspark/sql/readwriter.py
> @@ -33,10 +33,12 @@ __all__ = ["DataFrameReader", "DataFrameWriter"]
> 
>  def to_str(value):
>      """
> -    A wrapper over str(), but convert bool values to lower case string
> +    A wrapper over str(), but convert bool values to lower case string, and keep None
>      """
>      if isinstance(value, bool):
>          return str(value).lower()
> +    elif value is None:
> +        return value
>      else:
>          return str(value)
> 
> {code}
> This has been tested and works great.






[jira] [Commented] (SPARK-13266) Python DataFrameReader converts None to "None" instead of null

2016-02-22 Thread mathieu longtin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157260#comment-15157260
 ] 

mathieu longtin commented on SPARK-13266:
-

https://github.com/apache/spark/pull/11305

> Python DataFrameReader converts None to "None" instead of null
> --
>
> Key: SPARK-13266
> URL: https://issues.apache.org/jira/browse/SPARK-13266
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
> Environment: Linux standalone but probably applies to all
>Reporter: mathieu longtin
>  Labels: easyfix, patch
>
> If you do something like this:
> {code:none}
> tsv_loader = sqlContext.read.format('com.databricks.spark.csv')
> tsv_loader.options(quote=None, escape=None)
> {code}
> The loader sees the string "None" as the _quote_ and _escape_ options. The 
> loader should get a _null_.
> An easy fix is to modify *python/pyspark/sql/readwriter.py* near the top, 
> correcting the _to_str_ function. Here's the patch:
> {code:none}
> diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
> index a3d7eca..ba18d13 100644
> --- a/python/pyspark/sql/readwriter.py
> +++ b/python/pyspark/sql/readwriter.py
> @@ -33,10 +33,12 @@ __all__ = ["DataFrameReader", "DataFrameWriter"]
> 
>  def to_str(value):
>      """
> -    A wrapper over str(), but convert bool values to lower case string
> +    A wrapper over str(), but convert bool values to lower case string, and keep None
>      """
>      if isinstance(value, bool):
>          return str(value).lower()
> +    elif value is None:
> +        return value
>      else:
>          return str(value)
> 
> {code}
> This has been tested and works great.






[jira] [Reopened] (SPARK-13290) wholeTextFile and binaryFiles are really slow

2016-02-12 Thread mathieu longtin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mathieu longtin reopened SPARK-13290:
-

Slow relative to reading the exact same file on a local disk on the same 
machine. Python will read the same file *70 times* faster. I ran these tests a 
few times to make sure it's not a cache issue.

Point me to the code that reads files and maybe I can help. Just closing the 
bug doesn't mean the problem isn't there.

> wholeTextFile and binaryFiles are really slow
> -
>
> Key: SPARK-13290
> URL: https://issues.apache.org/jira/browse/SPARK-13290
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0
> Environment: Linux stand-alone
>Reporter: mathieu longtin
>
> Reading biggish files (175MB) with wholeTextFile or binaryFiles is extremely 
> slow. It takes 3 minutes in Java versus 2.5 seconds in Python.
> The Java process balloons to 4.3GB of memory and uses 100% CPU the whole 
> time. I suspect Spark reads it in small chunks and assembles it at the end, 
> hence the large amount of CPU.
> {code}
> In [49]: rdd = sc.binaryFiles(pathToOneFile)
> In [50]: %time path, text = rdd.first()
> CPU times: user 1.91 s, sys: 1.13 s, total: 3.04 s
> Wall time: 3min 32s
> In [51]: len(text)
> Out[51]: 191376122
> In [52]: %time text = open(pathToOneFile).read()
> CPU times: user 8 ms, sys: 691 ms, total: 699 ms
> Wall time: 2.43 s
> In [53]: len(text)
> Out[53]: 191376122
> {code}






[jira] [Created] (SPARK-13290) wholeTextFile and binaryFiles are really slow

2016-02-11 Thread mathieu longtin (JIRA)
mathieu longtin created SPARK-13290:
---

 Summary: wholeTextFile and binaryFiles are really slow
 Key: SPARK-13290
 URL: https://issues.apache.org/jira/browse/SPARK-13290
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.6.0
 Environment: Linux stand-alone
Reporter: mathieu longtin


Reading biggish files (175MB) with wholeTextFile or binaryFiles is extremely 
slow. It takes 3 minutes in Java versus 2.5 seconds in Python.

The Java process balloons to 4.3GB of memory and uses 100% CPU the whole time. 
I suspect Spark reads it in small chunks and assembles it at the end, hence 
the large amount of CPU.

{code}
In [49]: rdd = sc.binaryFiles(pathToOneFile)
In [50]: %time path, text = rdd.first()
CPU times: user 1.91 s, sys: 1.13 s, total: 3.04 s
Wall time: 3min 32s
In [51]: len(text)
Out[51]: 191376122
In [52]: %time text = open(pathToOneFile).read()
CPU times: user 8 ms, sys: 691 ms, total: 699 ms
Wall time: 2.43 s
In [53]: len(text)
Out[53]: 191376122
{code}








[jira] [Created] (SPARK-13266) Python DataFrameReader converts None to "None" instead of null

2016-02-10 Thread mathieu longtin (JIRA)
mathieu longtin created SPARK-13266:
---

 Summary: Python DataFrameReader converts None to "None" instead of 
null
 Key: SPARK-13266
 URL: https://issues.apache.org/jira/browse/SPARK-13266
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.6.0
 Environment: Linux standalone but probably applies to all
Reporter: mathieu longtin


If you do something like this:
{code:none}
tsv_loader = sqlContext.read.format('com.databricks.spark.csv')
tsv_loader.options(quote=None, escape=None)
{code}

The loader sees the string "None" as the _quote_ and _escape_ options. The 
loader should get a _null_.

An easy fix is to modify *python/pyspark/sql/readwriter.py* near the top, 
correcting the _to_str_ function. Here's the patch:

{code:none}
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index a3d7eca..ba18d13 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -33,10 +33,12 @@ __all__ = ["DataFrameReader", "DataFrameWriter"]
 
 def to_str(value):
     """
-    A wrapper over str(), but convert bool values to lower case string
+    A wrapper over str(), but convert bool values to lower case string, and keep None
     """
     if isinstance(value, bool):
         return str(value).lower()
+    elif value is None:
+        return value
     else:
         return str(value)
 
{code}

This has been tested and works great.
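
As a standalone illustration, a minimal sketch of the patched helper and the behavior it fixes (it mirrors the diff above, outside its readwriter.py context):
{code:none}
def to_str(value):
    """
    A wrapper over str(), but convert bool values to lower case string, and keep None
    """
    if isinstance(value, bool):
        return str(value).lower()
    elif value is None:
        return value
    else:
        return str(value)

print(to_str(True))    # 'true'
print(to_str("csv"))   # 'csv'
print(to_str(None))    # None, instead of the string "None"
{code}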


