Setting the vote rate in a Random Forest in MLlib

2015-12-16 Thread Young, Matthew T
One of our data scientists is interested in using Spark to improve performance on 
some random forest binary classifications, but with the parameters MLlib exposes she 
isn't getting results from its random forest implementation that match R's 
randomForest library. She suggested that if she could tune the vote rate of the 
forest (how many trees are required to vote true for a positive classification) she 
might be able to reach the false-positive and true-positive targets for the project.

Is there any way to set the vote rate for a random forest in Spark 1.5.2? I don't 
see any such option in the trainClassifier API.

Thanks,

-- Matthew


RE: is repartition very cost

2015-12-08 Thread Young, Matthew T
Shuffling large amounts of data over the network is expensive, yes. The cost is 
lower if you are just using a single node where no networking needs to be 
involved to do the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see if a repartition is worth 
the shuffle time.

A common model is to repartition the data once after ingest to achieve the desired 
parallelism, and then avoid further shuffles later whenever possible.
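For illustration, a minimal sketch of that pattern in Scala (the input path and 
partition count below are placeholders, not from the original thread):

// Repartition once, right after ingest, to spread the data across the cluster
val rawInput = sc.textFile("hdfs:///data/input")   // placeholder source
val partitioned = rawInput.repartition(48)         // e.g. a small multiple of total cores
partitioned.cache()                                // reuse without repeating the shuffle

// Later narrow transformations (map/filter) keep the partitioning and avoid
// further shuffles
val cleaned = partitioned.filter(_.nonEmpty).map(_.toLowerCase)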

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User 
Subject: is repartition very cost


Hi All,

I need to optimize an objective function with some linear constraints using a 
genetic algorithm, and I would like to get as much parallelism out of Spark as 
possible.

Repartition / shuffle may be needed at times; is the repartition API very costly?

Thanks in advance!
Zhiliang




RE: capture video with spark streaming

2015-11-30 Thread Young, Matthew T
Unless it’s a network camera with the ability to request specific frame numbers 
for read, the answer is that you will just read from the camera like you normally 
would without Spark, inside a foreachRDD(), and parallelize the result out for 
processing once you have it in a collection in the driver’s memory.

If the camera expects you to read continuously you will need to implement a 
custom receiver to constantly read from the camera and buffer the data until 
the next batch comes around.
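A minimal Scala skeleton of such a receiver (readFrame below is a hypothetical 
placeholder for whatever blocking read your camera library provides):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CameraReceiver(cameraUrl: String)
  extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Read on a separate thread so that onStart returns immediately
    new Thread("Camera Receiver") {
      override def run(): Unit = {
        // Open the camera here with whatever client library you use, e.g.
        // val camera = YourCameraLib.connect(cameraUrl)   // hypothetical
        while (!isStopped()) {
          val frame = readFrame()   // hypothetical blocking read of one frame
          store(frame)              // buffer the frame until the next batch interval
        }
      }
    }.start()
  }

  override def onStop(): Unit = {}  // the read loop exits once isStopped() is true

  // Placeholder: replace with a real read from your camera client
  private def readFrame(): Array[Byte] = Array.empty[Byte]
}

// Usage sketch:
// val frames = ssc.receiverStream(new CameraReceiver("rtsp://camera-host/stream"))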


From: Lavallen Pablo [mailto:intoe...@yahoo.com.ar]
Sent: Monday, November 30, 2015 5:07 PM
To: User 
Subject: capture video with spark streaming


Hello! Can anyone please guide me on how to capture video from a camera with Spark 
Streaming? Any articles or books you can recommend?

thank you





RE: How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-12 Thread Young, Matthew T
You can use foreachRDD to get access to the batch API in streaming jobs.
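For example, a sketch against the Scala API (PySpark's DStream.foreachRDD and 
RDD.sortBy take the same shape); runningCounts stands for the DStream produced by 
updateStateByKey:

runningCounts.foreachRDD { rdd =>
  // Inside foreachRDD we have an ordinary RDD, so the full batch API is available
  val topCounts = rdd.sortBy(_._2, ascending = false).take(10)
  topCounts.foreach(println)
}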


From: Amir Rahnama [mailto:amirrahn...@gmail.com]
Sent: Thursday, November 12, 2015 12:11 AM
To: ayan guha 
Cc: user 
Subject: Re: How can you sort wordcounts by counts in 
stateful_network_wordcount.py example

Hi Ayan,

Thanks for help,

Your example is not the streaming case; there we don't have sortByKey on the DStream.



On Wed, Nov 11, 2015 at 11:35 PM, ayan guha wrote:
how about this?

sorted = running_counts.map(lambda t: (t[1], t[0])).sortByKey()

Basically swap key and value of the RDD and then sort?

On Thu, Nov 12, 2015 at 8:53 AM, Amir Rahnama wrote:
Hey,

Does anybody know how one can sort the result in the stateful example?

Python would be preferred.

https://github.com/apache/spark/blob/859dff56eb0f8c63c86e7e900a12340c199e6247/examples/src/main/python/streaming/stateful_network_wordcount.py



--
Best Regards,
Ayan Guha



--
Thanks and Regards,

Amir Hossein Rahnama

Tel: +46 (0) 761 681 102
Website: www.ambodi.com
Twitter: @_ambodi


RE: Very slow performance on very small record counts

2015-11-03 Thread Young, Matthew T
+user to potentially help others

Cody,

Thanks for calling out isEmpty, I didn’t realize that it was so dangerous. 
Taking that out and just reusing the count has eliminated the issue, and now 
the cluster is happily eating 400,000 record batches.

For completeness’ sake: I am using the direct stream API.
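For reference, the change amounted to something like the following sketch 
(GeosRUSECs is the cached pair RDD from the code quoted below):

// Before: isEmpty does a take(1), which can schedule extra jobs if the first
// partitions it grabs happen to be empty.
// if (!GeosRUSECs.isEmpty) { ... GeosRUSECs.count ... }

// After: count once, reuse the result, and skip the batch when it is zero.
val recordCount = GeosRUSECs.count
if (recordCount > 0) {
  val totalRUSEC = GeosRUSECs.map(_._2._1).reduce(_ + _)
  val avgRUSEC = totalRUSEC / recordCount.toDouble
  // ... the rest of the batch processing is unchanged
}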

From: Cody Koeninger [mailto:c...@koeninger.org]
Sent: Saturday, October 31, 2015 2:00 PM
To: YOUNG, MATTHEW, T (Intel Corp) <matthew.t.yo...@intel.com>
Subject: Re: Very slow performance on very small record counts

Have you looked at jstack or the thread dump from the spark ui during that time 
to see what's happening?

Are you using receiver based stream or direct stream?

The only odd thing I notice about your code is that you're calling isEmpty, 
which will do a take(), which can end up scheduling multiple times if it 
initially grabs empty partitions.  You're counting the rdd anyway, so why not 
just do count() first?



On Fri, Oct 30, 2015 at 5:38 PM, Young, Matthew T 
<matthew.t.yo...@intel.com> wrote:
In a job I am writing I have encountered impossibly poor performance with Spark 
Streaming 1.5.1. The environment is three 16-core / 32 GB RAM VMs.

The job involves parsing 600 bytes or so of JSON (per record) from Kafka, 
extracting two values from the JSON, doing some aggregation and averages, and 
writing a handful of summary results back to Kafka each two-second batch.

The issue is that Spark seems to be hitting a hard minimum of 4 seconds to 
process each batch, even a batch with as few as 192 records in it!

When I check the Web UI for such a batch I see a proper distribution of offsets 
(each worker gets < 10 records) and four jobs for the batch. Three of the jobs 
are completed very quickly (as I would expect), but one job essentially 
dominates the 4 seconds. This WebUI screenshot is presented in the attachment 
“Spark Idle Time 2.png”.

When I drill down into that job and look at the event timeline I see very odd 
behavior. The duration for the longest task is only ~0.3 s, and there is nice 
parallelism. What seems to be happening is that right at the start of the job there 
are a few milliseconds of deserialization, followed by almost 4s (!) of doing 
absolutely nothing, followed by a few hundred milliseconds where the actual 
processing is taking place. This WebUI screenshot is presented in the 
attachment “Spark Idle Time.png”

What can cause this delay where Spark does nothing (or reports doing nothing) 
for so much time? I have included the code corresponding to the foreachRDD that 
is triggering this 4 second job below.

Thank you for your time,

-- Matthew


// Transmit Kafka config to all workers so they can write back as necessary
val broadcastBrokers = ssc.sparkContext.broadcast(brokerList)
val broadcastZookeeper = ssc.sparkContext.broadcast(zookeeper)

// Define the task for Spark Streaming to do
messages.foreachRDD(sourceStream => {

  // Parse the JSON into a more usable structure
  val parsed = sourceStream.map(y => parse(y._2))

  val GeosRUSECs = parsed.map { x =>
    ((x \ "geoip").extract[String](DefaultFormats, manifest[String]),
      ((x \ "rusec").extract[Long](DefaultFormats, manifest[Long]), 1L))
  }.cache

  if (!GeosRUSECs.isEmpty) {
    val totalRUSEC = GeosRUSECs.map(x => x._2._1).reduce(_ + _)
    val avgRUSEC = totalRUSEC / GeosRUSECs.count.toDouble

    if (!avgRUSEC.isNaN && !avgRUSEC.isInfinite) {
      // Acquire local Kafka connection
      val producer = kafkaWriter.getProducer(broadcastBrokers.value,
        broadcastZookeeper.value)
      producer.send(new KeyedMessage[String, String]("SparkRUSECout",
        avgRUSEC.toString))
    }

    // Wait times for each geo with total wait and number of queries
    val GeosWaitsCounts = GeosRUSECs.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

    val avgRespPerGeo = GeosWaitsCounts.map {
      case (geo, (totWait, numQueries)) => (geo, totWait.toDouble / numQueries)
    }

    avgRespPerGeo.foreach { geoInfo =>
      val producer = kafkaWriter.getProducer(broadcastBrokers.value,
        broadcastZookeeper.value)
      producer.send(new KeyedMessage[String, String]("SparkRUSECout",
        geoInfo._1 + " average RUSEC: " + geoInfo._2))
    }
  }
})






RE: Pulling data from a secured SQL database

2015-10-30 Thread Young, Matthew T
> Can the driver pull data and then distribute execution?



Yes, as long as your dataset will fit in the driver's memory. Execute arbitrary 
code to read the data on the driver as you normally would if you were writing a 
single-node application. Once you have the data in a collection in the driver's 
memory you can call sc.parallelize(data) to distribute the data out to the workers 
for parallel processing as an RDD. You can then convert to a DataFrame if that is 
more appropriate for your workflow.
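A minimal sketch of that pattern in Scala (the connection string, query, and Record 
case class are hypothetical placeholders; plain java.sql JDBC is used because only 
the driver machine can reach the database):

import java.sql.DriverManager
import scala.collection.mutable.ArrayBuffer

case class Record(id: Int, name: String)   // hypothetical row shape

// Hypothetical connection string for the integrated-security SQL Server setup
val url = "jdbc:sqlserver://dbhost;databaseName=mydb;integratedSecurity=true"
val rows = new ArrayBuffer[Record]()

val conn = DriverManager.getConnection(url)
try {
  val rs = conn.createStatement().executeQuery("SELECT id, name FROM some_table")
  while (rs.next()) {
    rows += Record(rs.getInt("id"), rs.getString("name"))   // collect into driver memory
  }
} finally {
  conn.close()
}

// Distribute the collected rows to the workers for parallel processing
val rdd = sc.parallelize(rows)
val df = sqlContext.createDataFrame(rdd)   // optional: continue as a DataFrame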





-Original Message-
From: Thomas Ginter [mailto:thomas.gin...@utah.edu]
Sent: Friday, October 30, 2015 10:49 AM
To: user@spark.apache.org
Subject: Pulling data from a secured SQL database



I am working in an environment where data is stored in MS SQL Server.  It has 
been secured so that only a specific set of machines can access the database 
through an integrated security Microsoft JDBC connection.  We also have a 
couple of beefy linux machines we can use to host a Spark cluster but those 
machines do not have access to the databases directly.  How can I pull the data 
from the SQL database on the smaller development machine and then have it 
distributed to the Spark cluster for processing?  Can the driver pull data and 
then distribute execution?



Thanks,



Thomas Ginter

801-448-7676

thomas.gin...@utah.edu





RE: save DF to JDBC

2015-10-05 Thread Young, Matthew T
I’ve gotten it to work with SQL Server (with limitations; it’s buggy and 
doesn’t work with some types/operations).

DataFrameWriter 
(https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html) 
is the Java API you are looking for; its jdbc method lets you write to JDBC 
databases.

I haven’t tried Oracle database, but I would expect it to work at least 
somewhat.
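Roughly, the write side looks like this (a sketch; the URL, table name, and 
credentials are placeholders):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark_user")   // placeholder credentials
props.setProperty("password", "secret")

// Append the DataFrame's rows to a table over JDBC
df.write
  .mode("append")
  .jdbc("jdbc:sqlserver://dbhost;databaseName=mydb", "dbo.target_table", props)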

From: Ruslan Dautkhanov [mailto:dautkha...@gmail.com]
Sent: Monday, October 05, 2015 2:44 PM
To: user 
Subject: save DF to JDBC

http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases

Spark JDBC can read data from JDBC, but can it save back to JDBC?
Like to an Oracle database through its jdbc driver.

Also looked at SQL Context documentation
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/SQLContext.html
and can't find anything relevant.

Thanks!


--
Ruslan Dautkhanov


Reasonable performance numbers?

2015-09-24 Thread Young, Matthew T
Hello,

I am doing performance testing with Spark Streaming. I want to know if the 
throughput numbers I am encountering are reasonable for the power of my cluster 
and Spark's performance characteristics.

My job has the following processing steps:

1.  Read 600-byte JSON strings from a 7-broker / 48-partition Kafka cluster 
via the Kafka Direct API

2.  Parse the JSON with play-json or lift-json (no significant performance 
difference)

3.  Read one integer value out of the JSON

4.  Compute the average of this integer value across all records in the 
batch with DoubleRDD.mean

5.  Write the average for the batch back to a different Kafka topic

I have tried 2, 4, and 10 second batch intervals. The best throughput I can 
sustain is about 75,000 records/second for the whole cluster.

The Spark cluster is in a VM environment with 3 VMs. Each VM has 32 GB of RAM 
and 16 cores. The systems are networked with 10 GB NICs. I started testing with 
Spark 1.3.1 and switched to Spark 1.5 to see if there was improvement (none 
significant). When I look at the event timeline in the WebUI I see that the 
majority of the processing time for each batch is "Executor Computing Time" in 
the foreachRDD that computes the average, not the transform that does the JSON 
parsing.

CPU util hovers around 40% across the cluster, and RAM has plenty of free space 
remaining as well. Network comes nowhere close to being saturated.

My colleague implementing similar functionality in Storm is able to exceed a 
quarter million records per second with the same hardware.

Is 75K records/second reasonable for a cluster of this size? What kind of 
performance would you expect for this job?


Thanks,

-- Matthew


Getting number of physical machines in Spark

2015-08-27 Thread Young, Matthew T
What's the canonical way to find out the number of physical machines in a 
cluster at runtime in Spark? I believe SparkContext.defaultParallelism will 
give me the number of cores, but I'm interested in the number of NICs.

I'm writing a Spark streaming application to ingest from Kafka with the 
Receiver API and want to create one DStream per physical machine for read 
parallelism's sake. How can I figure out at run time how many machines there 
are so I know how many DStreams to create?


RE: How to read a Json file with a specific format?

2015-07-29 Thread Young, Matthew T
{"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2}]}
{"IFAM":"EQR","KTM":143000640,"COL":22,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2},{"MLrate":30,"Nrout":0,"up":null,"Crate":2}]}





From: mélanie gallois [melanie.galloi...@gmail.com]
Sent: Wednesday, July 29, 2015 8:10 AM
To: Young, Matthew T
Cc: user@spark.apache.org
Subject: Re: How to read a Json file with a specific format?

Can you give an example with my extract?

Mélanie Gallois

2015-07-29 16:55 GMT+02:00 Young, Matthew T <matthew.t.yo...@intel.com>:
The built-in Spark JSON functionality cannot read normal JSON arrays. The 
format it expects is a bunch of individual JSON objects without any outer array 
syntax, with one complete JSON object per line of the input file.

AFAIK your options are to read the JSON in the driver and parallelize it out to 
the workers or to fix your input file to match the spec.

For one-off conversions I usually use a combination of jq and regex-replaces to 
get the source file in the right format.


From: SparknewUser [melanie.galloi...@gmail.com]
Sent: Wednesday, July 29, 2015 6:37 AM
To: user@spark.apache.org
Subject: How to read a Json file with a specific format?

I'm trying to read a Json file which is like :
[
{"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
]}
,{"IFAM":"EQR","KTM":143000640,"COL":22,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
]}
]

I've tried the command:
val df = sqlContext.read.json(namefile)
df.show()


But this does not work: my columns are not recognized...










--
Mélanie




RE: IP2Location within spark jobs

2015-07-29 Thread Young, Matthew T
You can put the database files in a central location accessible to all the 
workers and build the GeoIP object once per-partition when you go to do a 
mapPartitions across your dataset, loading from the central location.
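Something along these lines (a sketch only; ipAddresses, the shared path, and the 
GeoIpDatabase/lookupCountry wrapper are hypothetical stand-ins for your RDD and for 
whatever the MaxMind Java API actually exposes):

val dbPath = "/mnt/shared/GeoIP2-City.mmdb"   // central location every worker can read

val withCountry = ipAddresses.mapPartitions { ips =>
  // Build the lookup object once per partition rather than once per record
  val reader = GeoIpDatabase.open(dbPath)     // hypothetical wrapper around the MaxMind reader
  ips.map(ip => (ip, reader.lookupCountry(ip)))
}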


___

From: Filli Alem [alem.fi...@ti8m.ch]
Sent: Wednesday, July 29, 2015 12:04 PM
To: user@spark.apache.org
Subject: IP2Location within spark jobs

Hi,
 
I would like to use ip2Location databases during my spark jobs (MaxMind).
So far I haven’t found a way to properly serialize the database object offered by 
the MaxMind Java API.

The CSV version isn’t easy to handle, as it consists of multiple files.
 
Any recommendations on how to do this?
 
Thanks
Alem
 



RE: How to read a Json file with a specific format?

2015-07-29 Thread Young, Matthew T
The built-in Spark JSON functionality cannot read normal JSON arrays. The 
format it expects is a bunch of individual JSON objects without any outer array 
syntax, with one complete JSON object per line of the input file.

AFAIK your options are to read the JSON in the driver and parallelize it out to 
the workers or to fix your input file to match the spec.

For one-off conversions I usually use a combination of jq and regex-replaces to 
get the source file in the right format.
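If you would rather do the conversion in Spark itself, a driver-side sketch with 
json4s (which ships with Spark); the path is a placeholder and this assumes the 
whole file fits comfortably in driver memory:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.io.Source

// Parse the whole JSON array on the driver, then emit one compact object per element
val raw = Source.fromFile("/path/to/namefile.json").mkString   // placeholder path
val oneObjectPerLine = parse(raw).children.map(obj => compact(render(obj)))

// Distribute the strings and let Spark infer the schema from them
val jsonRDD = sc.parallelize(oneObjectPerLine)
val df = sqlContext.read.json(jsonRDD)   // read.json also accepts an RDD[String]
df.show()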


From: SparknewUser [melanie.galloi...@gmail.com]
Sent: Wednesday, July 29, 2015 6:37 AM
To: user@spark.apache.org
Subject: How to read a Json file with a specific format?

I'm trying to read a Json file which is like :
[
{"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
]}
,{"IFAM":"EQR","KTM":143000640,"COL":22,"DATA":[{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
,{"MLrate":30,"Nrout":0,"up":null,"Crate":2}
]}
]

I've tried the command:
val df = sqlContext.read.json(namefile)
df.show()


But this does not work: my columns are not recognized...











RE: Issue with column named count in a DataFrame

2015-07-23 Thread Young, Matthew T
Thanks Michael, using backticks resolves the issue.

Wouldn't this fix also be something that should go into Spark 1.4.2, or at 
least have the limitation noted in the documentation?



From: Michael Armbrust [mich...@databricks.com]
Sent: Wednesday, July 22, 2015 4:26 PM
To: Young, Matthew T
Cc: user@spark.apache.org
Subject: Re: Issue with column named count in a DataFrame

Additionally have you tried enclosing count in `backticks`?

On Wed, Jul 22, 2015 at 4:25 PM, Michael Armbrust <mich...@databricks.com> wrote:
I believe this will be fixed in Spark 1.5

https://github.com/apache/spark/pull/7237

On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T <matthew.t.yo...@intel.com> wrote:
I'm trying to do some simple counting and aggregation in an IPython notebook 
with Spark 1.4.0 and I have encountered behavior that looks like a bug.

When I try to filter rows out of an RDD with a column name of count I get a 
large error message. I would just avoid naming things count, except for the 
fact that this is the default column name created with the count() operation in 
pyspark.sql.GroupedData

The small example program below demonstrates the issue.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
counts = dataFrame.groupBy('title').count()
counts.filter("title = 'foo'").show() # Works
counts.filter("count > 1").show() # Errors out


I can even reproduce the issue in a PySpark shell session by entering these 
commands.

I suspect that the error has something to do with Spark wanting to call the 
count() function in place of looking at the count column.

The error message is as follows:


Py4JJavaError Traceback (most recent call last)
<ipython-input-29-62a1b7c71f21> in <module>()
----> 1 counts.filter("count > 1").show() # Errors Out

C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc
 in filter(self, condition)
774 
775 if isinstance(condition, basestring):
-- 776 jdf = self._jdf.filter(condition)
777 elif isinstance(condition, Column):
778 jdf = self._jdf.filter(condition._jc)

C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
-- 538 self.target_id, self.name)

539
540 for temp_arg in temp_args:

C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, 
gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
-- 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o229.filter.
: java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found

count > 1
      ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)



Is there a recommended workaround to the inability to filter on a column named 
count? Do I have to make a new DataFrame and rename the column just to work 
around this bug? What's the best way to do that?

Thanks,

-- Matthew Young








Issue with column named count in a DataFrame

2015-07-22 Thread Young, Matthew T
I'm trying to do some simple counting and aggregation in an IPython notebook 
with Spark 1.4.0 and I have encountered behavior that looks like a bug.

When I try to filter rows out of an RDD with a column name of count I get a 
large error message. I would just avoid naming things count, except for the 
fact that this is the default column name created with the count() operation in 
pyspark.sql.GroupedData

The small example program below demonstrates the issue.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
counts = dataFrame.groupBy('title').count()
counts.filter("title = 'foo'").show() # Works
counts.filter("count > 1").show() # Errors out


I can even reproduce the issue in a PySpark shell session by entering these 
commands.

I suspect that the error has something to do with Spark wanting to call the 
count() function in place of looking at the count column.

The error message is as follows:


Py4JJavaError Traceback (most recent call last)
<ipython-input-29-62a1b7c71f21> in <module>()
----> 1 counts.filter("count > 1").show() # Errors Out

C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc
 in filter(self, condition)
774 
775 if isinstance(condition, basestring):
-- 776 jdf = self._jdf.filter(condition)
777 elif isinstance(condition, Column):
778 jdf = self._jdf.filter(condition._jc)

C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
-- 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:

C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, 
gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
-- 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling o229.filter.
: java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found

count > 1
      ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)



Is there a recommended workaround to the inability to filter on a column named 
count? Do I have to make a new DataFrame and rename the column just to work 
around this bug? What's the best way to do that?

Thanks,

-- Matthew Young




RE: Would driver shutdown cause app dead?

2015-07-21 Thread Young, Matthew T
ZhuGe,

If you run your program in the cluster deploy-mode you get resiliency against 
driver failure, though there are some steps you have to take in how you write 
your streaming job to allow for transparent resume. Netflix did a nice writeup 
of this resiliency here: 
http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html.

If you tie in ZooKeeper you can also get resiliency against Master failure, 
which has some documentation 
here: http://spark.apache.org/docs/1.4.0/spark-standalone.html.

Regards,

-- Matthew


From: ZhuGe [t...@outlook.com]
Sent: Tuesday, July 21, 2015 3:07 AM
To: user@spark.apache.org
Subject: Would driver shutdown cause app dead?

Hi all:
I am a bit confused about the role of the driver.
In our production environment, we have a Spark Streaming app running in standalone 
mode. What we are concerned about is: if the driver shuts down accidentally (host 
shutdown or whatever), would the app keep running normally?

Any explanation would be appreciated!!

Cheers




RE: Spark and SQL Server

2015-07-20 Thread Young, Matthew T
When attempting to write a DataFrame to SQL Server that contains 
java.sql.Timestamp or java.lang.Boolean objects I get errors about the query 
that is formed being invalid. Specifically, java.sql.Timestamp objects try to 
be written as the Timestamp type, which is not appropriate for date/time 
storage in TSQL; they should be datetimes or datetime2s.

java.lang.Boolean errors out because Spark tries to specify the width of the BIT 
field, which SQL Server doesn't like.

However, I can write strings/varchars and ints without any issues.



From: Davies Liu [dav...@databricks.com]
Sent: Monday, July 20, 2015 9:08 AM
To: Young, Matthew T
Cc: user@spark.apache.org
Subject: Re: Spark and SQL Server

Sorry for the confusing. What's the other issues?

On Mon, Jul 20, 2015 at 8:26 AM, Young, Matthew T
matthew.t.yo...@intel.com wrote:
 Thanks Davies, that resolves the issue with Python.

 I was using the Java/Scala DataFrame documentation 
 https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrameWriter.html
  and assuming that it was the same for PySpark 
 http://spark.apache.org/docs/1.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.
  I will keep this distinction in mind going forward.

 I guess we have to wait for Microsoft to release an SQL Server connector for 
 Spark to resolve the other issues.

 Cheers,

 -- Matthew Young

 
 From: Davies Liu [dav...@databricks.com]
 Sent: Saturday, July 18, 2015 12:45 AM
 To: Young, Matthew T
 Cc: user@spark.apache.org
 Subject: Re: Spark and SQL Server

 I think you have a mistake in calling jdbc(); it should be:

 jdbc(self, url, table, mode, properties)

 You had used properties as the third parameter.

 On Fri, Jul 17, 2015 at 10:15 AM, Young, Matthew T
 matthew.t.yo...@intel.com wrote:
 Hello,

 I am testing Spark interoperation with SQL Server via JDBC with Microsoft’s 
 4.2 JDBC Driver. Reading from the database works ok, but I have encountered 
 a couple of issues writing back. In Scala 2.10 I can write back to the 
 database except for a couple of types.


 1.  When I read a DataFrame from a table that contains a datetime column 
 it comes in as a java.sql.Timestamp object in the DataFrame. This is alright 
 for Spark purposes, but when I go to write this back to the database with 
 df.write.jdbc(…) it errors out because it is trying to write the TimeStamp 
 type to SQL Server, which is not a date/time storing type in TSQL. I think 
 it should be writing a datetime, but I’m not sure how to tell Spark this.



 2.  A related misunderstanding happens when I try to write a 
 java.lang.boolean to the database; it errors out because Spark is trying to 
 specify the width of the bit type, which is illegal in SQL Server (error 
 msg: Cannot specify a column width on data type bit). Do I need to edit 
 Spark source to fix this behavior, or is there a configuration option 
 somewhere that I am not aware of?


 When I attempt to write back to SQL Server in an IPython notebook, py4j 
 seems unable to convert a Python dict into a Java hashmap, which is 
 necessary for parameter passing. I’ve documented details of this problem 
 with code examples here: 
 http://stackoverflow.com/questions/31417653/java-util-hashmap-missing-in-pyspark-session.
  Any advice would be appreciated.

 Thank you for your time,

 -- Matthew Young






RE: Spark and SQL Server

2015-07-20 Thread Young, Matthew T
Thanks Davies, that resolves the issue with Python.

I was using the Java/Scala DataFrame documentation 
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrameWriter.html
 and assuming that it was the same for PySpark 
http://spark.apache.org/docs/1.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.
 I will keep this distinction in mind going forward.

I guess we have to wait for Microsoft to release an SQL Server connector for 
Spark to resolve the other issues.

Cheers,

-- Matthew Young


From: Davies Liu [dav...@databricks.com]
Sent: Saturday, July 18, 2015 12:45 AM
To: Young, Matthew T
Cc: user@spark.apache.org
Subject: Re: Spark and SQL Server

I think you have a mistake in calling jdbc(); it should be:

jdbc(self, url, table, mode, properties)

You had used properties as the third parameter.

On Fri, Jul 17, 2015 at 10:15 AM, Young, Matthew T
matthew.t.yo...@intel.com wrote:
 Hello,

 I am testing Spark interoperation with SQL Server via JDBC with Microsoft’s 
 4.2 JDBC Driver. Reading from the database works ok, but I have encountered a 
 couple of issues writing back. In Scala 2.10 I can write back to the database 
 except for a couple of types.


 1.  When I read a DataFrame from a table that contains a datetime column 
 it comes in as a java.sql.Timestamp object in the DataFrame. This is alright 
 for Spark purposes, but when I go to write this back to the database with 
 df.write.jdbc(…) it errors out because it is trying to write the TimeStamp 
 type to SQL Server, which is not a date/time storing type in TSQL. I think it 
 should be writing a datetime, but I’m not sure how to tell Spark this.



 2.  A related misunderstanding happens when I try to write a 
 java.lang.boolean to the database; it errors out because Spark is trying to 
 specify the width of the bit type, which is illegal in SQL Server (error msg: 
 Cannot specify a column width on data type bit). Do I need to edit Spark 
 source to fix this behavior, or is there a configuration option somewhere 
 that I am not aware of?


 When I attempt to write back to SQL Server in an IPython notebook, py4j seems 
 unable to convert a Python dict into a Java hashmap, which is necessary for 
 parameter passing. I’ve documented details of this problem with code examples 
 here: http://stackoverflow.com/questions/31417653/java-util-hashmap-missing-in-pyspark-session.
  Any advice would be appreciated.

 Thank you for your time,

 -- Matthew Young






Spark and SQL Server

2015-07-17 Thread Young, Matthew T
Hello,

I am testing Spark interoperation with SQL Server via JDBC with Microsoft’s 4.2 
JDBC Driver. Reading from the database works ok, but I have encountered a 
couple of issues writing back. In Scala 2.10 I can write back to the database 
except for a couple of types.


1.  When I read a DataFrame from a table that contains a datetime column it 
comes in as a java.sql.Timestamp object in the DataFrame. This is alright for 
Spark purposes, but when I go to write this back to the database with 
df.write.jdbc(…) it errors out because it is trying to write the TimeStamp type 
to SQL Server, which is not a date/time storing type in TSQL. I think it should 
be writing a datetime, but I’m not sure how to tell Spark this.



2.  A related misunderstanding happens when I try to write a 
java.lang.Boolean to the database; it errors out because Spark is trying to 
specify the width of the bit type, which is illegal in SQL Server (error msg: 
Cannot specify a column width on data type bit). Do I need to edit Spark source 
to fix this behavior, or is there a configuration option somewhere that I am 
not aware of?


When I attempt to write back to SQL Server in an IPython notebook, py4j seems 
unable to convert a Python dict into a Java hashmap, which is necessary for 
parameter passing. I’ve documented details of this problem with code examples 
here: http://stackoverflow.com/questions/31417653/java-util-hashmap-missing-in-pyspark-session.
 Any advice would be appreciated.

Thank you for your time,

-- Matthew Young
