Re: is there a way to persist the lineages generated by spark?

2017-04-06 Thread Jörn Franke
I do think this is the right way; you will have to test with test data, verifying that the actual output of the calculation matches the expected output. Even if the logical plan is correct, your calculation might not be. E.g. there can be bugs in Spark, in the UI, or (very often) in the client.
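A minimal sketch of that kind of verification, assuming a toy per-key sum as the calculation under test (the dataset, names, and aggregation are illustrative, not from the thread):

    import org.apache.spark.sql.SparkSession

    object CalculationCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("verify-calculation")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Known input with a hand-computed expected result.
        val input = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
        val expected = Map("a" -> 3L, "b" -> 3L)

        // The calculation under test: a trivial per-key sum.
        val actual = input.groupBy("key").sum("value")
          .collect()
          .map(r => r.getString(0) -> r.getLong(1))
          .toMap

        // Compare actual against expected output, independently of
        // whether the logical plan looks right.
        assert(actual == expected, s"expected $expected, got $actual")
        spark.stop()
      }
    }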

[Discuss][Spark staging dir] way to disable spark writing to _temporary

2017-04-06 Thread Yash Sharma
Hi All, this is another issue I was facing with Spark-S3 interoperability, and I wanted to ask the broader community whether anyone else has hit it. I have a rather simple aggregation query with a basic transformation. The output, however, has a lot of output partitions (20K partitions). The Spark …
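For context, one mitigation often suggested for this class of problem (an assumption here, not necessarily the thread's resolution) is switching to the v2 FileOutputCommitter algorithm, which commits task output into place at task commit instead of renaming everything out of _temporary at job commit; a sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-output-committer")
      // v2 commits each task's files at task commit time, reducing the
      // large rename out of _temporary at job commit (it does not remove
      // the consistency caveats of S3 renames).
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // Speculative duplicate tasks can race on direct-style commits.
      .config("spark.speculation", "false")
      .getOrCreate()

    // Placeholder paths: a simple aggregation written out with many
    // partitions, as described in the post.
    spark.read.parquet("s3a://bucket/input")
      .groupBy("key").count()
      .write.parquet("s3a://bucket/output")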

[Streaming][Kinesis] Please review the kinesis-spark hard codings pull request

2017-04-06 Thread Yash Sharma
Hi fellow Spark devs, if anyone here has some experience with Spark Kinesis streaming, would it be possible to share your thoughts on this pull request [1]? Some info: the patch removes two important hard-coded values for Kinesis retries and will make Kinesis recovery from crashes more reliable.
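For readers without the patch at hand, the general shape of such a change looks roughly like the sketch below; the config keys, defaults, and class name are hypothetical, not the actual contents of [1]:

    import org.apache.spark.SparkConf

    // Hypothetical illustration: hard-coded retry constants replaced by
    // user-tunable settings read from the Spark configuration.
    class RetryingCaller(conf: SparkConf) {
      private val maxAttempts =
        conf.getInt("spark.streaming.kinesis.retry.maxAttempts", 3)
      private val baseWaitMs =
        conf.getLong("spark.streaming.kinesis.retry.waitTimeMs", 100L)

      def withRetries[T](op: => T): T = {
        var attempt = 1
        while (true) {
          try return op
          catch {
            case e: Exception if attempt < maxAttempts =>
              // Exponential backoff between attempts.
              Thread.sleep(baseWaitMs << (attempt - 1))
              attempt += 1
          }
        }
        throw new IllegalStateException("unreachable")
      }
    }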

Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Fred Reiss
If you just want to emulate pushing down a join, you can wrap the IN-list query in a JDBCRelation directly:

    scala> val r_df = spark.read.format("jdbc").option("url", "jdbc:h2:/tmp/testdb").option("dbtable", "R").load()
    r_df: org.apache.spark.sql.DataFrame = [A: int]
    scala> r_df.show
    …
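A more complete, self-contained sketch of that approach (the H2 URL, table R, and column A come from the snippet above; the fact table FACTS is an assumption):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("jdbc-in-pushdown")
      .master("local[*]")
      .getOrCreate()

    // Load the lookup table R (holding the IN-list values) as a JDBC
    // relation, as in the snippet above.
    val r_df = spark.read.format("jdbc")
      .option("url", "jdbc:h2:/tmp/testdb")
      .option("dbtable", "R")
      .load()

    // Load the fact table the same way and join instead of writing a
    // literal IN list; both sides live in the same database, so the
    // optimizer can push work down to the source.
    val facts = spark.read.format("jdbc")
      .option("url", "jdbc:h2:/tmp/testdb")
      .option("dbtable", "FACTS")
      .load()

    facts.join(r_df, Seq("A")).show()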

Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Maciej Bryński
2017-04-06 4:00 GMT+02:00 Michael Segel: > Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it? The answer is predicate pushdown. In my case I'm using this kind of query on a JDBC table, and IN …
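A sketch of the trade-off being discussed, with placeholder names: an IN predicate can be pushed down to the JDBC source as a WHERE clause, while a join against a Spark-side temp table is evaluated in Spark after the whole table has been fetched:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("in-vs-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Placeholder JDBC source.
    val facts = spark.read.format("jdbc")
      .option("url", "jdbc:h2:/tmp/testdb")
      .option("dbtable", "FACTS")
      .load()

    // Variant 1: IN list. Spark can translate the In filter into the
    // JDBC query's WHERE clause (visible as PushedFilters in the plan).
    val byIn = facts.filter($"A".isin((1 to 10000): _*))
    byIn.explain()

    // Variant 2: join against a Spark-side table of keys. The 10K values
    // never reach the database, so the full table is read before joining.
    val ids = (1 to 10000).toDF("A")
    val byJoin = facts.join(ids, Seq("A"))
    byJoin.explain()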